PlaidWeb / Publ

Flexible publishing system for the web
http://publ.beesbuzz.biz/
MIT License

Duplicate database entries for the same asset file #537

Closed: fluffy-critter closed this issue 1 year ago

fluffy-critter commented 1 year ago

Expected Behavior

Asset files should only ever be indexed under their canonical, relative path.
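
For illustration, one way to guarantee this would be to normalize every path before it reaches the indexer. This is a minimal sketch, not Publ's actual code: `canonical_asset_path` and `base_dir` are hypothetical names, and it assumes POSIX-style paths with the site directory as the base.

```python
import os


def canonical_asset_path(path: str, base_dir: str) -> str:
    """Return `path` expressed relative to `base_dir`.

    Hypothetical helper: accepts either an absolute path or a path that is
    already relative to `base_dir`, and yields the same result for both.
    """
    # os.path.join discards base_dir when `path` is already absolute
    absolute = os.path.normpath(os.path.join(base_dir, path))
    return os.path.relpath(absolute, base_dir)


relative_form = canonical_asset_path("content/blog/a.svg", "/srv/site")
absolute_form = canonical_asset_path("/srv/site/content/blog/a.svg", "/srv/site")
```

Both spellings of the same file then collapse to a single index key, so a stray absolute path would update the existing record instead of creating a second one.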

Current Behavior

Sometimes an asset file gets reindexed under its full (absolute) path rather than its local-relative path. This creates multiple asset entries with the same identifier, which then raises an exception at retrieval time:

INFO:Thread-343 (process_request_thread):werkzeug:127.0.0.1 - - [20/Aug/2023 12:39:38] "GET /_file/0acf5/20230820%20basement%20layout%20aspiration.svg HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 2213, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 2193, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "<string>", line 2, in retrieve_asset
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/orm/core.py", line 519, in new_func
    result = func(*args, **kwargs)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/publ/rendering.py", line 594, in retrieve_asset
    record = model.Image.get(asset_name=filename)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/orm/core.py", line 4007, in get
    try: return entity._find_one_(kwargs)  # can throw MultipleObjectsFoundError
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/orm/core.py", line 4114, in _find_one_
    if obj is None: obj = entity._find_in_db_(avdict, unique, for_update, nowait, skip_locked)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/orm/core.py", line 4174, in _find_in_db_
    objects = entity._fetch_objects(cursor, attr_offsets, 1, for_update, avdict)
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/orm/core.py", line 4294, in _fetch_objects
    if max_fetch_count == 1: throw(MultipleObjectsFoundError,
  File "/Users/fluffy/Library/Caches/pypoetry/virtualenvs/beesbuzz-biz-NFomSJ_p-py3.10/lib/python3.10/site-packages/pony/utils/utils.py", line 99, in throw
    raise exc
pony.orm.core.MultipleObjectsFoundError: Multiple objects were found. Use Image.select(...) to retrieve them

When the database is in this state, it contains duplicate asset entries, e.g.:

sqlite> select * from Image where file_path like '%svg%';
[...]
content/blog/20230820 basement layout current.svg|6c100e17a1d34e08a52873625ac9419c|103247640,1692556747.046811,145966,1||||1|6c100/20230820 basement layout current.svg
content/blog/20230820 basement layout aspiration.svg|0acf5a76327e03754eb2a369e0f74828|103247666,1692556765.9967458,16293,1||||1|0acf5/20230820 basement layout aspiration.svg
/Users/fluffy/Documents/beesbuzz.biz/content/blog/20230820 basement layout current.svg|6c100e17a1d34e08a52873625ac9419c|103247640,1692556747.046811,145966,1||||1|6c100/20230820 basement layout current.svg
/Users/fluffy/Documents/beesbuzz.biz/content/blog/20230820 basement layout aspiration.svg|0acf5a76327e03754eb2a369e0f74828|103247666,1692556765.9967458,16293,1||||1|0acf5/20230820 basement layout aspiration.svg
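
To check whether a database is affected, the rows can be grouped by asset identifier. A rough sketch using Python's `sqlite3` module follows. It assumes the `Image` table has `file_path` and `asset_name` columns (names taken from the query and traceback above); the in-memory rows are shortened stand-ins, not real data, and `find_duplicate_assets` is a hypothetical helper.

```python
import sqlite3


def find_duplicate_assets(conn):
    """Return {asset_name: [file_path, ...]} for names indexed more than once."""
    rows = conn.execute(
        """
        SELECT asset_name, file_path FROM Image
        WHERE asset_name IN (
            SELECT asset_name FROM Image
            WHERE asset_name IS NOT NULL AND asset_name != ''
            GROUP BY asset_name HAVING COUNT(*) > 1
        )
        ORDER BY asset_name, file_path
        """
    ).fetchall()
    dupes = {}
    for asset_name, file_path in rows:
        dupes.setdefault(asset_name, []).append(file_path)
    return dupes


# Simplified stand-in for the real schema and data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Image (file_path TEXT, asset_name TEXT)")
conn.executemany(
    "INSERT INTO Image VALUES (?, ?)",
    [
        ("content/blog/aspiration.svg", "0acf5/aspiration.svg"),
        ("/abs/content/blog/aspiration.svg", "0acf5/aspiration.svg"),
        ("content/blog/photo.jpg", ""),  # non-asset images have no asset name
        ("content/blog/other.jpg", ""),
    ],
)
duplicates = find_duplicate_assets(conn)
```

Rows with an empty or missing asset name are skipped on purpose, since many ordinary images share that value without being duplicates.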

Possible Solution

At the very least, the asset identifier needs to have a uniqueness constraint on its index. This will also help to track down how this full-path asset location is leaking into the content indexer in the first place.
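
As a sketch of what the constraint would buy, here is a simplified stand-in schema in plain `sqlite3` (not Publ's real model definition; the index name and the shortened paths are made up for the example). With a unique index in place, the second registration fails at indexing time instead of at retrieval time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Image (file_path TEXT PRIMARY KEY, asset_name TEXT)")
# Hypothetical index name; the point is the UNIQUE keyword
conn.execute("CREATE UNIQUE INDEX idx_image_asset_name ON Image (asset_name)")

conn.execute(
    "INSERT INTO Image VALUES (?, ?)",
    ("content/blog/a.svg", "0acf5/a.svg"),
)

try:
    # Same asset, registered again under an absolute path
    conn.execute(
        "INSERT INTO Image VALUES (?, ?)",
        ("/abs/content/blog/a.svg", "0acf5/a.svg"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Image").fetchone()[0]
```

The resulting exception would carry a stack trace pointing at whichever code path tried to register the full path.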

fluffy-critter commented 1 year ago

asset_name can be None (and usually is for actual images), so a uniqueness constraint can't simply be placed on its index. It's still unclear what sometimes causes the full path to be registered in the database. A better fix might be to store asset paths in their own separate table.
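
A rough sketch of the separate-table idea, again in plain `sqlite3` rather than Publ's ORM. The `Asset` table, its columns, and `register_asset` are hypothetical names for illustration only. Images without an asset identifier never get a row in `Asset`, so the identifier can be a primary key there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Image (
        id INTEGER PRIMARY KEY,
        file_path TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Asset (
        asset_name TEXT PRIMARY KEY,
        image_id INTEGER NOT NULL REFERENCES Image (id)
    );
    """
)


def register_asset(conn, file_path, asset_name=None):
    """Index a file; only files served as assets get a row in Asset."""
    with conn:  # commit on success, roll back both inserts on failure
        cur = conn.execute("INSERT INTO Image (file_path) VALUES (?)", (file_path,))
        if asset_name is not None:
            conn.execute(
                "INSERT INTO Asset (asset_name, image_id) VALUES (?, ?)",
                (asset_name, cur.lastrowid),
            )


# Ordinary images carry no asset name and do not collide with each other
register_asset(conn, "content/blog/photo.jpg")
register_asset(conn, "content/blog/other.jpg")
register_asset(conn, "content/blog/a.svg", "0acf5/a.svg")

try:
    register_asset(conn, "/abs/content/blog/a.svg", "0acf5/a.svg")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

image_count = conn.execute("SELECT COUNT(*) FROM Image").fetchone()[0]
asset_count = conn.execute("SELECT COUNT(*) FROM Asset").fetchone()[0]
```

Because the rejected registration is rolled back as a unit, the stray full-path row never lands in `Image` either.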