fourdigits / wagtail_textract

Text extraction for Wagtail document search
BSD 3-Clause "New" or "Revised" License

Textract runs when document file hasn't updated #7

Open kaedroho opened 6 years ago

kaedroho commented 6 years ago

Currently, there appears to be no check for whether the file has actually changed before rerunning textract, so it probably reruns even if the user has only updated the title.

@gasman and I were discussing adding file hashing to Wagtail Images/Documents for cache-busting, and that might help solve this issue too.

khink commented 6 years ago

You are right, it always runs, even if only the title or tags were changed. There is no way to tell in save() what was updated, so yes, a hash would come in handy.
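As a rough illustration of the hashing idea, here is a minimal sketch in plain Python. It assumes the hash of the last processed file is stored somewhere alongside the document. The helper names `hash_file_contents` and `needs_transcription` are hypothetical and not part of wagtail_textract or Wagtail, and SHA-1 is chosen only for illustration.

```python
import hashlib
import io


def hash_file_contents(fileobj, chunk_size=65536):
    """Return the SHA-1 hex digest of a file-like object, read in chunks."""
    digest = hashlib.sha1()
    fileobj.seek(0)
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)  # leave the file ready for the next reader
    return digest.hexdigest()


def needs_transcription(fileobj, stored_hash):
    """Hypothetical helper: True when the contents differ from the stored hash."""
    return hash_file_contents(fileobj) != stored_hash


# Hash recorded the last time text extraction ran (assumed to be persisted).
original = io.BytesIO(b"quarterly report, first draft")
stored_hash = hash_file_contents(original)

# Title-only edit: the bytes are identical, so extraction could be skipped.
unchanged = needs_transcription(
    io.BytesIO(b"quarterly report, first draft"), stored_hash
)

# New upload: the bytes differ, so extraction should run again.
changed = needs_transcription(io.BytesIO(b"quarterly report, final"), stored_hash)
```

Reading in chunks keeps memory use flat for large documents, and comparing content hashes also catches a new file uploaded under the same filename.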

allcaps commented 6 years ago

Maybe you can query the currently stored object in the save method and check whether the file field has changed. https://stackoverflow.com/questions/12461100/django-detecting-changed-imagefield-with-same-filename
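A minimal sketch of that comparison, written without Django so it runs on its own. The helper `file_field_changed`, the `file_name` attribute, and the in-memory `db` dict are all hypothetical stand-ins. In a real model, `load_stored` could wrap something like `type(self).objects.filter(pk=self.pk).first()`, and the name would come from the file field.

```python
from types import SimpleNamespace


def file_field_changed(instance, load_stored):
    """Hypothetical helper: compare the file name with the stored copy.

    `load_stored` is a callable that returns the stored object for a
    primary key, or None if no such object exists.
    """
    if instance.pk is None:
        return True  # new object: there is nothing to compare against
    stored = load_stored(instance.pk)
    if stored is None:
        return True
    return stored.file_name != instance.file_name


# Stand-in for the database: primary key -> stored object.
db = {1: SimpleNamespace(pk=1, title="Report", file_name="documents/report.pdf")}

title_edit = SimpleNamespace(
    pk=1, title="Annual report", file_name="documents/report.pdf"
)
new_upload = SimpleNamespace(
    pk=1, title="Report", file_name="documents/report_v2.pdf"
)
brand_new = SimpleNamespace(pk=None, title="Memo", file_name="documents/memo.pdf")

title_only = file_field_changed(title_edit, db.get)    # only the title differs
replaced = file_field_changed(new_upload, db.get)      # the file name differs
first_save = file_field_changed(brand_new, db.get)     # not stored yet
```

Note that comparing names alone would miss a new file uploaded under the same filename, which is the case the linked question is about. It also costs one extra query per save.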

khink commented 6 years ago

@kaedroho Is there an issue in Wagtail for file hashing? Is it likely to land in Wagtail any time soon?

kaedroho commented 6 years ago

Yep, there's a PR here: https://github.com/wagtail/wagtail/pull/4526

zerolab commented 6 years ago

The document file hash will make it into Wagtail 2.4, so this could probably be looked at with the latest Wagtail master.