UB-Mannheim / zotero-ocr

Zotero Plugin for OCR
GNU Affero General Public License v3.0
547 stars 40 forks source link

Pages in Image-list.txt are not guaranteed to be in order. #76

Closed callisto-jovy closed 3 months ago

callisto-jovy commented 4 months ago

After finding that a lot of my OCR'd PDFs had pages that were out of oder, I checked the output generated by the plugin and found this:

Example of the output generated:

/Users/username/Zotero/storage/2GESJB9C/page-15.png
/Users/username/Zotero/storage/2GESJB9C/page-01.png
/Users/username/Zotero/storage/2GESJB9C/page-14.png
/Users/username/Zotero/storage/2GESJB9C/page-02.png
/Users/username/Zotero/storage/2GESJB9C/page-16.png
/Users/username/Zotero/storage/2GESJB9C/page-17.png
...

As you can clearly see, the pages are completely out of order. The issue is caused by iterating over the directorie's contents and simply performing a pattern-match. This, however, does not guarantee for the items to be input into the array in order. Now there are two possibilities: Sort the array after the fact or populate the array while extracting the PNGs. If you'd like me to, I will open a PR to resolve this issue.

I'm just surprised that seemingly noone has had this issue before me.

stweil commented 4 months ago

As long as the new png files are created in the right order, reading the directory will typically return the same order. But you are right, there is no gurantee. That only might explain why it does not occur often and why nobody reported it up to now.

A PR is very welcome.

stweil commented 4 months ago

It looks like src/chrome/content/zoteroocr.js for Zotero 6 uses a different implementation which should work fine, so only src/zotero-ocr.js for Zotero 7 needs a patch (either sort the directory entries or use the same code as for Zotero 6).

aborel commented 4 months ago

Thanks for the PR! I don't have time to review right now, and there are more changes than I would like due to different JS style conventions, but the proposed logic is probably fine. If another admin wants to accept the PR as is and deploy the bugfix ASAP, I can live with that.

callisto-jovy commented 4 months ago

As long as the new png files are created in the right order, reading the directory will typically return the same order. But you are right, there is no gurantee. That only might explain why it does not occur often and why nobody reported it up to now.

A PR is very welcome.

You're right, at any rate, the PNGs should be created one-by-one. Seems like IOUtils.getChildren isn't concerned with any order. Might also differ based on the platform-implementation.

aaajjjsss commented 4 months ago

Hiya, Wondering if anyone has any insights on how to fix this? I too am finding pdf pages out of order when using the plugin with Zotero 7. It worked so well with Zotero 6. I am very out of my element with all of this stuff, so apologies if the fix is noted above and I just can't understand it.

callisto-jovy commented 4 months ago

Hiya, Wondering if anyone has any insights on how to fix this? I too am finding pdf pages out of order when using the plugin with Zotero 7. It worked so well with Zotero 6. I am very out of my element with all of this stuff, so apologies if the fix is noted above and I just can't understand it.

There is do direct fix for this behaviour, you'd have to apply the code changes I submitted in #77 and rebuild the plugin yourself. The easiest way -- if you are not tech-savy -- would be to fork the repo on my profile and just follow the instructions in the README on how to build and install the plugin.

aborel commented 3 months ago

I have pushed a simpler fix (I hope) for this problem. @callisto-jovy could you test https://github.com/UB-Mannheim/zotero-ocr/commit/28474b5ff59d5061587f57c2540b540aa6f296bb and tell us if it works on your system?

stweil commented 3 months ago

@aaajjjsss, maybe you also want to test whether aborel's latest fix works for you and report your results here before I create a bugfix release.

aaajjjsss commented 3 months ago

It works beautifully! Thank you all so much. :)

stweil commented 3 months ago

The issue should be fixed with our new release 0.7.3.