Closed albertisfu closed 1 year ago
> extractions were always failing
Is that true for the version that's deployed or just for the dev version?
This is building and deploying now. Looks good to me.
> Is that true for the version that's deployed or just for the dev version?

The bug was present only in the dev version. The last doctor image that was built worked fine. It seems that `PyPDF2` was recently updated, so building a new image pulled in the latest version of `PyPDF2`, which caused the problem.
Whew! Thanks.
The problem here was the regex to detect if a PDF contains images.
The original version only considered a PDF image description in a single line like:
/Type /XObject /Subtype /Image /Width 1700
But the PDFs that were missing from OCR have the following structure with line breaks:
So in order to solve it I changed the regex to
/Image ?
so that it detects both versions.

I had some problems reproducing the issue since extractions were always failing. Then I realized that the latest version of `PyPDF2` deprecated its methods and replaced them with new ones, so I updated the code to use the new ones.

Added a comment in `DEVELOPING.md` to remind devs to set `DEBUG` to `True` in order to see debug logs and detect errors like the one described above.

Fixed the lint action, which was failing because a Python version was not found.
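
To illustrate the regex change described above, here is a minimal sketch of why a single-line pattern misses image descriptions that are split across lines. The original regex is not shown in this thread, so `SINGLE_LINE` is a hypothetical stand-in. The `multi_line` sample is an assumed layout for illustration only, not the actual structure of the affected PDFs, and `has_images` is a made-up helper name.

```python
import re

# Hypothetical stand-in for the original check: it expects the whole
# image description on one line, separated by single spaces.
SINGLE_LINE = re.compile(rb"/Type /XObject /Subtype /Image")

# Relaxed pattern quoted in the description above.
RELAXED = re.compile(rb"/Image ?")


def has_images(pdf_bytes, pattern=RELAXED):
    """Return True if the raw PDF bytes match the given image pattern."""
    return pattern.search(pdf_bytes) is not None


# Single-line description, as quoted in the description above.
one_line = b"/Type /XObject /Subtype /Image /Width 1700"

# Assumed multi-line layout (illustrative only).
multi_line = b"/Type /XObject\n/Subtype /Image\n/Width 1700"

print(has_images(one_line, SINGLE_LINE))    # single-line pattern, single-line input
print(has_images(multi_line, SINGLE_LINE))  # single-line pattern, multi-line input
print(has_images(one_line))                 # relaxed pattern, single-line input
print(has_images(multi_line))               # relaxed pattern, multi-line input
```

In this sketch the strict pattern only matches the single-line sample, while the relaxed one matches both. A pattern this loose also matches any name that merely starts with `/Image`, so it trades precision for recall.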