139 Fixes regex to detect images in PDF files

albertisfu commented 1 year ago

The problem here was the regex to detect if a PDF contains images.

The original version only matched a PDF image description written on a single line, like: /Type /XObject /Subtype /Image /Width 1700

But the PDFs that were missing from OCR have the following structure with line breaks:

/Length 118
/Subtype 
/Image
/Width 1700

To fix this, I changed the regex to /Image ? so that it detects both versions.

Added a test that covers a PDF with this kind of structure.
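For illustration, here is a minimal sketch of a whitespace-tolerant check that accepts both layouts shown above. The pattern and the names `IMAGE_XOBJECT` and `pdf_contains_images` are hypothetical, not the regex or helpers merged in this PR; it assumes the raw PDF bytes are searched directly.

```python
import re

# Hypothetical pattern (not the one used in doctor): allow any run of
# whitespace, including line breaks, between /Subtype and /Image.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def pdf_contains_images(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes declare an image XObject."""
    return IMAGE_XOBJECT.search(pdf_bytes) is not None


# The two layouts described in this PR, plus a non-image object.
single_line = b"/Type /XObject /Subtype /Image /Width 1700"
multi_line = b"/Length 118\n/Subtype \n/Image\n/Width 1700"
no_image = b"/Type /Font /Subtype /Type1"

for sample in (single_line, multi_line, no_image):
    print(pdf_contains_images(sample))
```

Matching on `\s*` rather than a single literal space is what lets the same pattern cover the one-line and the line-broken form.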

I had some trouble reproducing the issue because extractions were always failing. Then I realized that the latest version of PyPDF2 deprecated its old methods in favor of new ones, so I updated the code to use the new ones.
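As a rough sketch of what this kind of migration looks like, the snippet below uses the PyPDF2 3.x names (`PdfReader`, `reader.pages`, `page.extract_text()`) in place of the older `PdfFileReader`, `getPage()` and `extractText()`. The function name `extract_pdf_text` is hypothetical, and which calls doctor actually uses is an assumption. The import is guarded only so the sketch loads where PyPDF2 is not installed.

```python
try:
    # PyPDF2 3.x name; the older PdfFileReader is no longer usable there.
    from PyPDF2 import PdfReader
except ImportError:  # PyPDF2 is not installed in this environment
    PdfReader = None


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page using the PyPDF2 3.x API."""
    if PdfReader is None:
        raise RuntimeError("PyPDF2 is required to extract text")
    reader = PdfReader(path)
    # Old style: reader.getNumPages() and reader.getPage(i).extractText()
    # New style: iterate reader.pages and call page.extract_text()
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Pinning the PyPDF2 version in the image's requirements is another way to avoid this kind of surprise when a new image is built.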

Added a note in DEVELOPING.md reminding devs to set DEBUG to True in order to see debug logs and catch errors like the one described above.
Fixed the lint action, which had a problem with a Python version that was not found.

mlissner commented 1 year ago

> extractions were always failing

Is that true for the version that's deployed or just for the dev version?

mlissner commented 1 year ago

This is building and deploying now. Looks good to me.

albertisfu commented 1 year ago

> Is that true for the version that's deployed or just for the dev version?

The bug was present only in the dev version. The last doctor image that was built worked fine. It seems that PyPDF2 was recently updated, so building a new image pulled in the latest version of PyPDF2, which caused the problem.

mlissner commented 1 year ago

Whew! Thanks.

freelawproject / doctor

139 Fixes regex to detect images in PDF files #156