freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

139 Fixes regex to detect images in PDF files #156

Closed albertisfu closed 1 year ago

albertisfu commented 1 year ago

The problem was in the regex that detects whether a PDF contains images.

The original version only matched a PDF image description written on a single line, like: /Type /XObject /Subtype /Image /Width 1700

But the PDFs that were not getting OCR'd have the following structure, with line breaks:

/Length 118
/Subtype 
/Image
/Width 1700

To fix this, I changed the regex to /Image ? so that it matches both layouts.
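As a rough illustration of the change described above, the sketch below checks both layouts against the /Image ? pattern. The helper name has_image_marker and the stricter comparison pattern are assumptions made for this example; the PR description does not quote the earlier regex, so the stricter pattern is only a stand-in for "matches the single-line form only".

```python
import re

# Pattern quoted in the PR description: "/Image" followed by an optional space.
IMAGE_PATTERN = re.compile(rb"/Image ?")

# Hypothetical stand-in for a stricter pattern (an assumption, not taken from
# the PR): it requires the next "/" key to follow on the same line.
SINGLE_LINE_PATTERN = re.compile(rb"/Image ?/")


def has_image_marker(pdf_bytes: bytes, pattern: re.Pattern = IMAGE_PATTERN) -> bool:
    """Return True if the raw PDF bytes contain an image marker."""
    return pattern.search(pdf_bytes) is not None


# The two layouts from the PR description.
single_line = b"/Type /XObject /Subtype /Image /Width 1700"
multi_line = b"/Length 118\n/Subtype \n/Image\n/Width 1700"

for label, sample in (("single line", single_line), ("multi line", multi_line)):
    print(
        label,
        "| relaxed:", has_image_marker(sample),
        "| strict:", has_image_marker(sample, SINGLE_LINE_PATTERN),
    )
```

The relaxed pattern matches both samples, while the stricter stand-in misses the layout where a line break follows /Image.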

I had some trouble reproducing the issue because extractions were always failing. Then I realized that the latest version of PyPDF2 deprecated its old methods in favor of new ones, so I updated the code to use the new ones.
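The PR description does not say which PyPDF2 calls were affected. As a minimal sketch, assuming page text extraction was involved, the older camelCase names (getNumPages, getPage, extractText) correspond to pages and extract_text in newer releases. The helper functions and the stand-in reader classes below are hypothetical and exist only so the snippet runs without PyPDF2 installed.

```python
def iter_pages(reader):
    """Return the reader's pages, preferring the newer `pages` attribute."""
    if hasattr(reader, "pages"):
        return list(reader.pages)
    # Older camelCase API.
    return [reader.getPage(i) for i in range(reader.getNumPages())]


def page_text(page):
    """Extract text from a page, preferring the newer snake_case method."""
    if hasattr(page, "extract_text"):
        return page.extract_text()
    return page.extractText()


def extract_text(reader):
    """Join the text of every page with newlines."""
    return "\n".join(page_text(page) for page in iter_pages(reader))


# Hypothetical stand-ins that mimic the two naming styles.
class LegacyPage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class LegacyReader:
    def __init__(self, texts):
        self._pages = [LegacyPage(t) for t in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, index):
        return self._pages[index]


class NewPage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class NewReader:
    def __init__(self, texts):
        self.pages = [NewPage(t) for t in texts]


print(extract_text(LegacyReader(["page one", "page two"])))
print(extract_text(NewReader(["page one", "page two"])))
```

With a real PDF, the reader object would come from PyPDF2 itself rather than from these stand-ins.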

mlissner commented 1 year ago

extractions were always failing

Is that true for the version that's deployed or just for the dev version?

mlissner commented 1 year ago

This is building and deploying now. Looks good to me.

albertisfu commented 1 year ago

Is that true for the version that's deployed or just for the dev version?

The bug was present only in the dev version. The last doctor image that was built worked fine. It seems that PyPDF2 was recently updated, so building a new image pulled in the latest version of PyPDF2, which caused the problem.

mlissner commented 1 year ago

Whew! Thanks.