freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

Fix file extension being identified as .bin #167

quevon24 closed 1 year ago

quevon24 commented 1 year ago

This issue is related to: https://github.com/freelawproject/courtlistener/issues/2688

The problem is that the United States Court of Appeals for the Seventh Circuit is adding additional information to its PDF files, so the documents do not comply with the PDF specification:

image

You can see the pdf file here: http://media.ca7.uscourts.gov/cgi-bin/OpinionsWeb/processWebInputExternal.pl?Submit=Display&Path=Y2023/D04-27/C:22-2500:J:Brennan:aut:T:fnOp:N:3036932:S:0

This prevents python-magic from identifying the correct content type (it reports application/octet-stream instead of application/pdf), so the file extension is not detected correctly.

To solve this, I updated the microservice that detects file extensions: it now takes the first 1024 bytes and uses a regex to look for the PDF version marker "%PDF-X.X", where X.X is the version, e.g. %PDF-1.6.
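
For anyone reading along, here is a rough sketch of the idea, not the actual code in this PR. The names `sniff_pdf_version`, `guess_extension`, `PDF_HEADER_RE` and `HEAD_SIZE` are made up for illustration, and the fallback to ".bin" is a simplification of whatever the service really does for other content types.

```python
import re
from typing import Optional

# Hypothetical names for illustration; not taken from the doctor codebase.
HEAD_SIZE = 1024
# Matches a version marker such as b"%PDF-1.6" anywhere in the searched bytes.
PDF_HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def sniff_pdf_version(data: bytes) -> Optional[str]:
    """Return the PDF version found in the first HEAD_SIZE bytes, or None."""
    match = PDF_HEADER_RE.search(data[:HEAD_SIZE])
    return match.group(1).decode("ascii") if match else None


def guess_extension(data: bytes, detected_mime: str) -> str:
    """Pick an extension, rescuing PDFs that were reported as octet-stream.

    detected_mime stands in for whatever the content-type detector returned.
    """
    if detected_mime == "application/pdf":
        return ".pdf"
    if (
        detected_mime == "application/octet-stream"
        and sniff_pdf_version(data) is not None
    ):
        return ".pdf"
    # Simplified fallback (assumption): everything else is treated as binary.
    return ".bin"
```

With this sketch, a file that starts with some extra bytes followed by `%PDF-1.6` would get ".pdf" even when the detector says application/octet-stream, while a file with no marker in its first 1024 bytes would still fall back to ".bin".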

mlissner commented 1 year ago

Great! I made one tweak to add the link to the issue, but otherwise this looks great. Now to figure out how to deploy it!

mlissner commented 1 year ago

OK, it auto builds and deploys. How lovely. Moving on. Thank you Kevin!