This issue is related to: https://github.com/freelawproject/courtlistener/issues/2688
The problem is that the United States Court of Appeals for the Seventh Circuit is adding extra information to its PDF files, so its documents do not comply with the PDF specification.
You can see an example PDF file here: http://media.ca7.uscourts.gov/cgi-bin/OpinionsWeb/processWebInputExternal.pl?Submit=Display&Path=Y2023/D04-27/C:22-2500:J:Brennan:aut:T:fnOp:N:3036932:S:0
This causes python-magic to misidentify the content type (`application/octet-stream` instead of `application/pdf`), so the file extension is not detected correctly.
To solve this, I updated the microservice that detects file extensions: it now reads the first 1024 bytes of the file and uses a regex to look for the PDF version header `%PDF-X.X`, where X.X is the version, e.g. `%PDF-1.6`.
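
For illustration, the header check described above could look roughly like the sketch below. It is not the exact code in the microservice: the function name `detect_pdf_version`, the constant `PDF_HEADER_RE`, and the single-digit `X.X` pattern are assumptions made for this example.

```python
import re
from typing import Optional

# Assumed pattern: the literal "%PDF-" followed by a version such as 1.6.
PDF_HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def detect_pdf_version(data: bytes, window: int = 1024) -> Optional[str]:
    """Return the PDF version found in the first `window` bytes, or None."""
    # Only the leading bytes are inspected, so extra data placed before
    # the header is tolerated as long as the header is within the window.
    match = PDF_HEADER_RE.search(data[:window])
    if match is None:
        return None
    return match.group(1).decode("ascii")


if __name__ == "__main__":
    sample = b"extra bytes before the header\n%PDF-1.6\nrest of file"
    print(detect_pdf_version(sample))
```

A caller could treat any non-`None` result as `application/pdf` with a `.pdf` extension. How this check is combined with the python-magic result is left out here, since that depends on the microservice.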