Improve PDF file detection, fix description

cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers

MIT License

Improve PDF file detection, fix description #93

Closed peterekepeter closed 3 weeks ago

peterekepeter commented 1 month ago

Hi!

There is at least one system out in the wild that produces PDF files which start with a CRLF.

I added it as an extra entry.

From my testing, though, you can have any junk at the front of the file as long as the %PDF- string appears at some point, so a proper fix would be to search for that sequence of bytes/characters.
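A minimal sketch of that "search for the sequence" idea, in plain Python. This is not puremagic's code or API: the function name `find_pdf_marker` and the `search_limit` default of 1024 bytes are assumptions made for illustration only.

```python
PDF_MARKER = b"%PDF-"


def find_pdf_marker(data: bytes, search_limit: int = 1024) -> int:
    """Return the offset of the first %PDF- marker, or -1 if none is found.

    Hypothetical helper: `search_limit` is an assumed cap on how far into
    the file to look, not a value taken from puremagic.
    """
    # bytes.find only reports matches that lie entirely inside [0, search_limit).
    return data.find(PDF_MARKER, 0, search_limit)


# A file that starts with CRLF still matches, just at offset 2 instead of 0.
offset = find_pdf_marker(b"\r\n%PDF-1.4\n")
```

Capping the search keeps the check cheap on large non-PDF files, at the cost of missing a marker that sits deeper than the limit.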

Anyways, stay safe out there!

NebularNerd commented 1 month ago

Part of the v2.0 plan is to find better/faster/more awesome ways to perform matching; my experimental PR #65 would help with these fringe issues.

I never looked at a PDF header before. I notice it has a version in there as well, something to file away for the future for providing more details on matches (per #69).
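As a rough illustration of reading that version, the sketch below pulls the `major.minor` digits that follow `%PDF-`. It is not part of puremagic: the name `pdf_version` and the 1024-byte search window are assumptions.

```python
import re
from typing import Optional

# Matches headers such as %PDF-1.4 or %PDF-2.0 and captures the two digits.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the header version as 'major.minor', or None if no header is found.

    Hypothetical helper: the 1024-byte window is an assumed limit.
    """
    match = PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    major, minor = match.group(1).decode(), match.group(2).decode()
    return f"{major}.{minor}"


version = pdf_version(b"\r\n%PDF-1.4\n")
```

The header value is only a starting point, since a PDF's document catalog can carry a `/Version` entry that overrides it.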

@peterekepeter: Please add Closes #94 to the top of your post so your issue automatically closes when the PR is merged.

peterekepeter commented 1 month ago

My PR does not close #94; it just covers more cases without rearchitecting anything.

I opened an issue separately because the PDF magic sequence can be at any offset inside the file... which is not something the library was planned to do at all.

NebularNerd commented 1 month ago

My bad, I only skimmed the issue when I should have read it more carefully before suggesting the close.

PDFs are something I wanted to look at more later on, as I had a project where I needed to OCR them in bulk. Being able to decipher what flavor they are before carrying out work on them would help cut down unnecessary work.

Looking at Wikipedia: PDF and PDF FileTypes, there is a lot of detail we could look to extract in the future.

cdgriffith commented 3 weeks ago

Thank you for the addition and fix @peterekepeter !