cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License
169 stars 33 forks source link

imghdr matches in PureMagic? #68

Closed NebularNerd closed 4 months ago

NebularNerd commented 6 months ago

I've re-titled this as it got a bit off-track from the original purpose of the issue.

@cclauss wanted to know if PureMagic could provide a 1:1 replacement of imghdr, to avoid crosstalk I've moved the original content to #69

cclauss commented 6 months ago

Did imghdr have tests we should attempt to run here?

NebularNerd commented 6 months ago

Looking at the imghdr source it runs a very primitive test, and can only match 0x01da / b'\001\332'.

https://github.com/python/cpython/blob/af8db2b6817e1ae25cbee98e67d1f4bac9d6af9c/Lib/imghdr.py#L73-L76

The current magic in PureMagic would not pick this up as it currently has a longer variant specific match, adding a short match like above is fine, but it can lead to low confidence scores and potential clashes if another file uses the same header. The best approach for good match scores and a definitive solid match is to use the longest header and multi-match possible.

I'll likely run up a PR using the same conventions as I did for PCX, this issue is more about a thought about naming conventions and potential better presentation of the results.

Also, as an aside, SGI would make a great example of a rule-based detection as the header is quite complicated with about 6 unique fixed fingerprints.

cclauss commented 6 months ago

I meant all the tests in that file for all the format, not just rgb.

NebularNerd commented 6 months ago

Looking at the source again, a quick run-down:

X Bitmap looks to not be present, SGI we know needs some love, the other formats are supported to at least the same degree the imghdr uses (there are a couple that need tidying and testing). If I do a PR then it would include the missing formats, plus improvements to the existing ones where I think PureMagic can now do a better job.

NebularNerd commented 4 months ago

I'll close this as my PR #75 was merged but it did not close this.