cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

How to handle two sets of bytes for matching improvements? #46

Closed NebularNerd closed 6 months ago

NebularNerd commented 7 months ago

Hi there,

I'm looking for a Python package to help identify weird and wonderful files inside various scripts. I had seen fleep, but that appears to be dead. Puremagic looks to offer the same functionality for what I need.

One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love: https://github.com/cdgriffith/puremagic/blob/ff042db17e7477bbabcb9c5b7e8562a697f6b1cb/puremagic/magic_data.json#L1084

But in Amiga land that .iff FORM header is used for many things (see Wikipedia: List_of_file_signatures).

Is there a way to improve mapping and confidence by adding additional matching strings such as ILBM, ACBM, etc.? I'm happy to help with a PR if it can be done.

cdgriffith commented 6 months ago

What we could do there is, instead of matching FORM at offset 0, change to the offset where the more accurate info lives and match there instead.

We don't currently have a way to do wildcards, so we can't be as accurate as matching both FORM and ACBM.
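
A minimal sketch of that offset-only idea, assuming the standard IFF layout (a 4-byte FORM tag, a 4-byte big-endian size, then a 4-byte form type at offset 8). The table and function names here are made up for illustration and are not part of puremagic:

```python
from typing import Optional

# Hypothetical lookup table: the form type stored at offset 8 of an IFF file.
IFF_FORM_TYPES = {
    b"ILBM": "IFF Interleaved Bitmap image",
    b"ACBM": "Amiga Contiguous Bitmap image",
    b"8SVX": "IFF 8-bit sampled voice",
}


def iff_form_type(header: bytes) -> Optional[str]:
    """Look only at bytes 8..11; the FORM tag at offset 0 is not checked."""
    return IFF_FORM_TYPES.get(header[8:12])


# Fake header: "FORM", a 4-byte size field, then the form type.
sample = b"FORM" + (4).to_bytes(4, "big") + b"ILBM"
print(iff_form_type(sample))
```

Because only offset 8 is inspected, any file that happens to carry those four bytes there would also match, which is the accuracy trade-off mentioned above.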

Thanks for the info, I can work on that when I have time. If you know a source of sample files for that please share!

NebularNerd commented 6 months ago

Instead of, or in addition to, wildcards, another option could be a dual match. Taking our .iff sample, we could do something like:

[["464f524d","494c424d"], [0,8], "", "application/x-iff", "IFF file"],

If your code sees a list instead of a string, it would process each hex match using the corresponding offset from the next list. If both match, we get pretty much 100% confidence it's what we think it is. The logic is a little weirder than wildcarding, but it's another possible way.
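
Roughly, the check could look like the sketch below. This is only an illustration of the proposed entry shape, not how puremagic works internally; `matches` is a hypothetical helper name:

```python
from binascii import unhexlify


def matches(header: bytes, signatures, offsets) -> bool:
    """Return True only if every hex signature is found at its paired offset."""
    # Accept the existing single-signature shape (str, int) as well as
    # the proposed list shape ([str, ...], [int, ...]).
    if isinstance(signatures, str):
        signatures, offsets = [signatures], [offsets]
    for hex_sig, offset in zip(signatures, offsets):
        sig = unhexlify(hex_sig)
        if header[offset:offset + len(sig)] != sig:
            return False
    return True


# Entry in the proposed shape: "FORM" at offset 0 and "ILBM" at offset 8.
entry = [["464f524d", "494c424d"], [0, 8], "", "application/x-iff", "IFF file"]

# Fake headers: "FORM", a 4-byte size field, then the form type.
ilbm = b"FORM" + (4).to_bytes(4, "big") + b"ILBM"
svx = b"FORM" + (4).to_bytes(4, "big") + b"8SVX"

print(matches(ilbm, entry[0], entry[1]))
print(matches(svx, entry[0], entry[1]))
```

The first header passes both checks; the second has FORM but a different form type, so it is rejected rather than being reported as a generic IFF file.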

Aminet is pretty much the internet's oldest resource for all things Amiga, so we should be able to find samples of pretty much everything there.

7zip will happily unpack most of the .lha and other formats you'll find there. If you get stuck on any let me know and I'm sure I can unearth samples from somewhere.

cdgriffith commented 6 months ago

Thanks for the samples! Added a multi-part detect.

Should be working in 1.20 https://github.com/cdgriffith/puremagic/releases/tag/1.20

NebularNerd commented 6 months ago

Nice! I've just looked at the implementation and that's a great way to handle it, much tidier than mine. I'll test it out later on a script I have for converting images between formats.

For retro uses this will be handy, as a lot of older formats, such as file packers, use a two-part fingerprint.