cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

2024-05-11 imghdr parity updates #75

Closed NebularNerd closed 3 months ago

NebularNerd commented 3 months ago

IMGHDR Parity update

Closes #68

These updates ensure PureMagic can match everything imghdr could, and in most cases match it better.

.jpg (No changes):

There are improvements we could make to .jpg, such as combining the JFIF and EXIF matches in regexes, but that can wait until after v2.0, when we can create more detailed, higher-confidence matches.

.png (No changes):

Nothing to change, this matches PureMagic. All PNGs will have this header.

.gif (No changes):

Nothing to change, this matches PureMagic. All GIFs will have one or the other header.

.tiff /.tif (Tidying):

PureMagic already uses better matches with 0x49492a00 and 0x4d4d002a, which pretty much ensures it's a TIFF. There were loads of duplicate TIFF entries, so I have removed the extraneous longer matches and duplicates. There seems to be another TIFF header of 0x492049, which is in PureMagic and loads of other file ID lists; however, it's not in the official spec. More investigation is needed on that before potential removal as a duff entry.
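As a rough illustration of why the two 4-byte matches are enough, here is a minimal sketch. The helper name `tiff_byte_order` is hypothetical and not part of PureMagic; it only checks the two signatures mentioned above.

```python
# Hypothetical helper for illustration only, not PureMagic's API.
# The two 4-byte signatures are the ones quoted in the text above.
TIFF_MAGICS = {
    bytes.fromhex("49492a00"): "little-endian (II)",
    bytes.fromhex("4d4d002a"): "big-endian (MM)",
}


def tiff_byte_order(header: bytes):
    """Return the byte order implied by the first 4 bytes, or None if not TIFF."""
    return TIFF_MAGICS.get(header[:4])
```

Matching all four bytes rather than just `II` or `MM` is what cuts down on false positives from files that merely start with those two letters.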

.rgb SGI image (Enhanced):

The PureMagic match is specifically for an SGI RGB Image with the following properties: RLE Compressed, 1 bpc, Multiple 2D Images. As mentioned in #68, this would be a great format for rule-based matching as the header contains a lot of information and has a long 404 dummy bytes chunk at the end. For now, I shall use basic matches similar to my PCX/MP3 work to ensure we get a good baseline to build on for the future.

.pbm / .pgm / .ppm (Improvements):

UPDATED: I've added better descriptions to allow for ASCII or BINARY variants, and added multi-matches for SPACE, TAB and WIN/NIX newlines, repeated for those also followed by a #. This will improve matches, but they will always have a lowish confidence. I also removed a rogue PGM match that was floating around by itself.

.sun Sun Raster (Enhanced):

The header above is in every SUN raster file. I have added some variant-specific info to improve confidence and provide better info about the flavour of the file. Again, in the future there is a little more we could describe about the file if we wanted to.

.xbm X Bitmap (New!):

A new format for PureMagic, every file uses the above header. Oddly, despite Wikipedia giving it a mime type, it's not listed at IANA. Improvements could be possible later by regex-ing width and height.
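A minimal sketch of what regex-ing width and height might look like, assuming the usual XBM convention of C-style `#define <name>_width N` and `#define <name>_height N` lines. The pattern and the function name `xbm_dimensions` are hypothetical, not something PureMagic provides.

```python
import re

# Assumption: dimensions appear as "#define <name>_width N" / "#define <name>_height N".
XBM_DIM = re.compile(rb"#define\s+\S*_(width|height)\s+(\d+)")


def xbm_dimensions(data: bytes):
    """Return (width, height) from XBM #define lines, or None if either is missing."""
    found = {m.group(1): int(m.group(2)) for m in XBM_DIM.finditer(data)}
    if b"width" in found and b"height" in found:
        return found[b"width"], found[b"height"]
    return None
```

Finding both values could raise confidence in the match, since plain C headers rarely define both names with a shared prefix.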

.bmp and variants (No change to BMP yet, added some other headers):

That's a tiny header, but improving matches appears to require reading the DIB header, then converting that into more readable data from the DWORD32 string. I've added some other headers from the format spec while I was there, but again, they are small and need the more detailed matching to gain higher confidence. Without looking into it too much right now, I believe we would need to do something as shown here
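For reference, a rough sketch of reading the DIB header with `struct`, assuming the common layout of a 14-byte file header followed by a DIB header whose first DWORD is its own size (40 for BITMAPINFOHEADER). The function name `read_bmp_info` is hypothetical, and this is not how PureMagic currently matches BMP.

```python
import struct


def read_bmp_info(data: bytes):
    """Return basic fields from a BMP header, or None if the data is not a BMP."""
    if len(data) < 18 or data[:2] != b"BM":
        return None
    # File header after the "BM" signature: file size, two reserved words,
    # and the offset of the pixel data (all little-endian).
    file_size, _reserved1, _reserved2, pixel_offset = struct.unpack_from("<IHHI", data, 2)
    # The DIB header starts at byte 14 with its own size.
    (dib_size,) = struct.unpack_from("<I", data, 14)
    info = {"file_size": file_size, "pixel_offset": pixel_offset, "dib_size": dib_size}
    # Assumption: headers of 40 bytes or more share the BITMAPINFOHEADER field layout.
    if dib_size >= 40 and len(data) >= 30:
        width, height, planes, bpp = struct.unpack_from("<iiHH", data, 18)
        info.update(width=width, height=height, planes=planes, bits_per_pixel=bpp)
    return info
```

Smaller DIB headers use a different field layout, so the sketch only reports their size rather than guessing at the dimensions.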

.webp (Enhanced and Tidied):

The RIFF header is used by all manner of files (the acronym standing for Resource Interchange File Format). PureMagic supports .webp but with a mix of headers on their own, one for RIFF, one for WEBP and another which would have only matched the file it came from. The fix to this is to split the match into RIFF then multi-match WEBPVP8 for lossy, WEBPVP8L for lossless, WEBPVP8X for extended and WEBP for fringe cases.

Moving forward we can look to improve all RIFF based files in a similar way with multi-matches and other potential v2.0 enhancements
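To make the RIFF-then-WEBP split concrete, here is a small sketch. It assumes the standard RIFF layout (`RIFF` at byte 0, a 4-byte size, the form type at byte 8, and the first chunk FourCC at byte 12); note the lossy FourCC is `VP8 ` with a trailing space. `webp_variant` is a hypothetical name for illustration, not PureMagic code.

```python
# Hypothetical illustration of the multi-match idea, not PureMagic's implementation.
WEBP_VARIANTS = {
    b"VP8 ": "lossy",      # FourCC includes a trailing space
    b"VP8L": "lossless",
    b"VP8X": "extended",
}


def webp_variant(data: bytes):
    """Return the WebP flavour, "unknown" for fringe cases, or None if not WebP."""
    if len(data) < 12 or data[:4] != b"RIFF" or data[8:12] != b"WEBP":
        return None
    return WEBP_VARIANTS.get(data[12:16], "unknown")
```

The same two-stage shape (container first, then form type and first chunk) would apply to other RIFF-based formats.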

.exr OpenEXR (Added mimetype):

Nothing to change, but I added the mimetype from Wikipedia. There is potential in the future to expand details such as variants and versions (there seem to be at least 1.7 and 2.0 from an initial scan of the info).

BONUS Types:

A PR for just a couple of tweaks is no fun, let's add some more....

Quite OK Image Format .qoi:

A lightweight image format for games. Found it while looking at a port of Wipeout to various platforms

Quite OK Audio Format .qoa:

A lightweight audio format for games. Found it while looking at a port of Wipeout to various platforms

SimCity 2000 .sc2 maps:

I've been playing with the SC2KRender, always loved SimCity 2000 so why not add it. This is an IFF file, so it uses the FORM we know and love that started the whole multi-match upgrade. Like RIFF above, the IFF FORM format has a lot of sub-variants that will benefit from multi-matching. This should understand Mac, Amiga and PC created maps.

TZX Cassette image .tzx:

Primarily a ZX Spectrum emulator format, it's now used by a variety of 8-bit emulators as the de facto proper way to archive a tape.

PFM, Augmented PFM and PAM:

A couple of extra formats added while PBM/PGM/PPM fixing, these are extensions of the NETPBM format. PAM could be improved later with a regex for ENDHDR which would always be present but not at a fixed byte position.
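A minimal sketch of the ENDHDR idea, assuming a PAM header starts with a `P7` line and ends with an `ENDHDR` line somewhere later. The pattern and the name `pam_header_length` are hypothetical and only show that a regex can cope with the non-fixed position.

```python
import re

# Assumption: the header runs from "P7\n" to the first line that is exactly "ENDHDR".
PAM_HEADER = re.compile(rb"\AP7\n.*?^ENDHDR\n", re.DOTALL | re.MULTILINE)


def pam_header_length(data: bytes):
    """Return the length of the PAM header in bytes, or None if no header is found."""
    match = PAM_HEADER.match(data)
    return match.end() if match else None
```

In practice the search would probably be limited to the first few hundred bytes so large files are not scanned in full.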

NebularNerd commented 3 months ago

Mmmm, why do we have some weirdness with missing bytes in the .json, yet I did not touch those? Need to double check before committing. 🤔

Fixed it! Not sure what happened there, ready for the Pull

cclauss commented 3 months ago

Awesome!! I stumbled with this for 30 minutes today before realizing that this repo's active branch is develop, not master.

2024-05-11 imghdr parity updates #75

NebularNerd wants to merge 9 commits into cdgriffith:master from NebularNerd:2024-05-11-IMGHDR-Parity

It is unintuitive, but cdgriffith:master above needs to be reset to cdgriffith:develop to get the currently modified code and tests. You can do this by scrolling to the top of this page, clicking on the Edit button, and then clicking the base: master popup menu on the second line.

NebularNerd commented 3 months ago

Normally @cdgriffith moves them over when he's ready to look at them. If he prefers, I can set the base to develop when creating the pull.

cclauss commented 3 months ago

For #76 or #76.nextgen, I would be interested in creating parameterized tests just by reading puremagic/magic_data.json but I do not understand how to read that file. I would suggest we also need a puremagic/magic_data.md file that documents how that json file is laid out. The proposed .md file could also contain the notes and URLs in the commit message above.

Creating puremagic/magic_data.md could be moved to a separate pull request.

NebularNerd commented 3 months ago

The magic.json may die in the future based on @cdgriffith's musings in #70, so at this point it may not be worth the time creating tests for something that may not exist in the future.

The .json is pretty straightforward if we take a line: ["5037", 0, ".pam", "image/x-portable-arbitrarymap", "Portable Arbitrary Map"] This gives us: HEX, STARTING BYTE, EXTENSION, MIMETYPE, NAME

If there is a matching section in multi-part such as:

"5037" : [
  ["0a", 2, ".pam", "", "Augmented Portable Float Map"],
  ["0d", 2, ".pam", "", "Augmented Portable Float Map"],
  ["0a5749445448", 2, ".pam", "", "Augmented Portable Float Map"],
  ["0d5749445448", 2, ".pam", "", "Augmented Portable Float Map"],
  ["0a484549474854", 2, ".pam", "", "Augmented Portable Float Map"],
  ["0d484549474854", 2, ".pam", "", "Augmented Portable Float Map"]     
]

This will then look for every pattern listed at the given byte offset. The offset can be a positive number, or a negative number to work backwards from the end of the file (see -128 for 544147/TAG in MP3s).
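A rough sketch of how one entry could be checked against file bytes, based only on the layout described above (HEX, STARTING BYTE, EXTENSION, MIMETYPE, NAME). The functions `entry_matches` and `multi_part_matches` are hypothetical names for illustration and are not PureMagic's actual matching code, which also produces confidence scores.

```python
def entry_matches(data: bytes, entry) -> bool:
    """Check one [hex, offset, extension, mimetype, name] entry against the data."""
    pattern = bytes.fromhex(entry[0])
    offset = entry[1]
    if offset < 0:
        # Negative offsets count backwards from the end of the file.
        start = len(data) + offset
        if start < 0:
            return False
    else:
        start = offset
    return data[start:start + len(pattern)] == pattern


def multi_part_matches(data: bytes, header_entry, multi_part: dict) -> list:
    """Return the multi-part entries that match, given the header entry matched first."""
    if not entry_matches(data, header_entry):
        return []
    return [e for e in multi_part.get(header_entry[0], []) if entry_matches(data, e)]
```

For the `5037` example, a file starting with `P7` followed by a newline would match the header entry at offset 0 and then the `0a` sub-entry at offset 2.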

Once all matching is done, the confidence scores are generated from the results list. As I was asking in #76, if you are trying to test the strings as an exact match they may fail. @cdgriffith's original goal (I assume) with the confidence method is to ensure that the best real-world match is given.

In my personal approach to the PRs, I make use of official specs and test files as far as possible, which is why my PRs can sometimes be a bit wordy when explaining choices and the reasons behind them.

For example, while TIFF uses MM or II, I'll use a longer match based on the specs (which state that all TIFFs will be 4d4d002a or 49492a00) to boost confidences and prevent false matches.

cdgriffith commented 3 months ago

Thank you for all this hard work @NebularNerd !

Feel free to set it directly to develop, as that should be latest and what will be in next release. Otherwise when I switch it may force you to do refactoring, and don't want double work!

@cclauss Sorry for no clear documentation starting out for the magic data, never assumed anyone would actually work on this repo but me 😆

I still see commits coming in, @NebularNerd. When you are complete, let me know and I can merge!

NebularNerd commented 3 months ago

I'll call it done for now. The main thing was to get parity with imghdr. 😎

NebularNerd commented 3 months ago

You might have to reopen #2 after. I did not know GitHub looks at the title and uses things in there to close issues. One of my commits has a title that I cannot edit that may close it.

cdgriffith commented 3 months ago

The readthedocs has been decommed for a while, thank you, merging!