cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License
162 stars 33 forks source link

2024-08-05 Tiny file fix and Confidence correction #97

Closed NebularNerd closed 1 month ago

NebularNerd commented 1 month ago

Closes #96 Closes #86

Tiny File Issue

Puremagic is having trouble with super small files of less than 1k when opened as a stream, as the matches became better an issue has cropped up with footer matches trying to go into negative numbers. My previous PR #87 added a match for .dmg files which starts at -512 bytes from the end, which when the file is only a couple of hundred bytes made Puremagic very sad.

This led to the errors shown in both the above issues, this fix adds a try/except into the stream logic, if the max_foot length exceeds the file size, then simply set it to the file size. I can't see there being an issue with this as it just results in footer being the whole file. While it could cause issues if you had a multi-GB file and some absurd even larger multi-GB negative match the probabilities of that are near zero.

Confidence fix

In a previous PR I submitted (#66) I dropped the ball and did not set the reverse flag to False, this meant that the matches were not being sorted correctly. The longest matches were first but in the wrong order e.g:

Results from my script which just give me a tidied version of the output.

Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01*XMCDI'
Hex:           4944 3303 0000 0001 2a58 4d43 4449
String:        ID3*XMCDI

Should be:

Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01 \x1eMCDI'
Hex:           4944 3303 0000 0001 201e 4d43 4449
String:        ID3 MCDI

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Sorry for breaking things by making them better... a3241c1c-4aa9-44ec-b573-865ec7edcd8d_text

NebularNerd commented 1 month ago

@cdgriffith I'm getting build errors which look to be to do with pytest, is my change breaking that or something else?

EDIT Ah, it seems my confidence fix is breaking the pytest.....

0c5ac1315cd51d14fbaa81c928ad6a21

Let me try and fix that....

NebularNerd commented 1 month ago

OK fixed the issue, sorry for the dozen commits, I seem to not get on well with Black and Lint stylings (I need to really understand all this better for the future)

The fix is a bit of a hack, its sole task is to allow pytest to pass, if there is a better way to fix this then that's cool. I've added a test that looks to see if pytest is running, if so, leave reverse set to True so the imghdr tests get the result they expect. If running under normal operation set to False and allow for the best Real World result to be shown.

I tried putting this outside the def so it only ran on calling the puremagic.py but it failed so for now it checks every time.

cdgriffith commented 1 month ago

I think I see the issue, it's not the reverse flag, it's that we were comparing the second option to the values of itself not it's length.

Fix:

    return sorted(results, key=lambda x: (x.confidence, len(x.byte_match)), reverse=True)
cdgriffith commented 1 month ago

Nice find and fix on the small files!

NebularNerd commented 1 month ago

Nice find and fix on the small files!

It was helpful that @FelipeLema had the same issue, I thought it was just due to some wonky issue with my test .aif files.