cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

2024-05-01 MP3 Detection improvements #63

Closed NebularNerd closed 3 months ago

NebularNerd commented 4 months ago

Closes #32

MP3s are a strange beast: many bits have been grafted on over the decades, and the word 'standard' requires a big ⭐ next to it when talking about them.

To get a higher (and hopefully definitive) match, I've added versioned main fingerprints and a lot of multi-match data (seriously, loads). This should match pretty much any MP3 you come across with 0.8 confidence (assuming the correct extension), beating false .koz matches into the dirt. I left the non-versioned .mp3 match in and added a TAG multi-match to allow for fringe cases.
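To illustrate the idea of a multi-part match, here is a minimal sketch of pairing a head fingerprint with a secondary signature at another offset, such as `TAG` at 128 bytes from the end of the file. It is not puremagic's actual implementation; `match_multipart` and the sample data are hypothetical names made up for this example.

```python
from typing import Optional, Sequence, Tuple


def match_multipart(
    data: bytes,
    head: bytes,
    parts: Sequence[Tuple[int, bytes]],
) -> Optional[bytes]:
    """Return head + the first secondary signature that matches, else None.

    `head` must appear at offset 0. Each entry in `parts` is
    (offset, signature); a negative offset counts back from the end.
    """
    if not data.startswith(head):
        return None
    for offset, signature in parts:
        start = offset if offset >= 0 else len(data) + offset
        if start < 0:
            continue  # data is too short for this end-relative offset
        if data[start:start + len(signature)] == signature:
            return head + signature
    return None


# Hypothetical sample: a versioned ID3 head, filler, then a 128-byte
# trailing block that starts with "TAG".
sample = b"ID3\x03\x00" + b"\x00" * 200 + b"TAG" + b"\x00" * 125
matched = match_multipart(sample, b"ID3\x03\x00", [(-128, b"TAG")])
```

The longer combined byte string is what lifts a versioned match above a short head-only match such as the 7-byte .koz fingerprint.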

The .json has grown somewhat in file size to accommodate these matches. Technically, we could strip some of the 4-letter matches from v2.3 if I could find which ones did not apply until v2.4; however, there is little data on exactly what the additional ones were. To ensure the best confidence, I also had to duplicate the 4-letter matches for both v2.3 and v2.4. It would also be possible to sacrifice some of the more obscure 3/4-letter matches, but as there is no set rule for the ordering of the frame headers, there is the potential for fringe cases where a rarely used one comes first.
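For context on where the 3- and 4-letter matches sit, the sketch below reads the 10-byte ID3v2 header and the frame ID that immediately follows it at offset 10. It is an illustration only, not code from this PR; `Id3v2Header` and `read_id3v2_header` are hypothetical names. It also assumes no extended header is present (flag bit 0x40 in v2.3/v2.4), since in that case the first frame does not start at offset 10.

```python
from typing import NamedTuple, Optional


class Id3v2Header(NamedTuple):
    major: int
    revision: int
    flags: int
    tag_size: int
    first_frame_id: str


def read_id3v2_header(data: bytes) -> Optional[Id3v2Header]:
    """Parse the ID3v2 header and the frame ID at offset 10, if present."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    # The size is a "synchsafe" integer: four bytes, 7 usable bits each.
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    # v2.2 uses 3-character frame IDs; v2.3 and v2.4 use 4 characters.
    id_length = 3 if major == 2 else 4
    frame_id = data[10:10 + id_length].decode("ascii", errors="replace")
    return Id3v2Header(major, revision, flags, size, frame_id)


# Byte strings copied from the "Bytes Matched" lines in the listings below.
v22 = read_id3v2_header(b"ID3\x02\x00\x00\x00\x00\x10BTT2")
v23 = read_id3v2_header(b"ID3\x03\x00\x00\x00\x01KTTPE1")
```

Here `v22.first_frame_id` is `TT2` and `v23.first_frame_id` is `TPE1`, which shows why the stray `B` and `KT` in the printed strings are size bytes rather than part of the frame ID.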

Main fingerprints:

Multi-Part fingerprints:

Test file:

This is a weird one I found in a corner of a drive. It would not have matched as a .koz because it's a v2.2 file, but it would equally have had only a low-confidence match as it had no tags; using the additional 3-letter frame match, you get a solid match. The output comes from my own confidence test script, which lets me easily see and test patterns. Attachment: congratulations.zip

congratulations.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.2.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x02\x00\x00\x00\x00\x10BTT2'
Hex:           4944 3302 0000 0000 1042 5454 32
String:        ID3BTT2

Alternate match #1
Format:        MPEG-1 Audio Layer 3 ID3v2.2.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x02\x00'
Hex:           4944 3302 00
String:        ID3

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

Example matches:

(01) Adamski - Killer.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01 \x1eMCDI'
Hex:           4944 3303 0000 0001 201e 4d43 4449
String:        ID3 MCDI

Alternate match #2
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #4
Format:        MPEG-1 Audio Layer 3 ID3v2.3.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #5
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

(01) Ash - Girl from Mars.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x03\x00\x00\x00\x01KTTPE1'
Hex:           4944 3303 0000 0001 4b54 5450 4531
String:        ID3KTTPE1

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #2
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #4
Format:        MPEG-1 Audio Layer 3 ID3v2.3.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #5
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

B007G7MTR2_(disc_1)_05_-_I'm_Too_Fat.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x04\x00TAG'
Hex:           4944 3304 0054 4147
String:        ID3TAG

Alternate match #1
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.4.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        10
Bytes Matched: b'ID3\x04\x00\x00\x00\x10\x08\x1ePRIV'
Hex:           4944 3304 0000 0010 081e 5052 4956
String:        IDPRIV

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #3
Format:        MPEG-1 Audio Layer 3 ID3v2.4.0 (MP3) audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x04\x00'
Hex:           4944 3304 00
String:        ID3

Alternate match #4
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3
