cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

same (mp3) file, different name ... different output: mp3 versus koz #32

Closed sanderjo closed 4 months ago

sanderjo commented 3 years ago

same (mp3) file, different name ... different output

Make a copy:

sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla

Verify it's there with the same size:

sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3

... and same contents:

sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/testblabla.bla

... but puremagic says the first one is mp3 and the second is ... koz?

sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz

Is this wanted behaviour, or a bug?

PS: Linux's file command reports both correctly as mp3:

sander@brixit:~/git/puremagic$ file  test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3:       Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo

sanderjo commented 3 years ago

Ah, thanks to @safihre

>>> import puremagic
>>> filename = "test/resources/audio/testblabla.bla"
>>> bla = puremagic.magic_file(filename)

>>> for i in bla:
...     print(i)
...
PureMagicWithConfidence(byte_match=b'ID3\x03\x00\x00\x00', offset=0, extension='.koz', mime_type='', name='Sprint Music Store audio', confidence=0.7)
PureMagicWithConfidence(byte_match=b'ID3', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) audio file', confidence=0.3)

>>> bla[0].extension
'.koz'
>>> bla[1].extension
'.mp3'

So ... puremagic thinks it's .koz with 0.7 confidence (because of the longer matching byte string?) and .mp3 with 0.3 confidence.

In the real world, I would say mp3 is much more likely than koz. So each extension would have a Real World Probability. Wild guess:

.mp3: 99%
.koz: 1%

So based on that, mp3 would be more likely for this case. So, I would need to interpret / combine the pure puremagic indication with Real World Probabilities.
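
One way to combine the two signals is to multiply each match's confidence by a per-extension prior and re-sort. The following is only a minimal sketch: the Match tuple, the rerank helper, and the fallback prior are hypothetical names and values, and the priors are the wild guesses above, not measured data.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Stand-in for the fields of a puremagic result used here."""
    extension: str
    confidence: float


# Assumed priors: the "wild guess" figures from the comment, not real statistics.
REAL_WORLD_PRIOR = {".mp3": 0.99, ".koz": 0.01}
DEFAULT_PRIOR = 0.01  # assumed fallback for extensions without an entry


def rerank(matches):
    """Sort matches by confidence * prior, highest score first."""
    def score(match):
        prior = REAL_WORLD_PRIOR.get(match.extension, DEFAULT_PRIOR)
        return match.confidence * prior
    return sorted(matches, key=score, reverse=True)


matches = [Match(".koz", 0.7), Match(".mp3", 0.3)]
print(rerank(matches)[0].extension)  # .mp3 (0.3 * 0.99 beats 0.7 * 0.01)
```

With these numbers the .mp3 match scores 0.297 against 0.007 for .koz, so the ordering flips even though puremagic's own confidence favours .koz.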

sanderjo commented 3 years ago

Real World Probability: common extensions on https://www.computerhope.com/issues/ch001789.htm

>>> mylikelyextlist = [ '3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip' ]

>>> 'mp3' in mylikelyextlist 
True
>>> 'koz' in mylikelyextlist 
False
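
A simpler alternative to numeric priors is to walk puremagic's candidates in confidence order and take the first one whose extension is on the common-extension list. This is a rough sketch under that assumption; pick_extension is a hypothetical helper, and the list passed in below is a short excerpt rather than the full list.

```python
def pick_extension(candidates, likely_extensions):
    """Return the first candidate found in likely_extensions.

    candidates is assumed to be ordered by confidence (highest first),
    e.g. [".koz", ".mp3"]. Falls back to the first candidate, or "" if
    there are no candidates at all.
    """
    likely = {ext.lower().lstrip(".") for ext in likely_extensions}
    for ext in candidates:
        if ext.lower().lstrip(".") in likely:
            return ext
    return candidates[0] if candidates else ""


likely = ["mp3", "mp4", "pdf", "zip"]  # excerpt of the common-extension list
print(pick_extension([".koz", ".mp3"], likely))  # .mp3
```

If none of the candidates is a common extension, the highest-confidence candidate is still returned, so rare formats are not lost.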

List generated like this:

sander@brixit:~$ lynx --dump 'https://www.computerhope.com/issues/ch001789.htm'  | grep "\* \." | awk -F\- '{ print $1 }' | tr -d "*" | sed -e 's/and/\n/g' | sed -e 's/or/\n/g'  | tr -d " " | sort | sed -e "s/\./','/g"  | tr -d '\n'

','3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip

cdgriffith commented 3 years ago

Sorry for the late reply; it seems I'm not getting notifications for this repo even though I'm watching it.

Just to explain the behavior you were seeing at first: you are right, koz had the higher confidence, so it won when the file extension didn't match. However, if a match agrees with both the file extension and the content, it is given the highest confidence.
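
The selection rule described here can be illustrated roughly as follows. This is not puremagic's actual implementation, only a sketch of the described behavior; Match and choose are hypothetical names.

```python
import os
from typing import NamedTuple


class Match(NamedTuple):
    """Stand-in for the fields of a puremagic result used here."""
    extension: str
    confidence: float


def choose(filename, matches):
    """Prefer a match that agrees with the file's extension.

    If no content match shares the file's extension, fall back to the
    match with the highest confidence.
    """
    file_ext = os.path.splitext(filename)[1].lower()
    for match in matches:
        if match.extension == file_ext:
            return match
    return max(matches, key=lambda match: match.confidence)


matches = [Match(".koz", 0.7), Match(".mp3", 0.3)]
print(choose("test.mp3", matches).extension)        # .mp3
print(choose("testblabla.bla", matches).extension)  # .koz
```

This reproduces the two results from the original report: the same bytes yield .mp3 when the name ends in .mp3 and .koz when it does not.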

Definitely something to consider for real-world scenarios. I may have to check how file handles cases like that.

sanderjo commented 3 years ago

Thanks for replying.

I've implemented it in SABnzbd like this:

See https://github.com/sabnzbd/sabnzbd/blob/9b870e64d252ef9b7521269844fb6250a0d5728c/sabnzbd/utils/file_extension.py#L257-L263f

NebularNerd commented 4 months ago

This affects ID3v2.3.0 files, which (sometimes) share the same header as .koz. Basically, with ID3v2 you have:

To improve confidence, we can do a couple of things:

I'm reading/playing around to see what would give the best consistent results.

NebularNerd commented 4 months ago

Adding a longer, versioned match for .mp3 and adding TAG at offset -128 gives us 80% confidence, beating .koz by 10%. This only works if the file has tags; untagged files would still match .koz.

This is from my own Python script purely for confidence testing.

(01) Adamski - Killer.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #1
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #4
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

Let's see what else we can match against in case TAG is not present. 🤔
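
The two checks discussed above (a versioned ID3 header at offset 0 and TAG at offset -128) can be sketched on raw bytes as below. This is only an illustration of the byte positions involved, not the code added to puremagic; id3_markers is a hypothetical helper.

```python
def id3_markers(data):
    """Return (has_id3v2_3_header, has_tag_trailer) for raw file bytes.

    The header check looks for b'ID3\\x03\\x00' at offset 0; the trailer
    check looks for b'TAG' at offset -128, which needs at least 128 bytes.
    """
    has_header = data[:5] == b"ID3\x03\x00"
    has_trailer = len(data) >= 128 and data[-128:][:3] == b"TAG"
    return has_header, has_trailer


# Synthetic data for illustration only, not a real MP3 file.
tagged = b"ID3\x03\x00" + b"\x00" * 200 + b"TAG" + b"\x00" * 125
untagged = b"ID3\x03\x00" + b"\x00" * 200
print(id3_markers(tagged))    # (True, True)
print(id3_markers(untagged))  # (True, False)
```

For a real file, the bytes could be read with open(path, "rb").read(). As noted above, an untagged file only satisfies the header check, so it would still need another signal to beat .koz.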

cdgriffith commented 4 months ago

Updated in 1.23, thanks @NebularNerd !