Closed sanderjo closed 4 months ago
Ah, thanks to @safihre
>>> import puremagic
>>> filename = "test/resources/audio/testblabla.bla"
>>> bla = puremagic.magic_file(filename)
>>> for i in bla:
...     print(i)
...
PureMagicWithConfidence(byte_match=b'ID3\x03\x00\x00\x00', offset=0, extension='.koz', mime_type='', name='Sprint Music Store audio', confidence=0.7)
PureMagicWithConfidence(byte_match=b'ID3', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) audio file', confidence=0.3)
>>> bla[0].extension
'.koz'
>>> bla[1].extension
'.mp3'
So ... puremagic thinks (0.7 probability) it's .koz (because of the longer matching bytestring?), and with 0.3 probability it's .mp3
In the real world, I would say mp3 is much more likely than koz. So each extension would have a Real World Probability. Wild guess:
.mp3: 99% .koz: 1%
So based on that, mp3 would be more likely for this case. So, I would need to interpret / combine the pure puremagic indication with Real World Probabilities.
Real World Probability: common extensions on https://www.computerhope.com/issues/ch001789.htm
>>> mylikelyextlist = [ '3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip' ]
>>> 'mp3' in mylikelyextlist
True
>>> 'koz' in mylikelyextlist
False
List generated like this:
sander@brixit:~$ lynx --dump 'https://www.computerhope.com/issues/ch001789.htm' | grep "\* \." | awk -F\- '{ print $1 }' | tr -d "*" | sed -e 's/and/\n/g' | sed -e 's/or/\n/g' | tr -d " " | sort | sed -e "s/\./','/g" | tr -d '\n'
','3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip
Sorry for the late reply; it seems I'm not getting notifications for this repo even though I've watched it.
Just to explain the behavior you were seeing at first: you are right, koz had the higher confidence, so it was winning when the file extension didn't match. However, if a match agrees on both file extension and content, it is given the highest confidence.
Definitely something to consider for real-world scenarios. May have to check and see how file handles stuff like that.
Thanks for replying.
I've implemented it in SABnzbd like this:
This affects ID3v2.3.0 version files which share the same header (sometimes) as .koz. Basically, with ID3v2 you have:
- ID3 for first three bytes
- 0x0300 in the case of the example above

To improve confidence, we can do a couple of things:
- .pcx files on #50, this would obviously increase the number of entries as you would have to account for all variants of the flags
- TAG header in the last 128 bytes for old v1.1 tags
- P at bytes 10 or 11

I'm reading/playing around to see what would give the best consistent results.
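The two markers mentioned above (the ID3 header at the start of the file and TAG in the last 128 bytes) can be checked by hand with a few lines of Python. This is a rough sketch for experimenting, not puremagic's matching code; `id3_hints` is a hypothetical helper name and the sample bytes are synthetic rather than taken from a real file.

```python
def id3_hints(data: bytes) -> dict:
    """Report which ID3 markers are present in a file's raw bytes."""
    hints = {"id3v2_version": None, "id3v1_tag": False}

    # ID3v2: "ID3" followed by a major-version byte and a revision byte,
    # e.g. b"ID3\x03\x00" for ID3v2.3.0.
    if len(data) >= 5 and data[:3] == b"ID3":
        hints["id3v2_version"] = (data[3], data[4])

    # ID3v1: a fixed 128-byte block at the end of the file starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        hints["id3v1_tag"] = True

    return hints


# Synthetic sample: ID3v2.3.0 header, padding, then a 128-byte ID3v1 block.
sample = b"ID3\x03\x00\x00\x00" + b"\x00" * 200 + b"TAG" + b"\x00" * 125
print(id3_hints(sample))  # {'id3v2_version': (3, 0), 'id3v1_tag': True}
```

For a real file you would read the bytes with `open(path, "rb").read()` (or seek to the last 128 bytes for large files) and pass them in.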
Adding a longer versioned match for .mp3 and adding TAG at -128 gives us 80% confidence, beating .koz by 10%. This only works if the file has tags; untagged files would still match .koz.
This is from my own Python script purely for confidence testing.
(01) Adamski - Killer.mp3
Most likely match:
Format: MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence: 80.0%
Extension: .mp3
MIME: audio/mpeg
Offset: -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex: 4944 3303 0054 4147
String: ID3TAG
Alternate match #1
Format: Sprint Music Store audio
Confidence: 70.0%
Extension: .koz
MIME:
Offset: 0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex: 4944 3303 0000 00
String: ID3
Alternate match #2
Format: MPEG-1 Audio Layer 3 (MP3) audio file
Confidence: 60.0%
Extension: .mp3
MIME: audio/mpeg
Offset: -128
Bytes Matched: b'ID3TAG'
Hex: 4944 3354 4147
String: ID3TAG
Alternate match #3
Format: MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence: 50.0%
Extension: .mp3
MIME: audio/mpeg
Offset: 0
Bytes Matched: b'ID3\x03\x00'
Hex: 4944 3303 00
String: ID3
Alternate match #4
Format: MPEG-1 Audio Layer 3 (MP3) audio file
Confidence: 30.0%
Extension: .mp3
MIME: audio/mpeg
Offset: 0
Bytes Matched: b'ID3'
Hex: 4944 33
String: ID3
Let's see what else we can match against in case TAG is not present. 🤔
Updated in 1.23, thanks @NebularNerd !
same (mp3) file, different name ... different output
Make a copy:
sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
Verify it's there with same size:
... and same contents:
... but puremagic says the first one is mp3 and the second is ... koz?
Is this wanted behaviour, or a bug?
PS: Linux' file reports it correctly as mp3: