fft001 closed this issue 10 months ago
This is due to how I configured confidence levels internally, which is a bit buggy in this case. It basically boils down to this: from_file looks at both the magic number and the file extension, but from_string can only look at the magic number.
In this case, a lot of file types match on a signature of the same length, so they all end up with the same confidence:
import puremagic
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.magic_string(f.read()))
[
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
PureMagicWithConfidence(byte_match=b'WEBP', offset=8, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4)]
That is because if we read the file itself, we can see it starts out like:
b'RIFF$\x00\x00\x00WEBPVP8
So a lot of types match on RIFF. That is good and needed: if a type has no more specific signature but the file extension matches, someone using from_file will still get the right type.
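To make the layout of those leading bytes concrete, here is a minimal sketch using only the standard library. It relies on the general RIFF container layout (a 4-byte "RIFF" tag, a 4-byte little-endian size, then a 4-byte form type); parse_riff_header is a hypothetical helper written for illustration, not part of puremagic.

```python
import struct


def parse_riff_header(data: bytes):
    """Split the first 12 bytes of a RIFF file into (magic, size, form_type).

    Hypothetical helper for illustration only; not a puremagic function.
    """
    if len(data) < 12:
        raise ValueError("need at least 12 bytes to read a RIFF header")
    # "<" = little-endian, no padding: 4-byte tag, uint32 size, 4-byte form type
    return struct.unpack("<4sI4s", data[:12])


# The leading bytes of the test file as quoted in this thread.
header = b"RIFF$\x00\x00\x00WEBPVP8"
magic, size, form_type = parse_riff_header(header)
print(magic, size, form_type)  # b'RIFF' 36 b'WEBP'
```

The first field is shared by every RIFF-based format (AVI, WAV, ANI and so on), which is why so many entries match at offset 0. Only the form type at offset 8 tells them apart.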
However, as is obvious to us humans, WEBP is a much more specific match and should be weighted higher. I do not have that logic currently. So instead, for now, I am just going to add longer signatures for WebP to boost the confidence, so that running it again returns:
[PureMagicWithConfidence(byte_match=b'RIFF$\x00\x00\x00WEBP', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.8),
PureMagicWithConfidence(byte_match=b'RIFF$\x00\x00\x00WEBPVP8', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP VP8', confidence=0.8),
...
The boost comes from adding these two lines to magic_data.json:
["524946462400000057454250", 0, ".webp", "image/webp", "RIFF WebP"],
["524946462400000057454250565038", 0, ".webp", "image/webp", "RIFF WebP VP8"],
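As a quick sanity check, the two hex strings can be decoded with the standard library to confirm they correspond to the byte_match values shown earlier. The second half of the sketch uses a made-up header with a different size field (an assumption for illustration) to show what the longer signature does and does not cover.

```python
# Decode the two hex signatures quoted from magic_data.json.
sig_webp = bytes.fromhex("524946462400000057454250")
sig_vp8 = bytes.fromhex("524946462400000057454250565038")
print(sig_webp)  # b'RIFF$\x00\x00\x00WEBP'
print(sig_vp8)   # b'RIFF$\x00\x00\x00WEBPVP8'

# Bytes 4..7 of a RIFF file are the little-endian size field, so these
# signatures include the size value 0x24 (36) of this particular file.
# Hypothetical header of a WebP file with a different size, for comparison:
other = b"RIFF" + (0x1234).to_bytes(4, "little") + b"WEBP"
print(other.startswith(sig_webp))  # False
print(other[8:12] == b"WEBP")      # True
```

Judging only from the two entries quoted here, a WebP file with a different size field would not hit the longer signatures and would presumably fall back to the 4-byte WEBP match at offset 8 that appears in the earlier output.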
With those entries in place, from_string now reports the right MIME type:
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))
image/webp
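For the weighting logic that this reply says does not exist yet, one possible approach is sketched below. This is purely an assumption about how it could work, not how puremagic ranks results: candidates with equal confidence are tie-broken by how far into the file their signature reaches (offset plus length), so a form-type match at offset 8 beats a bare RIFF match at offset 0. Candidate and rank are hypothetical names.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    # Hypothetical stand-in mirroring a few fields of the results shown above.
    byte_match: bytes
    offset: int
    extension: str
    confidence: float


def rank(candidates):
    """Sort by confidence, then by how deep into the file the signature reaches."""
    return sorted(
        candidates,
        key=lambda c: (c.confidence, c.offset + len(c.byte_match)),
        reverse=True,
    )


candidates = [
    Candidate(b"RIFF", 0, ".avi", 0.4),
    Candidate(b"RIFF", 0, ".wav", 0.4),
    Candidate(b"WEBP", 8, ".webp", 0.4),
]
best = rank(candidates)[0]
print(best.extension)  # .webp
```

The tie-break is only a heuristic. It happens to work for RIFF containers, where the distinguishing tag sits after the shared prefix, but it would need testing against the rest of the signature table before being trusted.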
Added better checks for WebP in release 1.20: https://github.com/cdgriffith/puremagic/releases/tag/1.20
Hello,
I encountered a discrepancy when running a test with the following code:
In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions. I am uncertain whether this difference in behavior is intentional.
Thank you!