cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

Webp image mime type is empty #44

Closed fft001 closed 10 months ago

fft001 commented 1 year ago

Hello,

I encountered a discrepancy when running a test with the following code:

import puremagic

print(puremagic.from_file("test/resources/images/test.webp", mime=True)) # prints "image/webp"
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))  # prints ""

In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions.

I am uncertain whether this difference in behavior is intentional.

Thank you!

cdgriffith commented 1 year ago

This is due to how I configured confidence levels internally, which is a bit buggy in this case. It boils down to this: from_file looks at both the magic number and the file extension, but from_string can only look at the magic number.
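To illustrate the difference, here is a toy model of the tie-break. It is not puremagic's actual code: the `Candidate` tuple and the `pick` helper are hypothetical, and resolving ties to the first candidate in the list is an assumption made for the sketch. It only shows how an extension hint can choose among signatures that share the same confidence.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical stand-in for one signature match."""
    extension: str
    mime_type: str
    confidence: float


def pick(candidates: Sequence[Candidate], ext_hint: Optional[str] = None) -> Candidate:
    """Return the best candidate, preferring ones whose extension matches the hint."""
    if ext_hint is not None:
        hinted = [c for c in candidates if c.extension == ext_hint]
        if hinted:
            candidates = hinted
    # max() returns the first item among equals, so ties fall back to list order.
    return max(candidates, key=lambda c: c.confidence)


# Sample values copied from matches reported in this thread.
candidates = [
    Candidate(".4xm", "", 0.4),
    Candidate(".wav", "audio/wav", 0.4),
    Candidate(".webp", "image/webp", 0.4),
]

print(repr(pick(candidates).mime_type))     # bytes only, no hint
print(pick(candidates, ".webp").mime_type)  # with an extension hint
```

Without a hint, the first tied candidate wins in this model, and its MIME type is empty. With the `.webp` hint, the WebP entry is selected.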

In this case a lot of file types match with byte strings of the same length, so they all end up with the same confidence:

with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.magic_string(f.read()))

[
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4), 
PureMagicWithConfidence(byte_match=b'WEBP', offset=8, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4)]

That is because if we read the file itself, we can see it starts out like:

b'RIFF$\x00\x00\x00WEBPVP8

So a lot of types match on RIFF, which is good and needed: if a file has no more specific match but does have a matching file extension, someone using from_file will still get the right type.
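The header shown above follows the RIFF layout: a 4-byte `RIFF` tag, a 4-byte little-endian size, then a 4-byte form type such as `WEBP`. Below is a minimal sketch of reading that form type directly, independent of puremagic. The `riff_form_type` helper and its small MIME table are hypothetical; the MIME strings are the ones reported in the output above.

```python
import struct
from typing import Optional, Tuple

# Form type -> MIME type; a deliberately tiny, illustrative table.
FORM_MIME = {
    b"WEBP": "image/webp",
    b"WAVE": "audio/wav",
    b"AVI ": "video/avi",
}


def riff_form_type(data: bytes) -> Optional[Tuple[bytes, int]]:
    """Return (form_type, declared_size) for a RIFF header, or None if not RIFF."""
    if len(data) < 12 or data[:4] != b"RIFF":
        return None
    (size,) = struct.unpack_from("<I", data, 4)  # little-endian uint32 at offset 4
    return data[8:12], size


header = b"RIFF$\x00\x00\x00WEBPVP8"
form, size = riff_form_type(header)
print(form, size, FORM_MIME.get(form, ""))
```

For the sample header this reports the form type `WEBP` and a declared size of 36 (the `$` byte is 0x24).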

However, as is obvious to us humans, WEBP is a much more specific match and should be weighted higher. I do not have that logic currently. So for now I am just going to add longer strings for WebP to boost the confidence, so that running it again returns:

[PureMagicWithConfidence(byte_match=b'RIFF$\x00\x00\x00WEBP', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.8), 
PureMagicWithConfidence(byte_match=b'RIFF$\x00\x00\x00WEBPVP8', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP VP8', confidence=0.8), 
...

This is done by adding these two lines to magic_data.json:

    ["524946462400000057454250", 0, ".webp", "image/webp", "RIFF WebP"],
    ["524946462400000057454250565038", 0, ".webp", "image/webp", "RIFF WebP VP8"],
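Decoding the hex strings shows what these two entries match. This is a quick check using only the standard library. Note that bytes 4 to 7 are the RIFF size field, so as written the signatures appear to match only files whose header declares a size of 36, like the test file.

```python
signatures = [
    "524946462400000057454250",
    "524946462400000057454250565038",
]

for hex_sig in signatures:
    raw = bytes.fromhex(hex_sig)
    # Bytes 4..7 of a RIFF header hold the little-endian size field.
    declared_size = int.from_bytes(raw[4:8], "little")
    print(raw, declared_size)
```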

And therefore:

with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))

image/webp

cdgriffith commented 10 months ago

Added better checks for webp https://github.com/cdgriffith/puremagic/releases/tag/1.20