cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

`.wav` files detected as `audio/wave` when maybe they should be `audio/wav` #104

Open simonw opened 5 hours ago

simonw commented 5 hours ago

As far as I can tell, the "correct" type to return for a .wav file (one starting with the bytes 52 49 46 46 xx xx xx xx 57 41 56 45 66 6d 74 20) is audio/wav, but this library returns audio/wave.
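For reference, those bytes spell `RIFF`, a 4-byte chunk size, then `WAVEfmt `. Here is a minimal sketch of that check using only the standard library. The function name `looks_like_wav` is made up for illustration and is not part of puremagic. It also assumes the `fmt ` chunk immediately follows `WAVE`, which is the usual layout but may not hold for every file.

```python
def looks_like_wav(header: bytes) -> bool:
    """Return True if the header matches RIFF <size> WAVE 'fmt '.

    Hypothetical helper, not puremagic's API.
    """
    return (
        len(header) >= 16
        and header[0:4] == b"RIFF"        # 52 49 46 46
        # bytes 4..7 are the little-endian chunk size (the "xx xx xx xx")
        and header[8:12] == b"WAVE"       # 57 41 56 45
        and header[12:16] == b"fmt "      # 66 6d 74 20
    )


# First 16 bytes of a WAV file; the size field here is an arbitrary example value.
sample = b"RIFFH\xe0\x02\x00WAVEfmt "
print(looks_like_wav(sample))
print(looks_like_wav(b"RIFF\x00\x00\x00\x00WEBPVP8 "))
```

The second call shows why matching on `RIFF` alone is not enough: WebP, AVI and several other formats share the same first four bytes.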

I got very confused looking through the code because I came across these two lines:

https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L103
https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L1118

I've found it hard to research the correct resolution though, as both audio/wav and audio/wave are entirely missing from what I thought was the official registry for these! https://www.iana.org/assignments/media-types/media-types.xhtml#audio

MDN lists audio/wav: https://developer.mozilla.org/en-US/docs/Web/HTTP/MIME_types/Common_types

I'm not sure there is a correct answer to this question.

simonw commented 5 hours ago

Tried this:

python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav

And got:

[PureMagicWithConfidence(byte_match=b'RIFFH\xe0\x02\x00WAVE', offset=8, extension='.wav', mime_type='audio/wave', name='Waveform Audio File Format', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'WAVEfmt ', offset=8, extension='.wav', mime_type='audio/x-wav', name='Windows audio file ', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'WAVE', offset=8, extension='.wav', mime_type='audio/x-wav', name='WAV audio', confidence=0.4)]
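
The output above contains three spellings for the same format (audio/wave, audio/x-wav, audio/wav). Until there is a settled answer, one possible workaround is to map aliases to a single spelling on the caller's side. The sketch below assumes only that each result has `mime_type` and `confidence` attributes, as shown in the output. The alias table and the `best_mime` helper are hypothetical, and treating audio/wav as canonical is exactly the open question here, not an established fact.

```python
from collections import namedtuple

# Hypothetical alias table; which spelling is "canonical" is an assumption.
WAV_ALIASES = {
    "audio/wave": "audio/wav",
    "audio/x-wav": "audio/wav",
}


def best_mime(matches, aliases=WAV_ALIASES):
    """Return the alias-mapped MIME type of the highest-confidence match
    that has a non-empty mime_type, or None if there is none."""
    # sorted() is stable, so ties keep their original order.
    for match in sorted(matches, key=lambda m: m.confidence, reverse=True):
        if match.mime_type:
            return aliases.get(match.mime_type, match.mime_type)
    return None


# Stand-in for the result objects, so the sketch runs without puremagic.
Match = namedtuple("Match", "mime_type confidence")
matches = [
    Match("audio/wave", 0.8),
    Match("audio/x-wav", 0.8),
    Match("", 0.4),
    Match("audio/wav", 0.4),
]
print(best_mime(matches))  # audio/wav
```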

simonw commented 3 hours ago

I had a similar issue on llm-gemini where puremagic was returning audio/mpeg for MP3 files but the Gemini AI wanted audio/mp3.

It turned out in that case puremagic was correct and Gemini was wrong - the official mimetype for MP3 is indeed audio/mpeg.
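
When a consuming service insists on a non-standard name like that, the translation can live at the call site rather than in the detection library. This is a rough sketch under that assumption. The `OVERRIDES` table and `mime_for_api` are hypothetical names, and the single entry reflects only the audio/mp3 case described above.

```python
# Hypothetical per-service overrides, maintained by hand by the caller.
OVERRIDES = {
    "audio/mpeg": "audio/mp3",  # the service-specific spelling mentioned above
}


def mime_for_api(detected: str, overrides=OVERRIDES) -> str:
    """Translate a detected MIME type into the spelling a service expects,
    passing through anything without an override."""
    return overrides.get(detected, detected)


print(mime_for_api("audio/mpeg"))  # audio/mp3
print(mime_for_api("audio/wav"))   # audio/wav
```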