ahupp / python-magic

A python wrapper for libmagic

magic.from_buffer and magic.from_file give different outputs. #185

Closed mandy13 closed 5 years ago

mandy13 commented 5 years ago

Hi,

I have been using the magic library to detect MIME types and found that I get two different results for the same file when using from_buffer and from_file. Below is the Python snippet I tried.

```
>>> import magic
>>> magic.Magic(mime=True, mime_encoding=True).from_buffer(open("Downloads/Document-magic.docx","r").read(1024))
'application/zip; charset=binary'
>>> magic.Magic(mime=True, mime_encoding=True).from_file("Downloads/Document-magic.docx")
'application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary'
```

Python version: 2.7.12.

Does the magic library support detecting MIME types for files created on OneDrive or Google Docs?

Thanks in advance

whyman commented 5 years ago

DOCX has an outer zip container, so perhaps you are not reading enough bytes (only 1 KB) for it to see the inner content, whereas the file-based option can read as much as it needs?
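
To illustrate the point about the outer zip container, here is a minimal sketch using only the standard library. It builds an in-memory ZIP that is loosely shaped like a DOCX and checks what a 1024-byte prefix contains. The member contents and the 2000-byte filler are hypothetical, and the thread does not show libmagic's exact rules; the assumption is only that a detector needs to see an inner member name such as `word/document.xml` to tell a DOCX from a plain ZIP.

```python
import io
import zipfile


def build_docx_like_zip(first_member_size=2000):
    """Build an in-memory ZIP loosely shaped like a DOCX (contents are made up)."""
    buf = io.BytesIO()
    # ZIP_STORED keeps the data uncompressed so the offsets are easy to reason about.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("[Content_Types].xml", b"x" * first_member_size)
        zf.writestr("word/document.xml", b"<w:document/>")
    return buf.getvalue()


data = build_docx_like_zip()
head = data[:1024]  # same prefix size as read(1024) in the report

# The ZIP local-file-header signature sits at the very start of the data.
starts_with_zip_signature = head.startswith(b"PK\x03\x04")
# The DOCX-specific member name only appears after the first member's data.
word_entry_in_head = b"word/document.xml" in head
word_entry_in_full = b"word/document.xml" in data
```

In this sketch the prefix is enough to recognize a ZIP file, but the `word/` member name lies beyond the first 1024 bytes. In a real DOCX the members are usually compressed and their sizes vary, so the number of bytes needed differs from file to file.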

mandy13 commented 5 years ago

Yeah, right @v00d00. I read 2 KB instead of 1 KB and got the same output as from_file.

Any idea on a generic number of bytes to read so that it does not fail for any type of file?

Thanks.

whyman commented 5 years ago

Some files have their "signature" bytes at the end of the file, so the most reliable way would be the entire file, at least in my experience.

But as that may be impractical, the more bytes the better.
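
One way to act on this advice, as a rough sketch: read the file in binary mode and hand from_buffer either the whole content or as large a prefix as is practical. The helper name `read_for_detection` and the 1 MiB default cap are assumptions made for illustration, not values recommended by python-magic or libmagic.

```python
import os
import tempfile


def read_for_detection(path, max_bytes=1024 * 1024):
    """Return up to max_bytes from the start of path, or all of it if max_bytes is None."""
    with open(path, "rb") as fh:  # binary mode, so the bytes are not decoded or altered
        if max_bytes is None:
            return fh.read()
        return fh.read(max_bytes)


# Small demonstration with a throwaway 5000-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 5000)
try:
    prefix = read_for_detection(tmp.name, max_bytes=1024)
    whole = read_for_detection(tmp.name, max_bytes=None)
finally:
    os.remove(tmp.name)
```

The returned bytes could then be passed to `magic.Magic(mime=True, mime_encoding=True).from_buffer(...)`. A capped prefix will still miss formats whose signature sits at the end of the file, which is why reading everything is the most reliable option.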

ahupp commented 5 years ago

If you have an actual file on disk I think using from_file is preferable, for this reason. But I'm not actually aware of any general guidelines for how much from_buffer needs; I think in many cases it's just a few bytes. Sounds like the question is answered so I'll close.
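
A minimal sketch of that preference: use from_file when the input is a path on disk, and fall back to from_buffer only for data that is already in memory. The function name `detect_mime` is hypothetical, and `detector` is assumed to be an object exposing `from_file()` and `from_buffer()`, as the `magic.Magic` instance in the snippet at the top of this thread does.

```python
import os


def detect_mime(detector, source):
    """Prefer from_file for an existing path; use from_buffer for in-memory bytes.

    `detector` is assumed to provide from_file() and from_buffer(),
    for example magic.Magic(mime=True, mime_encoding=True).
    """
    if isinstance(source, (bytes, bytearray)):
        # Data is already in memory, so there is no file for libmagic to open.
        return detector.from_buffer(bytes(source))
    if os.path.isfile(source):
        # A real file on disk: the library can read as much as it needs.
        return detector.from_file(source)
    raise ValueError("source must be bytes or a path to an existing file")
```

With python-magic installed, a call might look like `detect_mime(magic.Magic(mime=True, mime_encoding=True), "Downloads/Document-magic.docx")`.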