algoo / preview-generator

generates previews of files with cache management
https://pypi.org/project/preview-generator/
MIT License

Better mimetype handling: tweak or use shared-mime-info lib? #142

Open inkhey opened 4 years ago

inkhey commented 4 years ago

Mimetype guessing is complex. I suggest adding a "mimetype guesser" class usable outside of the preview-generator code (perhaps as a separate project?).

I suggest adding support for two mimetype guessers:

Original issue text:

The current code uses 3 mechanisms to detect the mimetype:

  • mimetypes.guess_type
  • magic.Magic
  • mimetype command

In many cases, the results differ. mimetypes.guess_type can return ambiguous results such as "text/xml" or "application/octet-stream". Both magic and the mimetype command are better at distinguishing complex cases, such as office files, but they may return different results from each other.

The current code is complex, as it tries different methods in different cases:

        first_path = file_path + file_ext if file_ext else file_path
        str_, encoding = mimetypes.guess_type(first_path, strict=False)

        if not str_ or str_ == "application/octet-stream":
            mime = magic.Magic(mime=True)
            str_ = mime.from_file(file_path)

        if str_ and (str_ in AMBIGUOUS_MIMES):
            raw_mime = Popen(
                ["mimetype", "--output-format", "%m", file_path],
                stdin=PIPE,
                stdout=PIPE,
                stderr=PIPE,
            ).communicate()[0]
            str_ = raw_mime.decode("utf-8").replace("\n", "")

In a complex case, such as an ogg video without an explicit extension in its path, we get 3 different results:

  • audio/ogg by mimetypes.guess_type
  • video/ogg by magic.Magic
  • video/x-theora+ogg by mimetype command

The first result is incorrect; both others are acceptable. The current code makes it difficult to pick the best match. One solution may be to run all 3 mechanisms in every case and decide afterwards which result to keep, with a hardcoded preference for some patterns or mimetypes.
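To make the "run everything, then decide" idea concrete, here is a minimal sketch rather than a proposal for the actual implementation. The names `GENERIC_MIMES`, `PREFERENCE`, `collect_guesses` and `pick_best` are hypothetical, and the preference order (mimetype command first, then magic, then mimetypes) is only an assumption based on the ogg example above. The magic and mimetype calls mirror the ones in the current code.

```python
import mimetypes
import subprocess
from typing import Callable, Dict, Optional

# Hypothetical: answers considered too vague to be trusted on their own.
GENERIC_MIMES = {"application/octet-stream", "text/plain", "text/xml"}

# Hypothetical preference order, most trusted mechanism first.
PREFERENCE = ["mimetype_command", "magic", "mimetypes"]

Guesser = Callable[[str], Optional[str]]


def guess_with_mimetypes(file_path: str) -> Optional[str]:
    # Extension-based guess from the standard library.
    mime, _encoding = mimetypes.guess_type(file_path, strict=False)
    return mime


def guess_with_magic(file_path: str) -> Optional[str]:
    # Content-based guess; needs the third-party python-magic package.
    import magic

    return magic.Magic(mime=True).from_file(file_path)


def guess_with_mimetype_command(file_path: str) -> Optional[str]:
    # Same command line as the current code; needs the "mimetype" tool.
    completed = subprocess.run(
        ["mimetype", "--output-format", "%m", file_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip() or None


def collect_guesses(
    file_path: str, guessers: Dict[str, Guesser]
) -> Dict[str, Optional[str]]:
    # Run every guesser; one that fails (missing tool, missing file,
    # missing package) simply contributes no answer.
    results: Dict[str, Optional[str]] = {}
    for name, guesser in guessers.items():
        try:
            results[name] = guesser(file_path)
        except Exception:
            results[name] = None
    return results


def pick_best(results: Dict[str, Optional[str]]) -> Optional[str]:
    # First pass: the most trusted mechanism that gave a specific answer.
    for name in PREFERENCE:
        mime = results.get(name)
        if mime and mime not in GENERIC_MIMES:
            return mime
    # Second pass: accept a generic answer rather than nothing.
    for name in PREFERENCE:
        mime = results.get(name)
        if mime:
            return mime
    return None
```

With the three ogg results listed above, `pick_best` would keep video/x-theora+ogg. Keeping the decision in one pure function means the hardcoded preferences can be changed and tested without touching the detection code.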

inkhey commented 4 years ago

Note: it's possible to manually add new types to the default mimetype list by using a specific mimetype datastore. This lets us easily guess the extension from the mimetype, or the mimetype from the extension:

>>> import mimetypes
>>> mimetypes_data = mimetypes.MimeTypes()
>>> mimetypes_data.add_type('image/x-sony-arw', 'arw')
>>> mimetypes_data.guess_extension('image/x-sony-arw')
'arw'

This may be useful in builders, if we need to obtain the extension for a file whose mimetype is known but whose path has no file extension (currently builders have no "file_extension" parameter, only a file_path, which needs to exist). This would allow specific behavior for builders that really need an explicit mimetype. For example, with the imagemagick builder, if we get 'arw' from the mimetype 'image/x-sony-arw', we can run:

convert arw:DSC08523 -layers merge test.jpg
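As a rough illustration of how a builder could derive that command from a known mimetype, here is a small sketch. `build_convert_args` is a hypothetical helper, not part of preview-generator. It also assumes the extension is registered with a leading dot (the convention the standard library uses for its own entries), and strips the dot to get the ImageMagick format prefix.

```python
import mimetypes
from typing import List


def build_convert_args(
    file_path: str,
    mime: str,
    output_path: str,
    registry: mimetypes.MimeTypes,
) -> List[str]:
    # Look up the extension registered for this mimetype, if any.
    ext = registry.guess_extension(mime)
    # ImageMagick accepts "format:path" to force the input format.
    prefix = ext.lstrip(".") if ext else ""
    source = f"{prefix}:{file_path}" if prefix else file_path
    return ["convert", source, "-layers", "merge", output_path]


# Private datastore, so the module-level defaults stay untouched.
registry = mimetypes.MimeTypes()
registry.add_type("image/x-sony-arw", ".arw")

args = build_convert_args("DSC08523", "image/x-sony-arw", "test.jpg", registry)
```

When the mimetype is unknown to the datastore, the helper falls back to the bare path and lets ImageMagick detect the format itself.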