algoo / preview-generator

generates previews of files with cache management
https://pypi.org/project/preview-generator/
MIT License

get_jpeg_preview fails with docx file when file extension is omitted #183

Open iqbalaydrus opened 4 years ago

iqbalaydrus commented 4 years ago

As the title suggests, get_jpeg_preview fails when the .docx extension is omitted from the filename:

>>> from preview_generator.manager import PreviewManager
>>> manager = PreviewManager('/tmp')
>>> manager.get_jpeg_preview('/test', height=1920)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/preview_generator/manager.py", line 160, in get_jpeg_preview
    file_path = self.get_pdf_preview(file_path=file_path, force=force)
  File "/usr/local/lib/python3.8/site-packages/preview_generator/manager.py", line 201, in get_pdf_preview
    builder.build_pdf_preview(
  File "/usr/local/lib/python3.8/site-packages/preview_generator/preview/builder/document_generic.py", line 142, in build_pdf_preview
    self._convert_to_pdf(
  File "/usr/local/lib/python3.8/site-packages/preview_generator/preview/builder/office__libreoffice.py", line 65, in _convert_to_pdf
    return convert_office_document_to_pdf(
  File "/usr/local/lib/python3.8/site-packages/preview_generator/preview/builder/office__libreoffice.py", line 84, in convert_office_document_to_pdf
    raise InputExtensionNotFound("unable to found input extension from mimetype")  # nopep8
preview_generator.exception.InputExtensionNotFound: unable to found input extension from mimetype

It works fine when the extension is included:

>>> from preview_generator.manager import PreviewManager
>>> manager = PreviewManager('/tmp')
>>> manager.get_jpeg_preview('/test.docx', height=1920)
'/tmp/c19cf2e7dc39589d303b31780cf3b98e-1920x1920.jpeg'

After some digging, this looks like a bug in the mimetypes package. A MimeTypes object fails to find the docx extension:

>>> import mimetypes
>>> a = mimetypes.MimeTypes()
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
>>> 

The module-level function, however, does find the extension:

>>> import mimetypes
>>> mimetypes.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
'.docx'
>>>

A dirty patch works if we register the type manually before creating the manager:

>>> from preview_generator.extension import mimetypes_storage
>>> mimetypes_storage.add_type('application/vnd.openxmlformats-officedocument.wordprocessingml.document', '.docx')
>>> from preview_generator.manager import PreviewManager
>>> manager = PreviewManager('/tmp')
>>> manager.get_jpeg_preview('/test', height=1920)
'/tmp/d4690b7c15940c468240222261d03db6-1920x1920.jpeg'
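
The same workaround can be wrapped in a small helper. This is only a sketch: `register_missing_types` and `OFFICE_TYPES` are hypothetical names, and the .pptx entry is an assumption added for illustration (only .docx and .xlsx come up in this thread). It works on any `mimetypes.MimeTypes` instance and only adds a type when the instance cannot already resolve it.

```python
import mimetypes

# Hypothetical table of Office mimetypes to check (the .pptx entry is an assumption).
OFFICE_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
}


def register_missing_types(db, type_table=OFFICE_TYPES):
    """Add every mimetype from type_table that db cannot map to an extension.

    Returns the list of mimetypes that had to be added.
    """
    added = []
    for mimetype, extension in type_table.items():
        if db.guess_extension(mimetype) is None:
            db.add_type(mimetype, extension)
            added.append(mimetype)
    return added


db = mimetypes.MimeTypes()
print("added:", register_missing_types(db))
print(db.guess_extension(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
))
```

With preview-generator, the object to pass would presumably be `mimetypes_storage` from `preview_generator.extension`, assuming it behaves like a `MimeTypes` instance, as the `add_type` call above suggests.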

I'm using Python 3.8 and installed preview-generator 0.13 through pip.

buxx commented 4 years ago

Hello iqbalaydrus, we will take a look! Thanks for the report.

inkhey commented 4 years ago

Hello @iqbalaydrus, thanks for reporting this issue.

This is an interesting case. I cannot reproduce it on my own computer: the image is built properly and mimetypes works as expected:

>>> import mimetypes
>>> a = mimetypes.MimeTypes()
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
'.docx'

The mimetypes module gets its types from several sources: it knows some mimetypes itself, but it also reads system files to get more information. As far as I can tell, your OS does not provide as much mimetype information as mine. You can check these files: https://github.com/python/cpython/blob/master/Lib/mimetypes.py#L42 Can you tell me which OS you are using? That will make it easier to know on which systems the issue happens.

We do have a mechanism in preview-generator to force some mimetypes in a builder, using "MimetypeMapping", but it is meant for edge cases and it overrides the default system mimetype. This patch should fix the issue using MimetypeMapping:

diff --git a/preview_generator/preview/builder/office__libreoffice.py b/preview_generator/preview/builder/office__libreoffice.py
index d26183d..775bd76 100644
--- a/preview_generator/preview/builder/office__libreoffice.py
+++ b/preview_generator/preview/builder/office__libreoffice.py
@@ -38,7 +38,10 @@ class OfficePreviewBuilderLibreoffice(DocumentPreviewBuilder):
         return [
             MimetypeMapping(
                 "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx"
-            )
+            ),
+            MimetypeMapping(
+                "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx"
+            ),
         ]

     @classmethod

@iqbalaydrus Can you check your OS mimetypes files to see whether other file extensions you need could trigger the same issue?

iqbalaydrus commented 4 years ago

@inkhey thank you for the response.

I am using the python:3.8 Docker image; you can reproduce it like so:

iqbal@iqbalaydrus ~ % docker run -it python:3.8 python
Python 3.8.2 (default, Feb 26 2020, 14:58:38) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> a = mimetypes.MimeTypes()
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
>>> 

It also fails on the latest macOS:

iqbal@iqbalaydrus ~ % python3
Python 3.8.2 (default, Mar 11 2020, 00:29:50) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> a = mimetypes.MimeTypes()
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
>>> 

On both systems, the docx mimetype exists in the OS files. Python Docker image:

iqbal@iqbalaydrus ~ % docker run -it python:3.8 python
Python 3.8.2 (default, Feb 26 2020, 14:58:38) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mimetypes import knownfiles
>>> import os
>>> [d for d in knownfiles if os.path.exists(d)]
['/etc/mime.types']
>>> [d for d in open('/etc/mime.types').readlines() if d.startswith('application/vnd.openxmlformats-officedocument.wordprocessingml.document')]
['application/vnd.openxmlformats-officedocument.wordprocessingml.document\t\tdocx\n']
>>> 

macOS:

iqbal@iqbalaydrus ~ % python3
Python 3.8.2 (default, Mar 11 2020, 00:29:50) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from mimetypes import knownfiles
>>> import os
>>> [d for d in knownfiles if os.path.exists(d)]
['/etc/apache2/mime.types']
>>> [d for d in open('/etc/apache2/mime.types').readlines() if d.startswith('application/vnd.openxmlformats-officedocument.wordprocessingml.document')]
['application/vnd.openxmlformats-officedocument.wordprocessingml.document\tdocx\n']
>>>

After more digging, I can mostly confirm that the fault lies in Python's mimetypes package. The MimeTypes class only loads its hardcoded default types_map and does not include the OS-provided mimetypes. Its __init__ method:

    def __init__(self, filenames=(), strict=True):
        if not inited:
            init()
        self.encodings_map = _encodings_map_default.copy()
        self.suffix_map = _suffix_map_default.copy()
        self.types_map = ({}, {}) # dict for (non-strict, strict)
        self.types_map_inv = ({}, {})
        for (ext, type) in _types_map_default.items():
            self.add_type(type, ext, True)
        for (ext, type) in _common_types_default.items():
            self.add_type(type, ext, False)
        for name in filenames:
            self.read(name, strict)

It should be (I didn't test it, though):

    def __init__(self, filenames=(), strict=True):
        if not inited:
            init()
        self.encodings_map = encodings_map.copy()
        self.suffix_map = suffix_map.copy()
        self.types_map = ({}, {}) # dict for (non-strict, strict)
        self.types_map_inv = ({}, {})
        for (ext, type) in types_map.items():
            self.add_type(type, ext, True)
        for (ext, type) in common_types.items():
            self.add_type(type, ext, False)
        for name in filenames:
            self.read(name, strict)
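
The same idea can be tried without patching the standard library. The sketch below is a subclass that seeds itself from the module-level maps, assuming `mimetypes.types_map` and `mimetypes.common_types` already hold the OS-provided entries once `mimetypes.init()` has run. `SystemAwareMimeTypes` is a hypothetical name, not part of Python or preview-generator.

```python
import mimetypes


class SystemAwareMimeTypes(mimetypes.MimeTypes):
    """MimeTypes variant that also includes the module-level (OS-aware) maps."""

    def __init__(self, filenames=(), strict=True):
        # Load only the hardcoded defaults first; the base __init__ also
        # calls mimetypes.init() when the module is not initialised yet.
        super().__init__(strict=strict)
        # Copy the module-level maps, which include the OS mime.types files.
        for ext, mimetype in list(mimetypes.types_map.items()):
            self.add_type(mimetype, ext, True)
        for ext, mimetype in list(mimetypes.common_types.items()):
            self.add_type(mimetype, ext, False)
        # Read explicitly requested files last so that they take precedence.
        for name in filenames:
            self.read(name, strict)


docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
print("plain MimeTypes:     ", mimetypes.MimeTypes().guess_extension(docx))
print("SystemAwareMimeTypes:", SystemAwareMimeTypes().guess_extension(docx))
```

The module-level maps are read through the `mimetypes.` prefix on purpose: `init()` rebinds those names, so importing them with `from mimetypes import types_map` before initialisation could leave a stale reference.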

And I think we can force the OS mimetype files to be loaded with:

iqbal@iqbalaydrus ~ % python3
Python 3.8.2 (default, Mar 11 2020, 00:29:50) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> import os
>>> a = mimetypes.MimeTypes(filenames=[d for d in mimetypes.knownfiles if os.path.isfile(d)])
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
'.docx'
>>> 
iqbal@iqbalaydrus ~ % docker run -it python:3.8 python
Python 3.8.2 (default, Feb 26 2020, 14:58:38) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> import os
>>> a = mimetypes.MimeTypes(filenames=[d for d in mimetypes.knownfiles if os.path.isfile(d)])
>>> a.guess_extension('application/vnd.openxmlformats-officedocument.wordprocessingml.document')
'.docx'
>>>
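
The one-liner above could be turned into a reusable helper along these lines. This is a sketch, not preview-generator code: `load_system_mimetypes` is a hypothetical name, and the `candidates` parameter exists only so that another list of files can be substituted for `mimetypes.knownfiles`.

```python
import mimetypes
import os


def load_system_mimetypes(candidates=None):
    """Build a MimeTypes instance that also reads the OS mime.types files.

    candidates defaults to mimetypes.knownfiles; paths that do not exist are
    skipped, because MimeTypes fails when asked to read a missing file.
    """
    if candidates is None:
        candidates = mimetypes.knownfiles
    existing = [path for path in candidates if os.path.isfile(path)]
    return mimetypes.MimeTypes(filenames=existing)


db = load_system_mimetypes()
print(db.guess_extension(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
))
```

The result still depends on what the OS files contain, so a system without any mime.types file would continue to return None here.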