Libreoffice process stuck when processing "html" file

danielgoqueiroz commented 3 years ago

For some reason, when I send certain "html" files for processing, the process gets stuck. It does not happen with all "html" files, only a few. However, I observed that adding the parameter "--infilter=writerglobal8_HTML" in the library file "office__libreoffice.py", line 120, makes the processing of these ".html" files succeed.

I am attaching the "html" file for your analysis, in case you need it.

For now, I added one line of code (marked below), and it worked for me:

        with libreoffice_lock:
            check_call(
                [
                    "libreoffice",
                    "--headless",
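                    # Added line: pass the HTML import filter only for HTML input.
                    # Unpacking a list avoids passing an empty "" argument otherwise.
                    *(["--infilter=writerglobal8_HTML"] if mimetype == "text/html" else []),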
                    "--convert-to",
                    "pdf:writer_pdf_Export",
                    temporary_input_content_path,
                    "--outdir",
                    cache_path,
                    "-env:UserInstallation=file:///tmp/LibreOffice-conversion-{}".format(
                        cache_path_hash
                    ),  # nopep8
                ],
                stdout=DEVNULL,
                stderr=STDOUT,
            )

I appreciate the attention.

file_bug.html.zip

inkhey commented 3 years ago

Hello. Thanks for this interesting contribution.

I just tested, and there is indeed a problem in LibreOffice itself: it hangs with the current configuration:

libreoffice --headless --convert-to pdf:writer_pdf_Export /<path>/02_arquivo_bug.html --outdir /tmp/cache -env:UserInstallation=file:///tmp/LibreOffice-conversion-test
convert /<path>/02_arquivo_bug.html -> /tmp/cache/02_arquivo_bug.pdf using filter : writer_pdf_Export

I am not completely sure it is the same problem, but there is a similar issue in the LibreOffice Bugzilla: https://bugs.documentfoundation.org/show_bug.cgi?id=140177

Your workaround works too, so I suggest you open a pull request with just a few modifications, if you agree:

Make it easy to add special behavior for other mimetypes too (introduce a generic custom command parameters variable for this and move the if/else outside of check_call).
Add a test for "html" files too, to ensure we do not break the feature later.
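
A minimal sketch of the first point, assuming the command is assembled in a helper before being handed to check_call. The function name build_convert_command and its parameters are hypothetical, chosen only for illustration; they are not the library's actual code.

```python
from typing import List


def build_convert_command(
    mimetype: str,
    input_path: str,
    output_dir: str,
    user_installation_url: str,
) -> List[str]:
    """Build the LibreOffice command line for a PDF conversion (hypothetical helper)."""
    # Decide on the extra parameters before building the command, so that
    # other mimetypes can get their own special case later.
    if mimetype == "text/html":
        custom_params = ["--infilter=writerglobal8_HTML"]
    else:
        custom_params = []  # nothing extra, and no empty "" argument either

    return [
        "libreoffice",
        "--headless",
        *custom_params,
        "--convert-to",
        "pdf:writer_pdf_Export",
        input_path,
        "--outdir",
        output_dir,
        "-env:UserInstallation={}".format(user_installation_url),
    ]
```

The returned list could then be passed to check_call as-is, and a unit test can check the command for an "html" input without having to start LibreOffice.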

Later, depending on how useful these filters turn out to be for other mimetypes, I think we can consider storing a mimetype:custom_parameters dictionary (or similar object) in the OfficePreviewBuilderLibreoffice.
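
As a rough illustration of the dictionary idea, here is a sketch that uses a stand-in class. LibreofficeCommandBuilder, custom_parameters and get_custom_parameters are hypothetical names and do not reflect the real OfficePreviewBuilderLibreoffice.

```python
from typing import Dict, List


class LibreofficeCommandBuilder:
    """Hypothetical stand-in for the builder class, for illustration only."""

    # mimetype -> extra command line parameters for LibreOffice
    custom_parameters: Dict[str, List[str]] = {
        "text/html": ["--infilter=writerglobal8_HTML"],
    }

    def get_custom_parameters(self, mimetype: str) -> List[str]:
        # Return a copy so callers cannot modify the shared mapping;
        # unknown mimetypes get no extra parameters.
        return list(self.custom_parameters.get(mimetype, []))
```

With this layout, supporting another mimetype would only mean adding one entry to the dictionary.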

danielgoqueiroz commented 3 years ago

Hello @inkhey, first, sorry for the delay in answering. Of course I accept your proposal; it was a great suggestion.

I made the adjustment using my limited Python knowledge and your very good tips. If anything goes wrong, please let me know.

Finally, "It was a small step for your library, but a giant leap for my python learning." hahaha

Thanks for your attention.

inkhey commented 3 years ago

fixed by #228

algoo / preview-generator

Libreoffice process stuck when processing "html" file #226