algoo / preview-generator

generates previews of files with cache management
https://pypi.org/project/preview-generator/
MIT License
224 stars 49 forks source link

Locking mechanism for LibreOffice #214

Open onidlo opened 3 years ago

onidlo commented 3 years ago

Dear all,

Preview-generator works perfectly on most of the files, except several files, sample file is here. I don't see anything special in this file. If I run headless libreoffice from CLI directly to convert this file to PDF, it finishes in less than 1 second.

When I use preview-generator, it takes over 20 seconds (what is equal to LOCK_DEFAULT_TIMEOUT = 20). No other instance of LibreOffice is running, it works perfectly on other files and running headless libreoffice does not have this issue. So I think the locking mechanism is broken.

Here is my detailed output:

...
DEBUG:PreviewGenerator:register builder for video/x-ms-wmx: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-ms-wvx: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-msvideo: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-sgi-movie: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-matroska: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-theora+ogg: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for video/x-m4v: VideoPreviewBuilderFFMPEG
DEBUG:PreviewGenerator:register builder for application/pdf: PdfPreviewBuilderPyPDF2
ERROR:PreviewGenerator:Builder <class 'preview_generator.preview.builder.cad__vtk.ImagePreviewBuilderVtk'> is missing a dependency: this builder requires vtk to be available
DEBUG:filelock:Attempting to acquire lock 140291998118048 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 140291998118048 acquired on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:PreviewGenerator:converting file bytes <_io.BufferedReader name='./F2423-a.xlsx'> to pdf file ./1e24c391300a252003a7aba4fcc9a178.pdf
DEBUG:PreviewGenerator:conversion is based on temporary file ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:temporary file written: ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:converting ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx to pdf into folder ./
DEBUG:filelock:Attempting to acquire lock 140291998118480 on ./libreoffice.lock
INFO:filelock:Lock 140291998118480 acquired on ./libreoffice.lock
DEBUG:filelock:Attempting to release lock 140291998118480 on ./libreoffice.lock
INFO:filelock:Lock 140291998118480 released on ./libreoffice.lock
DEBUG:PreviewGenerator:renaming output file ./1e24c391300a252003a7aba4fcc9a178.pdf.pdf to ./1e24c391300a252003a7aba4fcc9a178.pdf
INFO:PreviewGenerator:Removing temporary copy file ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:Removing flag file ./1e24c391300a252003a7aba4fcc9a178.pdf_flag
DEBUG:filelock:Attempting to release lock 140291998118048 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 140291998118048 released on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:filelock:Attempting to acquire lock 140292011021552 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 140292011021552 acquired on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:PreviewGenerator:converting file bytes <_io.BufferedReader name='./1e24c391300a252003a7aba4fcc9a178.pdf'> to pdf file ./1e24c391300a252003a7aba4fcc9a178-1240x1754
DEBUG:PreviewGenerator:conversion is based on temporary file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:temporary file written: ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:converting ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf to pdf into folder ./
DEBUG:filelock:Attempting to acquire lock 140291998118384 on ./libreoffice.lock
INFO:filelock:Lock 140291998118384 acquired on ./libreoffice.lock

And in this point the preview-generator is stuck for 20 seconds. After 20 seconds it finishes:

DEBUG:filelock:Attempting to release lock 140291998118384 on ./libreoffice.lock
INFO:filelock:Lock 140291998118384 released on ./libreoffice.lock
DEBUG:PreviewGenerator:renaming output file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf to ./1e24c391300a252003a7aba4fcc9a178-1240x1754
INFO:PreviewGenerator:Removing temporary copy file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:Removing flag file ./1e24c391300a252003a7aba4fcc9a178-1240x1754_flag
DEBUG:filelock:Attempting to release lock 140292011021552 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 140292011021552 released on ./1e24c391300a252003a7aba4fcc9a178.lock
./F2423-a.xlsx → ./F2423-a_preview.jpeg

It is stucked here:

            libreoffice_lock = self._get_libreoffice_lock(cache_path)
            cache_path_hash = hashlib.md5(cache_path.encode("utf-8")).hexdigest()
            with libreoffice_lock:
                // THE FOLLOWING CALL TAKES 20 SECONDS:
                check_call(
                    [
                        "libreoffice",
                        "--headless",
                        "--convert-to",
                        "pdf:writer_pdf_Export",
                        temporary_input_content_path,
                        "--outdir",
                        cache_path,
                        "-env:UserInstallation=file:///tmp/LibreOffice-conversion-{}".format(
                            cache_path_hash
                        ),  # nopep8
                    ],
                    stdout=DEVNULL,
                    stderr=STDOUT,
                )

And this is how looks like the output of ps -aux:

ono       129299  3.2  0.1 120276 43356 pts/5    S+   09:34   0:00 /usr/bin/python3 /home/ono/.local/bin/preview -vvv ./F2423-a.xlsx
ono       129438  0.0  0.0 164100  4772 pts/5    Sl+  09:34   0:00 /usr/lib/libreoffice/program/oosplash --headless --convert-to pdf:writer_pdf_Export ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf --outdir ./ -env:UserInstallation=file:///tmp/LibreOf
ono       129472 88.5  0.9 645936 303128 pts/5   Rl+  09:34   0:05 /usr/lib/libreoffice/program/soffice.bin -env:UserInstallation=file:///tmp/LibreOffice-conversion-0374ec857381230fa83049643ce3cdee --headless --convert-to pdf:writer_pdf_Export ./1e24c39
ono       129486  0.0  0.0  14916  3540 pts/6    R+   09:34   0:00 ps aux

If you would like to test it yourself, the sample file is here. Any ideas what is wrong?

inkhey commented 3 years ago

Hello, Thank for reporting this issue.

Can you confirm it's just a performance issue and preview is correctly generated after 20s ?

I don't have much idea about the reason of the issue for now. Did you try disabling libreoffice lock (just remove "with libreoffice_lock"), do you try setting the default lock timeout ? Also maybe trying running exact same command line from terminal.

onidlo commented 3 years ago

Hello inkhey,

            libreoffice_lock = self._get_libreoffice_lock(cache_path)
            cache_path_hash = hashlib.md5(cache_path.encode("utf-8")).hexdigest()
            #with libreoffice_lock:
            logger.debug("Before check_call")
            check_call(
                    [
                        "libreoffice",
                        "--headless",
                        "--convert-to",
                        "pdf:writer_pdf_Export",
                        temporary_input_content_path,
                        "--outdir",
                        cache_path,
                        "-env:UserInstallation=file:///tmp/LibreOffice-conversion-{}".format(
                            cache_path_hash
                        ),  # nopep8
                    ],
                    stdout=DEVNULL,
                    stderr=STDOUT,
            )
        logger.debug("After check_call")

And here is the output:

...
INFO:PreviewGenerator:New Preview builder of class<class 'preview_generator.preview.builder.office__libreoffice.OfficePreviewBuilderLibreoffice'>
INFO:PreviewGenerator:New Preview builder of class<class 'preview_generator.preview.builder.office__libreoffice.OfficePreviewBuilderLibreoffice'>
DEBUG:filelock:Attempting to acquire lock 139936668946592 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 139936668946592 acquired on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:PreviewGenerator:converting file bytes <_io.BufferedReader name='./F2423-a.xlsx'> to pdf file ./1e24c391300a252003a7aba4fcc9a178.pdf
DEBUG:PreviewGenerator:conversion is based on temporary file ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:temporary file written: ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:converting ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx to pdf into folder ./
DEBUG:PreviewGenerator:Before check_call
DEBUG:PreviewGenerator:After check_call
DEBUG:PreviewGenerator:renaming output file ./1e24c391300a252003a7aba4fcc9a178.pdf.pdf to ./1e24c391300a252003a7aba4fcc9a178.pdf
INFO:PreviewGenerator:Removing temporary copy file ./1e24c391300a252003a7aba4fcc9a178.pdf.xlsx
DEBUG:PreviewGenerator:Removing flag file ./1e24c391300a252003a7aba4fcc9a178.pdf_flag
DEBUG:filelock:Attempting to release lock 139936668946592 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 139936668946592 released on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:filelock:Attempting to acquire lock 139936681797328 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 139936681797328 acquired on ./1e24c391300a252003a7aba4fcc9a178.lock
DEBUG:PreviewGenerator:converting file bytes <_io.BufferedReader name='./1e24c391300a252003a7aba4fcc9a178.pdf'> to pdf file ./1e24c391300a252003a7aba4fcc9a178-1240x1754
DEBUG:PreviewGenerator:conversion is based on temporary file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:temporary file written: ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:converting ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf to pdf into folder ./
DEBUG:PreviewGenerator:Before check_call

Here it is stucked for ~20 seconds and then it finishes successfully:

DEBUG:PreviewGenerator:After check_call
DEBUG:PreviewGenerator:renaming output file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf to ./1e24c391300a252003a7aba4fcc9a178-1240x1754
INFO:PreviewGenerator:Removing temporary copy file ./1e24c391300a252003a7aba4fcc9a178-1240x1754.pdf
DEBUG:PreviewGenerator:Removing flag file ./1e24c391300a252003a7aba4fcc9a178-1240x1754_flag
DEBUG:filelock:Attempting to release lock 139936681797328 on ./1e24c391300a252003a7aba4fcc9a178.lock
INFO:filelock:Lock 139936681797328 released on ./1e24c391300a252003a7aba4fcc9a178.lock
./F2423-a.xlsx → ./F2423-a_preview.jpeg
inkhey commented 3 years ago

Very weird bug. Just made some experiments (with libreoffice 6.3.5.2), i can reproduce but don't have clear answers yet. I don't have much time to test now, but i do think it may be useful to try latest version of libreoffice on it (don't know which version you are using), we already found some file that doesn't match well with some older version of it. As far i checked, i don't see any clear relation with the locking mecanism ...

onidlo commented 3 years ago

I have tested on two different versions of LibreOffice (LibreOffice 6.1.5.2 10(Build:2), LibreOffice 7.0.3.1 00(Build:1)), the bug is present in both versions.

danielgoqueiroz commented 3 years ago

Hello onilio, a had the same problem with a .html file. Try add "--infilter" parameter on "office_libreoffice.py" , worked for me :)

It was like this for me:

check_call( [ "libreoffice", "--headless", "--infilter=writerglobal8_HTML", <---------- "--convert-to", "pdf:writer_pdf_Export", temporary_input_content_path, "--outdir", cache_path, "-env:UserInstallation=file:///tmp/LibreOffice-conversion-{}".format( cache_path_hash ), # nopep8 ], stdout=DEVNULL, stderr=STDOUT, )

seen on: https://superuser.com/a/1436156

onidlo commented 3 years ago

Hi @danielgoqueiroz , adding "infilter" parameter tells to LibreOffice more about the input file - in your case you use probably HTML as an input, right? But in my case, I convert Excel, Word, PowerPoint, HTML etc - actually anything what LibreOffice supports to PDF. So in my case this option does not help.

danielgoqueiroz commented 3 years ago

What a pity :/ In my case I export several types of files, but a specific .html stopped processing. Wouldn't adding other types using the --filter command work for you?

An example of using the filter for various types of files can be found at the link below.

https://stackoverflow.com/a/30465397