algoo / preview-generator

generates previews of files with cache management
https://pypi.org/project/preview-generator/
MIT License

Preview generation for an office document is excessively slow #222

Closed lebouquetin closed 3 years ago

lebouquetin commented 3 years ago

With preview-generator 0.13, the mechanism for an office preview process is:

  1. first create the PDF pivot file
  2. for a jpeg preview of a page, if the PDF is available use it
  3. ...

Starting with preview-generator 0.14, the process is very slow. In the latest version, libreoffice is executed for every page preview, even if the pivot PDF file is already available.
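To make the expected behaviour concrete, here is a minimal sketch of the "convert once, reuse the pivot" idea. It is a toy model, not preview-generator's code: `PivotCache`, `_convert_to_pdf` and `demo_conversion_count` are hypothetical names, and the expensive libreoffice run is replaced by a counter.

```python
import os
import tempfile


class PivotCache:
    """Toy model: convert the office file to PDF once, reuse it for each page."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.conversions = 0  # counts the simulated libreoffice runs

    def _convert_to_pdf(self, source_path, pdf_path):
        # Stand-in for the expensive libreoffice conversion (assumption).
        self.conversions += 1
        with open(pdf_path, "w") as pdf:
            pdf.write("pivot for {}".format(source_path))

    def get_pivot_pdf(self, source_path, file_hash):
        pdf_path = os.path.join(self.cache_dir, file_hash + ".pdf")
        # Only convert when the pivot file is not in the cache yet.
        if not os.path.exists(pdf_path):
            self._convert_to_pdf(source_path, pdf_path)
        return pdf_path

    def get_page_preview(self, source_path, file_hash, page):
        pdf_path = self.get_pivot_pdf(source_path, file_hash)
        # A real builder would rasterize the page from pdf_path here.
        return "{}-page{}.jpeg".format(os.path.splitext(pdf_path)[0], page)


def demo_conversion_count(pages=14):
    """Request `pages` page previews and report how many conversions ran."""
    with tempfile.TemporaryDirectory() as cache_dir:
        cache = PivotCache(cache_dir)
        for page in range(pages):
            cache.get_page_preview("report.docx", "abc123", page)
        return cache.conversions
```

With this model, fourteen page previews trigger a single conversion. The regression described here corresponds to the conversion running once per page instead.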

How to reproduce:

import os
import time

from datetime import datetime

from preview_generator.manager import PreviewManager
current_dir = os.path.dirname(os.path.abspath(__file__)) +'/'

manager = PreviewManager(current_dir + 'cache')

print('{} Preview creation start ...'.format(datetime.now()))
path_to_preview = manager.get_pdf_preview(
  file_path="/home/damien/Téléchargements/ASF_Guide.Pratique.Détention.202101 (26⧸04⧸2021 à 15 14 16).docx",
)

print('{} Preview created at path : {}'.format(datetime.now(), path_to_preview))

for page in range(1, 15):
    print('{} Preview creation start for page {} ...'.format(datetime.now(), page))
    pdffile = "/home/damien/proj/preview-generator/cache/8f595fb751c20605235b5fa39a892f8b.pdf"

    path_to_preview = manager.get_jpeg_preview(
      file_path="/home/damien/Téléchargements/ASF_Guide.Pratique.Détention.202101 (26⧸04⧸2021 à 15 14 16).docx",
      height=512,
      width=512,
      page=page
    )
    print("{} PDF file {} datetime".format(time.ctime(os.path.getctime(pdffile)), pdffile))
    print('{} Preview created at path : {}'.format(datetime.now(), path_to_preview))

Also add a debug output just before libreoffice is executed:

check_call(
    [
        "libreoffice",
        "--headless",
        "--convert-to",
        "pdf:writer_pdf_Export",
        temporary_input_content_path,
        "--outdir",
        cache_path,
        "-env:UserInstallation=file:///tmp/LibreOffice-conversion-{}".format(
            cache_path_hash
        ),  # nopep8
    ],
    stdout=DEVNULL,
    stderr=STDOUT,
)
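
One possible way to get that debug output is a thin wrapper that logs the command line and then delegates to `subprocess.check_call`. This is only a sketch: `logged_check_call`, the `runner` parameter and the logger name are hypothetical and not part of preview-generator.

```python
import logging
import subprocess

logger = logging.getLogger("preview_debug")  # hypothetical logger name


def logged_check_call(command, runner=subprocess.check_call, **kwargs):
    """Log the full command line, then run it with `runner`."""
    logger.debug("running: %s", " ".join(str(part) for part in command))
    return runner(command, **kwargs)
```

After `logging.basicConfig(level=logging.DEBUG)`, temporarily replacing the `check_call(...)` above with `logged_check_call(...)` prints one line per libreoffice run, which makes the per-page conversions easy to spot.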

lebouquetin commented 3 years ago

Here is the diff between 0.13 and 0.14 versions: https://github.com/algoo/preview-generator/compare/release_0.13...release_0.14

lebouquetin commented 3 years ago

Note: feel free to ask me about the old behavior if you need to investigate old stuff (which is not always easy to understand).

lebouquetin commented 3 years ago

We had a talk with @inkhey about the architecture... then I realized the issue was not there.

Actually, since 2020-07-03 the process of building the pivot PDF file is implemented in the manager.

The process in Manager.get_jpeg_preview() is:

The resolution is to change the second builder used:

https://github.com/algoo/preview-generator/blob/35eb95b31f2e9509b4a1d761c3c3fab90b1ea571/preview_generator/manager.py#L185

        with preview_context.filelock:
            if force or not os.path.exists(preview_file_path):
                preview_context.builder.build_jpeg_preview(
                    file_path=file_path,
                    preview_name=preview_name,
                    cache_path=self.cache_path,
                    page_id=max(page, 0),  # if page is -1 then return preview of first page,
                    extension=extension,
                    size=size,
                    mimetype=preview_context.mimetype,
                )

should become:

        preview_context2 = self.get_preview_context(file_path, file_ext)  # now file_path is the PDF pivot file path
        with preview_context.filelock:
            if force or not os.path.exists(preview_file_path):
                preview_context2.builder.build_jpeg_preview(
                    file_path=file_path,
                    preview_name=preview_name,
                    cache_path=self.cache_path,
                    page_id=max(page, 0),  # if page is -1 then return preview of first page,
                    extension=extension,
                    size=size,
                    mimetype=preview_context.mimetype,
                )

This fixes the performance issue.
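
The idea behind the change can be illustrated with a small sketch: the builder should be chosen from the file that is actually handed to it (the pivot PDF), not from the original office document. The registry and `pick_builder` below are hypothetical and keyed on the file extension for brevity. They only stand in for whatever lookup `get_preview_context` really performs.

```python
import os

# Hypothetical registry: extension -> (builder name, whether it runs libreoffice).
BUILDERS = {
    ".docx": ("office_builder", True),
    ".pdf": ("pdf_builder", False),
}


def pick_builder(path):
    """Return the (name, runs_libreoffice) pair for the file at `path`."""
    extension = os.path.splitext(path)[1].lower()
    return BUILDERS[extension]


original = "/tmp/report.docx"     # illustrative path to the office document
pivot = "/tmp/cache/abc123.pdf"   # illustrative path to the cached pivot PDF
```

In this sketch `pick_builder(original)` selects the builder that runs libreoffice, while `pick_builder(pivot)` selects the one that does not, which is why re-resolving the context against the pivot file avoids the repeated conversions.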

By the way, there are two other fixes to implement:

lebouquetin commented 3 years ago

cc @grignards

inkhey commented 3 years ago

@lebouquetin In the current code, I don't see any reason to implement a two-step process for the other preview methods: it makes sense for text_preview and json to be generated from the default content (and they are now). But I agree that we should verify this kind of thing; for example, it would make sense for a "png builder", which doesn't exist yet.

lebouquetin commented 3 years ago

As discussed at the moment:

So, remaining work is:

Release 0.18 MUST be ready in order to close https://github.com/tracim/tracim/issues/4587

inkhey commented 3 years ago

fixed by both https://github.com/algoo/preview-generator/pull/223 and https://github.com/algoo/preview-generator/pull/225