Add pdfium to test runs

JorjMcKie commented 1 year ago

These are just the scripts I used for testing this. It seemed easier to submit them this way so they can be fitted properly into main.py. The pdfium version can be checked via pdfium.version.V_PYPDFIUM2. The recommended import statement seems to be import pypdfium2 as pdfium.

My installation was as simple as python -m pip install pypdfium2.
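
A minimal sketch of the import and version check described above. It assumes pypdfium2 may or may not be installed, and that releases other than the one tested here might not expose V_PYPDFIUM2, so it falls back to the installed package metadata. The helper name pdfium_version is hypothetical.

```python
from importlib.metadata import version as dist_version

try:
    import pypdfium2 as pdfium  # the import form recommended above
except ImportError:
    pdfium = None  # package not installed in this environment


def pdfium_version():
    """Return the pypdfium2 version as a string, or None if it is not installed."""
    if pdfium is None:
        return None
    # V_PYPDFIUM2 is the attribute named in this thread; assume it can be
    # missing in other releases and fall back to the package metadata.
    version_module = getattr(pdfium, "version", None)
    value = getattr(version_module, "V_PYPDFIUM2", None)
    if value is None:
        value = dist_version("pypdfium2")
    return str(value)


print("pypdfium2 version:", pdfium_version())
```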

These are my test results on Windows 11:

------------------------------ Copy-Speed ------------------------------
                             pymupdf   pdfium    pdfrw  pikepdf   pypdf2
                  adobe.pdf     1.56     5.23      5.6    21.36   385.87
        artifex-website.pdf     0.24     0.16     0.42     1.36     3.05
        chinese-example.pdf     1.56     0.56     1.91     4.07    20.53
             db-systems.pdf     0.12     0.17     0.48     1.78     2.92
              fontforge.pdf     0.04     0.06     0.15     0.26     1.15
                 pandas.pdf     0.28     0.96     2.41     2.82    69.45
                pymupdf.pdf     0.09     0.25     0.59      0.8      6.4
             pythonbook.pdf     0.14     0.96     1.23     1.43    38.25
  sample-50-MB-pdf-file.pdf     0.09     1.78      0.1     3.02     0.06
------------------------------------------------------------------------
              Totals (sec):     4.12    10.13    12.89     36.9   527.68
          Relative to best:        1     2.46     3.13     8.96   128.08
========================================================================
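
The Totals and Relative to best rows can be reproduced from the per-file timings. A minimal sketch, assuming the totals are plain column sums and the best engine is the one with the smallest total; the dictionary and variable names are illustrative and not taken from main.py.

```python
# Per-file copy timings in seconds, in the row order of the Copy-Speed table.
copy_speed = {
    "pymupdf": [1.56, 0.24, 1.56, 0.12, 0.04, 0.28, 0.09, 0.14, 0.09],
    "pdfium":  [5.23, 0.16, 0.56, 0.17, 0.06, 0.96, 0.25, 0.96, 1.78],
    "pdfrw":   [5.6, 0.42, 1.91, 0.48, 0.15, 2.41, 0.59, 1.23, 0.1],
    "pikepdf": [21.36, 1.36, 4.07, 1.78, 0.26, 2.82, 0.8, 1.43, 3.02],
    "pypdf2":  [385.87, 3.05, 20.53, 2.92, 1.15, 69.45, 6.4, 38.25, 0.06],
}

# Sum each column, then divide every total by the smallest one.
totals = {name: round(sum(times), 2) for name, times in copy_speed.items()}
best = min(totals.values())
relative = {name: round(total / best, 2) for name, total in totals.items()}

for name in copy_speed:
    print(f"{name:>8}: total {totals[name]:7.2f} s, relative {relative[name]:6.2f}")
```

Note that the ranking by total can hide per-file differences: pypdf2 is slowest overall but has the lowest time on sample-50-MB-pdf-file.pdf.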

--------------------------------- Text-Speed --------------------------------
                              pymupdf    pdfium   poppler    pypdf2  pdfminer
                  adobe.pdf      3.23      2.68      6.09     23.44     52.95
        artifex-website.pdf      0.23      0.22      0.31      1.07      4.25
        chinese-example.pdf      5.17      6.56      6.63     162.6      92.8
             db-systems.pdf      1.78      2.63      4.16     27.21     46.41
              fontforge.pdf      0.26      0.32      0.42      2.78      4.68
                 pandas.pdf      2.55      3.29     10.67     26.62     83.52
                pymupdf.pdf      0.48      0.85      2.28      6.53     14.44
             pythonbook.pdf       0.9      1.11      2.89      9.38     25.97
  sample-50-MB-pdf-file.pdf      0.26      0.68      0.43      9.33     14.19
-----------------------------------------------------------------------------
              Totals (sec):     14.86     18.34     33.88    268.96    339.21
          Relative to best:         1      1.23      2.28      18.1     22.83
=============================================================================

------------------------- Render-Speed ------------------------
                             pymupdf   pdfium  poppler  pdf2jpg
                  adobe.pdf    50.24    84.58    97.83    79.31
        artifex-website.pdf    28.13    45.29    53.11    54.65
        chinese-example.pdf   165.27   253.27   265.64   188.58
             db-systems.pdf    86.27   113.84   147.24   415.69
              fontforge.pdf    12.93    19.02    21.87    20.82
                 pandas.pdf   136.88   216.21   244.09   213.02
                pymupdf.pdf    23.37    38.08    37.53    33.94
             pythonbook.pdf    31.61     50.5    51.38    56.17
  sample-50-MB-pdf-file.pdf     0.87     1.32     1.44     4.54
---------------------------------------------------------------
              Totals (sec):   535.57   822.11   920.13  1066.72
          Relative to best:        1     1.54     1.72     1.99
===============================================================

Note on rendering:

pdfium has major issues with one or two of the example files, just like pdf2jpg.
pdfium has an internal image format much like Pixmap in MuPDF, but no output processor of its own; it requires PIL/Pillow for outputting images.
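
The detour through PIL/Pillow can be sketched as follows. This assumes a pypdfium2 release that provides PdfDocument, PdfPage.render() and PdfBitmap.to_pil() (the rendering calls differ between releases, so this may not match the version benchmarked here), and that Pillow is installed. The function name and file paths are hypothetical.

```python
def render_first_page(pdf_path, png_path, scale=2.0):
    """Render page 0 of pdf_path and write it to png_path via Pillow.

    Assumes a pypdfium2 release with PdfPage.render() and PdfBitmap.to_pil().
    """
    # Imported lazily so this module loads even where pypdfium2 is absent.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        page = pdf[0]
        bitmap = page.render(scale=scale)  # pdfium's internal bitmap
        image = bitmap.to_pil()            # hand over to Pillow
        image.save(png_path)               # Pillow does the file output
        return image.size
    finally:
        pdf.close()


# Example call (paths are placeholders):
# render_first_page("adobe.pdf", "adobe-page0.png")
```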

mara004 commented 1 year ago

Thanks for this benchmark! I've got a few random thoughts:

The number of test documents appears to be relatively small and selective. It might be interesting to run this with more and different documents. In particular, I would recommend avoiding "peculiar" documents that are problematic for some engines, as they can make the results non-representative.
Your pdfbox wrapper, pdf2jpg, calls a jar via subprocess, which adds java/pdfbox startup time to every measurement. I would expect this overhead to be relevant for small test cases. It might be better to call pdfbox directly via JPype, so that java/pdfbox is loaded only once, before the actual benchmarking.
Thanks for having pdfium :)
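
The startup-overhead point can be illustrated without Java. A rough sketch, using a trivial Python child process as a stand-in for launching the pdfbox jar; the helper names are hypothetical and the absolute numbers depend on the machine.

```python
import subprocess
import sys
import time


def best_time(func, repeats=3):
    """Return the best wall-clock time in seconds over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def in_process():
    # The "work" itself, run inside the already loaded interpreter.
    sum(range(1000))


def via_subprocess():
    # The same work, but paying interpreter startup on every call,
    # analogous to starting java/pdfbox once per benchmark run.
    subprocess.run([sys.executable, "-c", "sum(range(1000))"], check=True)


t_in = best_time(in_process)
t_sub = best_time(via_subprocess)
print(f"in-process: {t_in:.6f} s, via subprocess: {t_sub:.6f} s")
```

For a tiny workload the subprocess figure is almost entirely startup cost, which is why it matters most for the small test files.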

mara004 commented 1 year ago

Small addendum:

I'm not sure whether pdf2image uses concurrency, or might be changed to do so in the future. In that case, it would distort the results, since the other engines only render with a single job. Using the poppler API directly (e.g. with python-poppler) might be better.
Concerning pdfbox, I've written a small demo script using JPype: https://gist.github.com/mara004/51c3216a9eabd3dcbc78a86d877a61dc

ArtifexSoftware / PyMuPDF-performance
