These are just the scripts I used for testing this. Sharing them this way seemed like the easiest route to fitting this properly into main.py. The pypdfium2 version can be checked via pdfium.version.V_PYPDFIUM2. The recommended import statement seems to be import pypdfium2 as pdfium.
My installation was as simple as python -m pip install pypdfium2.
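For recording which version was actually benchmarked, here is a minimal sketch that reads the installed package version from the package metadata instead of from a pypdfium2 attribute. It uses only the standard library; installed_version is a hypothetical helper name of mine, not part of pypdfium2.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if pypdfium2 is installed, otherwise None.
print("pypdfium2:", installed_version("pypdfium2"))
```

Logging this next to the results should make it easier to compare runs made on different machines.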
Thanks for this benchmark! I've got a few random thoughts:
The number of test documents appears to be relatively small and selective. It might be interesting to run this with more, and more varied, documents. In particular, I would recommend avoiding "peculiar" documents that are problematic for some engines, as they can make the results non-representative.
Your pdfbox wrapper, pdf2jpg, calls a jar via subprocess, which adds overhead to the benchmark because of the Java/pdfbox startup time. I would expect this to matter most for small test cases, where startup makes up a larger share of the total time. It might be better to call pdfbox directly via JPype, so that Java/pdfbox is only loaded once, before the actual benchmarking.
I'm not sure whether pdf2image uses concurrency, or might be changed to do so in the future. If it does, that would distort the results, since the other engines only render with a single job. Using the poppler API directly (e.g. with python-poppler) might be better.
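To illustrate the startup-overhead point, here is a rough sketch of a timing harness that runs the one-time engine setup outside the timed region and then times each render call in a single job. Everything in it is an assumption for illustration: benchmark and fake_setup are hypothetical names, and fake_setup merely simulates an expensive startup (such as loading a JVM) with a short sleep. It is not how main.py currently works.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def benchmark(setup: Callable[[], Callable[[str], object]],
              inputs: Iterable[str],
              repeats: int = 3) -> dict:
    """Time render calls only; the one-time setup is measured separately."""
    t0 = time.perf_counter()
    render = setup()  # e.g. load the engine once, before any timed rendering
    setup_seconds = time.perf_counter() - t0

    per_call = []
    for path in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            render(path)  # single job, no concurrency
            per_call.append(time.perf_counter() - start)

    return {
        "setup_seconds": setup_seconds,
        "mean_render_seconds": mean(per_call) if per_call else 0.0,
        "calls": len(per_call),
    }


def fake_setup() -> Callable[[str], object]:
    """Hypothetical stand-in for an expensive engine start-up."""
    time.sleep(0.05)  # simulated startup cost

    def render(path: str) -> int:
        return len(path)  # placeholder for real rendering work

    return render


print(benchmark(fake_setup, ["a.pdf", "b.pdf"]))
```

With a subprocess-based wrapper, the equivalent of setup_seconds is paid again on every call and ends up inside the measured time, which is why I would expect small documents to be affected the most.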
These are my test results on Windows 11:
Note on rendering: