Open Vidminas opened 10 months ago
I switched to pypdfium2 because of the poppler dependency of pdf2image. I'll have a look.
Hi, pypdfium2 maintainer here.
You can simply use the page-level rendering method, which does not use multiprocessing:
n_pages = len(pdf)
page = pdf[i]
image = page.render(...).to_...(...)
I regret to say that the document-level pdf.render() API was an inherent design mistake, since it implies transferring bitmaps across processes. Also, as you have noticed here, pypdfium2 providing an API with a "hidden" process pool is kind of problematic. pdf.render() is deprecated for these reasons; however, callers are encouraged to implement their own parallelization without bitmap transfer.
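To make the page-level suggestion concrete, here is a minimal sketch of rendering a whole document in the calling process, with no process pool involved. The helper name render_pages_in_process and the default scale are assumptions for illustration only; the calls it relies on are len(pdf), pdf[i], page.render(scale=...) and the to_pil() converter, which is one possible choice for the elided to_...() step above.

```python
def render_pages_in_process(pdf, scale=2.0):
    """Render every page of an already opened document, one page at a time.

    Hypothetical helper: `pdf` is expected to behave like a pypdfium2
    document (supports len() and integer indexing, pages expose .render()).
    Nothing here spawns sub-processes, so it can run inside a pool worker.
    """
    images = []
    for i in range(len(pdf)):
        page = pdf[i]                      # page-level handle
        bitmap = page.render(scale=scale)  # rasterize in this process
        images.append(bitmap.to_pil())     # assumed converter; others exist
    return images
```

Assuming pypdfium2 is installed, this could be called as render_pages_in_process(pypdfium2.PdfDocument("paper.pdf")), where the file name is a placeholder. Because each worker opens and renders its own PDF, no bitmaps need to cross process boundaries.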
A regression was introduced in commit https://github.com/facebookresearch/nougat/commit/9e2572bf5d100a5a7521576908eb5713e0dd24c8: previously with PyMuPDF or pdf2image rasterizer implementations, it was possible to run nougat in a multiprocessing pool, so that multiple PDFs could be parsed at the same time.
With pypdfium2 this is no longer possible. Running with multiprocessing results in errors like this:
It happens because pypdfium2's Document.render method has these lines:
and in Python it is not possible to nest multiprocessing pools (at least not with the built-in implementation). Although it is possible to set n_processes to 1 in Document.render, there is no option not to create sub-processes altogether. For comparison, the torch.DataLoader class solves this by allowing num_workers to be set to 0 and handling it as a special case:
but I guess it might be more difficult to solve this from the pypdfium2 side than to switch back to the earlier pdf2image implementation, unless there is a good reason to use pypdfium2?
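The "0 workers means no sub-processes" idea can be sketched generically as follows. This is not pypdfium2's or PyTorch's actual code; map_maybe_parallel and its n_processes parameter are hypothetical names used only to illustrate the special case.

```python
from multiprocessing import Pool


def map_maybe_parallel(func, items, n_processes=0):
    """Apply func to each item, optionally using a process pool.

    Hypothetical helper: n_processes == 0 is treated as a special case
    that runs everything in the calling process, so the function stays
    usable from inside another pool's worker.
    """
    if n_processes == 0:
        # No pool is created at all, so there is nothing to nest.
        return [func(item) for item in items]
    with Pool(n_processes) as pool:
        return pool.map(func, items)
```

With n_processes=0 the call is an ordinary loop, which is what a caller that is already running inside a multiprocessing worker would need.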