facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

Pypdfium2 clashes with multiprocessing support #110

Open Vidminas opened 10 months ago

Vidminas commented 10 months ago

A regression was introduced in commit https://github.com/facebookresearch/nougat/commit/9e2572bf5d100a5a7521576908eb5713e0dd24c8. Previously, with the PyMuPDF or pdf2image rasterizer implementations, it was possible to run nougat in a multiprocessing pool so that multiple PDFs could be parsed at the same time.

With pypdfium2 this is no longer possible. Running with multiprocessing results in errors like this:

ERROR:root:daemonic processes are not allowed to have children
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found

This happens because pypdfium2's Document.render method contains these lines:

with mp.Pool(n_processes, **pool_kwargs) as pool:
    yield from pool.imap(_parallel_renderer_job, page_indices)

and in Python a pool worker cannot start a pool of its own (at least not with the built-in implementation): pool workers are daemonic processes, and daemonic processes are not allowed to have children. Although it is possible to set n_processes to 1 in Document.render, there is no option to avoid creating sub-processes altogether.
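The restriction comes from the standard library: workers of a multiprocessing.Pool are started as daemonic processes, and a daemonic process refuses to start children, which is the first error in the log above. Below is a minimal sketch of how code could detect this situation before trying to create its own pool. The helper names in_daemonic_process and choose_worker_count are hypothetical and are not part of nougat or pypdfium2.

```python
import multiprocessing as mp


def in_daemonic_process() -> bool:
    """Return True when called inside a daemonic process, e.g. a Pool worker."""
    return mp.current_process().daemon


def choose_worker_count(requested: int) -> int:
    """Hypothetical helper: fall back to 0 workers (run in-process)
    when this process is not allowed to start children."""
    if requested < 0:
        raise ValueError("requested must be >= 0")
    return 0 if in_daemonic_process() else requested
```

Called from the main process, choose_worker_count returns the requested value unchanged. Called from inside a pool worker, it returns 0, which only helps if the rendering code treats 0 as "do not create sub-processes".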

For comparison, the torch.utils.data.DataLoader class solves this by allowing num_workers to be set to 0 and handling it as a special case:

def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)

However, I guess it might be more difficult to solve this on the pypdfium2 side than to switch back to the earlier pdf2image implementation, unless there is a good reason to use pypdfium2?
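To illustrate the same zero-worker special case outside of PyTorch, here is a rough sketch of a mapping helper that only creates a pool when n_processes is greater than 0. The function name map_jobs and its signature are made up for this example; this is not how pypdfium2 implements Document.render.

```python
from multiprocessing import Pool
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_jobs(job: Callable[[T], R], items: Iterable[T], n_processes: int = 0) -> Iterator[R]:
    """Yield job(item) for each item, in order.

    n_processes == 0 runs everything in the calling process, so no pool
    is created and the function is safe to call from inside a pool worker.
    """
    if n_processes == 0:
        # Special case: no sub-processes at all.
        for item in items:
            yield job(item)
    else:
        # On this path, job and the items must be picklable.
        with Pool(n_processes) as pool:
            yield from pool.imap(job, items)
```

With the default n_processes=0, list(map_jobs(render_one, page_indices)) behaves like a plain loop, where render_one stands for whatever per-page function the caller provides.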

lukas-blecher commented 10 months ago

I switched to pypdfium2 because of the poppler dependency of pdf2image. I'll have a look.

mara004 commented 9 months ago

Hi, pypdfium2 maintainer here.

You can simply use the page-level rendering method, which does not use multiprocessing:

n_pages = len(pdf)
page = pdf[i]
image = page.render(...).to_...(...)

I regret to say that the document-level pdf.render() API was an inherent design mistake, since it implies transferring bitmaps across processes. Also, as you have noticed here, pypdfium2 providing an API with a "hidden" process pool is kind of problematic. pdf.render() is deprecated for these reasons. However, callers are encouraged to implement their own parallelization without bitmap transfer.
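One possible shape for such caller-side parallelization is sketched below: each worker opens the PDF itself, renders its share of pages with the page-level API, writes the images to disk, and returns only file paths, so no bitmap crosses a process boundary. This assumes pypdfium2 4.x with Pillow installed. The helper names (split_indices, render_chunk, render_pdf), the PNG naming scheme, and scale=2 are arbitrary choices for this example and not part of pypdfium2 or nougat.

```python
import multiprocessing as mp
from pathlib import Path


def split_indices(n_pages: int, n_chunks: int) -> list[list[int]]:
    """Split range(n_pages) into at most n_chunks contiguous, non-empty chunks."""
    n_chunks = max(1, min(n_chunks, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        stop = start + size + (1 if k < extra else 0)
        chunks.append(list(range(start, stop)))
        start = stop
    return [c for c in chunks if c]


def render_chunk(job: tuple[str, list[int], str]) -> list[str]:
    """Worker: open the PDF locally, render pages, save PNGs, return paths."""
    import pypdfium2 as pdfium  # imported lazily, inside the worker

    pdf_path, indices, out_dir = job
    pdf = pdfium.PdfDocument(pdf_path)
    paths = []
    for i in indices:
        image = pdf[i].render(scale=2).to_pil()  # page-level rendering, no pool
        out = Path(out_dir) / f"page_{i:04d}.png"
        image.save(out)
        paths.append(str(out))
    pdf.close()
    return paths


def render_pdf(pdf_path: str, out_dir: str, n_processes: int = 2) -> list[str]:
    """Render all pages of one PDF in parallel; only paths are sent back."""
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    n_pages = len(pdf)
    pdf.close()
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(pdf_path, chunk, out_dir) for chunk in split_indices(n_pages, n_processes)]
    with mp.Pool(n_processes) as pool:
        results = pool.map(render_chunk, jobs)
    return [p for chunk_paths in results for p in chunk_paths]
```

Note that render_pdf creates its own pool, so it has to be called from a non-daemonic process. When the outer program already runs one pool worker per PDF, as in the setup described in this issue, calling the loop from render_chunk directly inside each worker avoids nesting pools.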