SamEdwardes / spacypdfreader

Easy PDF to text to spaCy text extraction in Python.
https://samedwardes.github.io/spacypdfreader/
MIT License

Implement multi processing to speed up pytesseract #8

Closed SamEdwardes closed 1 year ago

flppgg commented 2 years ago

Not sure my question is relevant here, but is there a way of using spaCy's multiprocessing pipe functionality? Like in this:

`for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):`

SamEdwardes commented 2 years ago

Hey, thanks for the question! I am actually not sure how / if `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` would speed things up.

My idea is to use multiprocessing to speed up pytesseract (e.g. performing the OCR on multiple pages at the same time).
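A minimal sketch of that idea, using only the standard library's `concurrent.futures`: here `ocr_page` is a hypothetical stand-in for the real per-page call (e.g. `pytesseract.image_to_string` on a rendered page image), just to show the parallelisation pattern.

```python
from concurrent.futures import ProcessPoolExecutor


def ocr_page(page_number: int) -> str:
    # Placeholder for the real per-page OCR work, e.g.
    # pytesseract.image_to_string(page_image). Any CPU-bound
    # per-page function can be parallelised the same way.
    return f"text of page {page_number}"


def extract_text(page_numbers, max_workers=4):
    # Each page is OCR'd in a separate process. pool.map returns
    # results in submission order, so the pages stay in sequence
    # even though they finish at different times.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_numbers))


if __name__ == "__main__":
    print(extract_text(range(1, 4)))
```

Because OCR is CPU-bound, processes (rather than threads) are the right tool here; the per-process overhead is small compared to the cost of OCR-ing a page.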

flppgg commented 2 years ago

Thanks Sam. My understanding is that it would speed things up if you are processing hundreds of PDF files at the same time.

SamEdwardes commented 2 years ago

I think the tricky thing with `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` is that I am not sure how spaCy implemented it.

Take a look at the implementation notes for spacypdfreader:

https://github.com/SamEdwardes/spacypdfreader/blob/d32cf8e0fa0da2571e444e1f295cd6b3f2baf1ec/README.md?plain=1#L85-L112

Because of this I am not sure if it will play nice with `nlp.pipe` and setting `n_process`.

SamEdwardes commented 1 year ago

Consider using Ray to implement multiprocessing. They have a good tutorial here: https://docs.ray.io/en/latest/data/examples/ocr_example.html.