Closed SamEdwardes closed 1 year ago
Hey, thanks for the question! I am actually not sure how / if `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` would speed things up.
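For reference, a minimal sketch of what that call looks like in isolation. This uses a blank pipeline and an assumed `CPU_CORES` constant (neither is from the thread); note it only parallelizes the spaCy side of the work, not the OCR step itself:

```python
import spacy

# A blank English pipeline stands in for whatever model is actually used;
# a real pipeline would load a trained model, e.g. spacy.load("en_core_web_sm").
nlp = spacy.blank("en")

# Texts as they might come out of an OCR step (placeholder strings here).
texts = ["First page of extracted text.", "Second page of extracted text."]

# CPU_CORES is an assumed constant. n_process forks worker processes;
# batch_size controls how many texts each worker receives at a time.
CPU_CORES = 4
docs = list(nlp.pipe(texts, n_process=CPU_CORES - 1, batch_size=100))
print(len(docs))
```

Whether this helps depends on how expensive the pipeline components are per document; for very cheap pipelines the inter-process overhead can outweigh the gain.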
My idea is to use multiprocessing to speed up pytesseract (e.g. performing the OCR on multiple pages at the same time).
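That idea can be sketched with `concurrent.futures` from the standard library. The `extract_text` worker below is a hypothetical stand-in so the sketch runs without Tesseract installed; a real pipeline would call `pytesseract.image_to_string` inside it:

```python
from concurrent.futures import ProcessPoolExecutor


def extract_text(page_path):
    # Hypothetical stand-in for the real OCR call, which would be
    # something like: pytesseract.image_to_string(page_path)
    return f"text from {page_path}"


def ocr_pages(page_paths, workers=4):
    # Each page image is handed to a separate process, so several pages
    # are OCR'd at the same time instead of one after another.
    # pool.map preserves the input order of page_paths.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_text, page_paths))


if __name__ == "__main__":
    pages = [f"page_{i}.png" for i in range(4)]
    print(ocr_pages(pages))
```

Since OCR is CPU-bound, processes (not threads) are the right tool here; the GIL would serialize a thread-based version.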
Thanks Sam. My understanding is that it would speed things up if you are processing hundreds of PDF files at the same time.
I think the tricky thing with `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):` is that I am not sure how spaCy implemented it under the hood.
Take a look at the implementation notes for spacypdfreader:
Because of this I am not sure if it will play nice with `nlp.pipe` and setting `n_process`.
Consider using Ray to implement multiprocessing. They have a good tutorial here: https://docs.ray.io/en/latest/data/examples/ocr_example.html.
Not sure if my question is relevant here, but is there a way of using the multiprocessing `pipe` functionality from spaCy? Like in this: `for e in nlp.pipe(texts, n_process=CPU_CORES-1, batch_size=100):`