SamEdwardes / spacypdfreader

Easy PDF to text to spaCy text extraction in Python.
https://samedwardes.github.io/spacypdfreader/
MIT License

Suggestion - Improve speed performance when extracting text #3

Closed. victorescosta closed this issue 2 years ago

victorescosta commented 2 years ago

Are there any plans to improve speed performance? The library is built on pdfminer.six, right? Would it be possible to speed things up in the future using spaCy? If so, how could it be done? I'm happy to help if I can be useful.

SamEdwardes commented 2 years ago

Hey Victor - yes I would like to speed up performance! Currently the base is pdfminer.six. The task of converting a PDF to text is the bottleneck (as opposed to anything spaCy is doing).

I think the approach I would like to take is to have a default base (like pdfminer.six), but then allow users to plug in their own PDF to text extraction function as well. For example, I find pytesseract has the best accuracy, but it is really slow. Users should be able to choose their own extraction.
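A rough sketch of what that plug-in idea could look like is below. It is only an illustration, not the library's actual API: `PdfParser`, `default_parser`, `slow_but_accurate_parser`, and `read_pdf_text` are hypothetical names, and the parser bodies are stand-ins that return placeholder strings rather than calling pdfminer.six or pytesseract.

```python
from typing import Callable, Optional

# Hypothetical type: a parser takes a PDF path and returns the extracted text.
PdfParser = Callable[[str], str]


def default_parser(pdf_path: str) -> str:
    """Stand-in for a default backend such as pdfminer.six.

    A real version might call pdfminer.high_level.extract_text(pdf_path).
    """
    return f"text extracted from {pdf_path} by the default parser"


def slow_but_accurate_parser(pdf_path: str) -> str:
    """Stand-in for an OCR-based backend such as pytesseract."""
    return f"text extracted from {pdf_path} by OCR"


def read_pdf_text(pdf_path: str, parser: Optional[PdfParser] = None) -> str:
    """Extract text with the chosen parser, falling back to the default."""
    extract = parser if parser is not None else default_parser
    return extract(pdf_path)


# Use the default backend.
default_text = read_pdf_text("report.pdf")

# Plug in a different extraction function without changing read_pdf_text.
custom_text = read_pdf_text("report.pdf", parser=slow_but_accurate_parser)
```

Because a parser here is just a function from a path to a string, the extracted text could be handed to spaCy in the same way whichever backend produced it, and users who prefer accuracy over speed (or the reverse) only need to swap the function they pass in.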

SamEdwardes commented 2 years ago

Closed by #4. You can now choose between different PDF parsers or implement a custom one.