SamEdwardes / spacypdfreader

Easy PDF to text to spaCy text extraction in Python.
https://samedwardes.github.io/spacypdfreader/
MIT License

Fails to import PDF document #12

Closed arky closed 1 year ago

arky commented 1 year ago

Unable to import this PDF document using spacypdfreader. The import results in high CPU usage and causes the system to hang.

https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road_Executive_Summary.pdf

SamEdwardes commented 1 year ago

Can you please provide a reproducible example and include the output?

arky commented 1 year ago

@SamEdwardes Here is a simple test case I have used: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d The issue surfaces when you process a PDF with a large number of pages.
The program runs forever, consuming all available CPU.

SamEdwardes commented 1 year ago

Thank you @arky. Are you able to share the specific code and output related to the issue? Your gist refers to "test.pdf", and the PDF you shared in your first message is only 3 pages.

arky commented 1 year ago

@SamEdwardes I was able to reproduce the error using the following file as a test case. Code snippet: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d Test case: https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf However, any PDF with a sufficiently large number of pages should trigger the same problem.

Unfortunately, I wasn't able to capture any debug logs because the process becomes unresponsive.
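(A side note, not from the thread: when a process hangs without producing logs, the standard library's `faulthandler` can still reveal where it is stuck. The sketch below is a generic pattern; the `pdf_reader` call shown in the comment is the suspect step one would wrap in practice.)

```python
import faulthandler

# Ask Python to dump every thread's traceback to stderr if the process
# is still running after 30 seconds, then exit. This shows *where* a
# hung extraction is stuck even when the process stops responding.
faulthandler.dump_traceback_later(30, exit=True)

# ... run the suspect call here, e.g. doc = pdf_reader('test.pdf', nlp) ...

faulthandler.cancel_dump_traceback_later()  # reached only if the call finished
status = "finished before the timeout"
print(status)
```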

SamEdwardes commented 1 year ago

Thank you for providing the updated PDF. The new PDF is 166 pages. Here is the code I ran:

import requests
import spacy
from rich import print
from spacypdfreader import pdf_reader

# download the pdf
url = 'https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf'
r = requests.get(url, stream=True)

with open('test.pdf', 'wb') as f:
    f.write(r.content)

# Load the PDF document
nlp = spacy.load('en_core_web_sm')
doc = pdf_reader('test.pdf', nlp)

# View the results
page_count = doc._.last_page
for page in range(1, page_count + 1):
    print(page)
    print(doc._.page(page)[0:50])

This code did execute for me, but it took 4 minutes and 49 seconds.
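(For anyone reproducing this, a simple way to measure the run is `time.perf_counter`. The sketch below uses a hypothetical `slow_step` stand-in in place of the real `pdf_reader` call so it is self-contained.)

```python
import time

def slow_step() -> str:
    # Hypothetical stand-in for the slow pdf_reader('test.pdf', nlp) call.
    time.sleep(0.05)
    return "done"

start = time.perf_counter()
result = slow_step()
elapsed = time.perf_counter() - start
print(f"{result} in {elapsed:.2f}s")
```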

Agreed - it is very slow. Unfortunately, PDF-to-text conversion in general is slow. I have a few ideas:

I also have an open issue to implement multiprocessing, which would likely help speed things up (#8).
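(To illustrate the multiprocessing idea, here is a minimal sketch of fanning per-page work out across processes. `extract_page` is a hypothetical stand-in, not part of spacypdfreader; a real implementation would call a per-page PDF-to-text function there. Per-page extraction is embarrassingly parallel, so the pages can be processed independently and reassembled in order.)

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def extract_page(page_number: int) -> str:
    # Hypothetical stand-in for a real per-page PDF-to-text call.
    return f"text of page {page_number}"

page_numbers = list(range(1, 167))  # the PDF in question has 166 pages

# "fork" keeps this sketch self-contained on POSIX; real code would
# normally guard pool creation under `if __name__ == "__main__":`.
with ProcessPoolExecutor(mp_context=mp.get_context("fork")) as pool:
    # map() preserves input order, so the pages come back in sequence.
    pages = list(pool.map(extract_page, page_numbers, chunksize=16))

print(len(pages))
```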

arky commented 1 year ago

Thank you @SamEdwardes for doing the research. Perhaps, for now, a note about handling large documents could be added to the documentation.

SamEdwardes commented 1 year ago

Good suggestion, thank you Arky!

arky commented 1 year ago

@SamEdwardes Touching base to see if we could resolve this issue, either by implementing multiprocessing or by expanding the docs as a stop-gap measure.

Thanks!

SamEdwardes commented 1 year ago

@arky thank you for the reminder! I can make an update to the docs today!

arky commented 1 year ago

@SamEdwardes You are most welcome, please let me know if I could help in any way.

SamEdwardes commented 1 year ago

I added a tip to the docs: https://github.com/SamEdwardes/spacypdfreader/commit/6d9f5b7deba77690ebc0ac0f1046a75142b223e0

I think we can close this issue now.

arky commented 1 year ago

Thank you @SamEdwardes