arky closed this issue 1 year ago
Can you please provide a reproducible example and include the output?
@SamEdwardes Here is the simple test case I used: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d The issue surfaces when processing a PDF with a large number of pages.
The program runs forever, taking all available CPU.
Thank you @arky. Is there any way you are able to share the specific code and output related to the issue? Your gist refers to "test.pdf", and the PDF you shared in your first message is 3 pages.
@SamEdwardes I was able to reproduce the error using the following file as a test case.
Code snippet: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d
Test case: https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf
However, any PDF with a sufficiently large number of pages should trigger a similar problem.
Unfortunately, I wasn't able to capture any debug logs because the process becomes unresponsive.
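For anyone else hitting a hang like this, one way to get some diagnostics out of an unresponsive Python process is the standard library's `faulthandler` module, which can dump the current stack traces on a timer. Below is a minimal sketch, not something spacypdfreader provides; the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` and the 60-second interval are assumptions made for illustration.

```python
import faulthandler
import sys


def enable_hang_diagnostics(interval_seconds=60.0, log_file=None):
    """Dump the tracebacks of all threads every `interval_seconds`.

    `log_file` must be a real file object (it needs a file descriptor)
    and must stay open until the diagnostics are disabled.
    Defaults to sys.stderr.
    """
    target = log_file if log_file is not None else sys.stderr
    # Also report hard crashes such as segmentation faults.
    faulthandler.enable(file=target)
    # repeat=True keeps dumping until cancelled, so a long-running
    # (or stuck) call shows where it is spending its time.
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=target)


def disable_hang_diagnostics():
    """Stop the periodic dumps and turn the fault handler off."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling `enable_hang_diagnostics()` before `pdf_reader(...)` and `disable_hang_diagnostics()` afterwards should show which function the process is in each time the timer fires, which helps tell a slow run apart from a genuine infinite loop.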
Thank you for providing the updated PDF. The new PDF is 166 pages. Here is the code I ran:
import requests
import spacy
from rich import print
from spacypdfreader import pdf_reader

# download the pdf
url = 'https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf'
r = requests.get(url, stream=True)
with open('test.pdf', 'wb') as f:
    f.write(r.content)

# Load the PDF document
nlp = spacy.load('en_core_web_sm')
doc = pdf_reader('test.pdf', nlp)

# View the results
page_count = doc._.last_page
for page in range(1, page_count + 1):
    print(page)
    print(doc._.page(page)[0:50])
This code did execute for me, but it took 4 minutes and 49 seconds.
Agreed - it is very slow. Unfortunately, PDF to text in general is slow. I have a few ideas:
I also have an open issue to implement multiprocessing. This would likely help speed things up (#8).
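To make the multiprocessing idea concrete, here is a rough sketch of how page ranges could be split across worker processes. It is not the implementation planned in #8. It assumes pdfminer.six is installed and calls its `extract_text` function directly rather than going through spacypdfreader; the helper names `chunk_pages`, `extract_range`, and `extract_parallel` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def chunk_pages(page_count: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split pages 1..page_count into contiguous (first, last) ranges."""
    if page_count < 1 or n_chunks < 1:
        return []
    n_chunks = min(n_chunks, page_count)
    base, extra = divmod(page_count, n_chunks)
    ranges = []
    start = 1
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def extract_range(job: Tuple[str, int, int]) -> str:
    """Extract the text of one page range (runs in a worker process)."""
    pdf_path, first_page, last_page = job
    # Assumption: pdfminer.six is installed. Imported here so that the
    # module itself can be loaded without it.
    from pdfminer.high_level import extract_text

    # pdfminer page numbers are zero-based; ours are one-based.
    pages = list(range(first_page - 1, last_page))
    return extract_text(pdf_path, page_numbers=pages)


def extract_parallel(pdf_path: str, page_count: int, workers: int = 4) -> str:
    """Extract text from all pages using `workers` processes."""
    jobs = [(pdf_path, first, last) for first, last in chunk_pages(page_count, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `jobs`.
        return "".join(pool.map(extract_range, jobs))
```

For the 166-page test PDF, `chunk_pages(166, 4)` gives four ranges of 41 or 42 pages each. On platforms that start workers by spawning a new interpreter (Windows, macOS), the call to `extract_parallel` needs to sit under an `if __name__ == "__main__":` guard.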
Thank you @SamEdwardes for doing the research. Perhaps for now a note about handling large documents could be added to the documentation.
Good suggestion, thank you Arky!
@SamEdwardes Touching base to see if we could resolve this issue, either by implementing multiprocessing or by expanding the docs as a stop-gap measure.
Thanks!
@arky thank you for the reminder! I can make an update to the docs today!
@SamEdwardes You are most welcome, please let me know if I could help in any way.
I added a tip to the docs: https://github.com/SamEdwardes/spacypdfreader/commit/6d9f5b7deba77690ebc0ac0f1046a75142b223e0
I think we can close this issue now.
Thank you @SamEdwardes
Unable to import this PDF document using spacypdfreader. The import results in high CPU usage and causes the system to hang.
https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road_Executive_Summary.pdf