Closed DoubleCortado closed 11 months ago
Hi @DoubleCortado - thank you for sharing the issue. Can you please share a reproducible example, including the error message? See this page for advice: https://stackoverflow.com/help/minimal-reproducible-example.
I did, however, take a shot at recreating the problem myself. Can you confirm if this is what you were seeing as well?
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")
# Extract all pages - works
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
print(doc._.page_range)
# (1, 4)
# Extract specific pages - will raise an error
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, page_numbers=[0, 1])
print(doc._.page_range)
Traceback (most recent call last):
  File "/Users/samedwardes/projects/personal/spacypdfreader/test.py", line 11, in <module>
    doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, page_numbers=[0, 1])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/samedwardes/projects/personal/spacypdfreader/spacypdfreader/spacypdfreader.py", line 158, in pdf_reader
    text = pdf_parser(pdf_path=pdf_path, page_number=page_num, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/samedwardes/projects/personal/spacypdfreader/spacypdfreader/parsers/pdfminer.py", line 60, in parser
    text = extract_text(pdf_path, page_numbers=[page_number], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: pdfminer.high_level.extract_text() got multiple values for keyword argument 'page_numbers'
This does look like a bug. The issue is that I set the value for page_numbers here:
I think parsing only one page at a time is required to keep the multiprocessing simple. However, I can see if there is a way to support extracting only certain pages.
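To illustrate why the call fails, here is a minimal sketch that reproduces the same TypeError without any PDF. The `extract_text` below is only a stand-in for `pdfminer.high_level.extract_text`, and `safe_parser` is a hypothetical guard shown for illustration, not the fix that was actually shipped.

```python
def extract_text(pdf_path, page_numbers=None, **kwargs):
    """Stand-in for pdfminer.high_level.extract_text (no PDF parsing here)."""
    return f"text from {pdf_path}, pages {page_numbers}"


def parser(pdf_path, page_number, **kwargs):
    # Mirrors the failing call in the traceback: page_numbers is passed
    # explicitly and can also arrive a second time inside **kwargs.
    return extract_text(pdf_path, page_numbers=[page_number], **kwargs)


def safe_parser(pdf_path, page_number, **kwargs):
    # Hypothetical guard (an assumption, not the library's code): drop the
    # caller's page_numbers so the keyword is only passed once.
    kwargs.pop("page_numbers", None)
    return extract_text(pdf_path, page_numbers=[page_number], **kwargs)


try:
    parser("test.pdf", 0, page_numbers=[0, 1])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(safe_parser("test.pdf", 0, page_numbers=[0, 1]))
```

Because the user's `page_numbers` is forwarded through `**kwargs`, Python sees the same keyword twice and raises before pdfminer does any work.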
Hi @DoubleCortado - I have released a new version (0.3.1) that now supports a new parameter called page_range. Could you update to 0.3.1 and give it a try?
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader(
    "tests/data/test_pdf_01.pdf",
    nlp,
    page_range=(2, 3)
)
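One thing to watch when switching from page_numbers to page_range: the `(1, 4)` printed earlier in this thread suggests page ranges are 1-based and inclusive, whereas pdfminer's `page_numbers` are zero-indexed. Assuming that reading is correct, the mapping between the two could be sketched like this. `page_range_to_page_numbers` is a hypothetical helper for illustration and is not part of spacypdfreader.

```python
def page_range_to_page_numbers(page_range):
    """Convert a 1-based inclusive (first, last) range to 0-based page numbers.

    Hypothetical helper: assumes page_range is 1-based and inclusive,
    as the (1, 4) output earlier in the thread suggests.
    """
    first, last = page_range
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {page_range!r}")
    # range() excludes its stop value, so `last` needs no adjustment.
    return list(range(first - 1, last))


print(page_range_to_page_numbers((2, 3)))
print(page_range_to_page_numbers((1, 2)))
```

Under that assumption, the first two pages that `page_numbers=[0, 1]` was meant to select would correspond to `page_range=(1, 2)`.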
Hello. Thank you for the update. I'm not sure why I could use the previous version of the package with Python 3.12, but now when trying to update the package to 0.3.1 I'm getting this error:
ERROR: Ignored the following versions that require a different python version: 0.3.0 Requires-Python >=3.8,<3.12; 0.3.1 Requires-Python >=3.8,<3.12
ERROR: Could not find a version that satisfies the requirement spacypdfreader==0.3.1 (from versions: 0.1.0, 0.1.1, 0.2.0, 0.2.1)
ERROR: No matching distribution found for spacypdfreader==0.3.1
Right now I only test against 3.8 to 3.11: https://github.com/SamEdwardes/spacypdfreader/blob/main/.github/workflows/pytest.yml
This is a good callout, though; Python 3.12 should work as well. I can fix this in a future release. I added this issue to track it: https://github.com/SamEdwardes/spacypdfreader/issues/21
For now, can you use an older version of Python?
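If it helps, here is a rough way to check whether an interpreter falls inside the `>=3.8,<3.12` constraint from the pip error above. This is only an approximation that compares major and minor versions; pip does the real check with full version-specifier parsing, and `is_supported` is a hypothetical name used for this sketch.

```python
import sys


def is_supported(version_info, minimum=(3, 8), upper_bound=(3, 12)):
    """Approximate the Requires-Python constraint >=3.8,<3.12.

    Only the (major, minor) pair is compared, which is a simplification
    of the full version-specifier rules that pip applies.
    """
    return minimum <= tuple(version_info[:2]) < upper_bound


# Check the interpreter that is running this script.
print(is_supported(sys.version_info))
```

On Python 3.12 this returns False, which matches pip ignoring 0.3.0 and 0.3.1 and only offering versions up to 0.2.1.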
Hello,
Could you please tell me what is wrong with the function below? I would like to parse only the first two pages of the PDF. When I call the function with the argument page_numbers=[0,1], it extracts text from all pages anyway.
The function is very slow and I would like to limit the number of pages parsed.
Thank you,