Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug: TesseractError: Estimating resolution as X #2900

Open qued opened 3 months ago

qued commented 3 months ago

Describe the bug: User gets a TesseractError when processing a particular document.

To Reproduce: Code was an API call with a certain image-based document.

Expected behavior: Document processed successfully.

Environment Info: Running in self-hosted open-source API. Unstructured version 0.12.3. Tesseract version 5.3.3.

Additional context: User was able to successfully process the document with Tesseract version 4.1.1.

Stack trace:

  File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
    raise e
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
  File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')

qued commented 3 months ago

Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139

We've previously encountered this error in #1920 and closed the issue with #1996. The user is running a version of unstructured with the fix merged, so presumably this is the same error showing up for a different reason.
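
The error tuple in the trace can be unpacked a little further. The following is a minimal sketch, not part of unstructured or unstructured_pytesseract; `describe_tesseract_failure` is a hypothetical helper. It assumes the first element of the tuple is the raw `proc.returncode` (as the `raise TesseractError(proc.returncode, ...)` line in the trace suggests) and relies on the documented behavior of Python's `subprocess` on POSIX, where a return code of -N means the child process was terminated by signal N. Under those assumptions, -8 would point at the tesseract process being killed by a signal, and the resolution message may simply be the last thing it wrote to stderr.

```python
import re
import signal
from typing import Optional


def describe_tesseract_failure(status: int, message: str) -> dict:
    """Hypothetical helper: split a (status, message) error pair into parts."""
    # Pull the estimated DPI out of the message, if it is present.
    match = re.search(r"Estimating resolution as (\d+)", message)
    estimated_dpi: Optional[int] = int(match.group(1)) if match else None

    # On POSIX, subprocess reports "terminated by signal N" as returncode -N.
    signal_name: Optional[str] = None
    if status < 0:
        try:
            signal_name = signal.Signals(-status).name
        except ValueError:
            # Not a signal number known on this platform.
            signal_name = None

    return {
        "status": status,
        "estimated_dpi": estimated_dpi,
        "signal": signal_name,
    }


# Values copied from the last line of the stack trace in this issue.
print(describe_tesseract_failure(-8, "Estimating resolution as 252"))
```

If the signal name comes back as SIGFPE (signal 8 on Linux), that would be consistent with the process crashing rather than with tesseract rejecting the resolution, but this is an inference from the return code, not something confirmed in this thread.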

esakes1 commented 1 month ago

@qued, @scanny: Any update on the above issue?

scanny commented 1 month ago

@esakes1 Can you say more about what you're seeing and when? In particular, which specific error message are you getting (including the estimated resolution)?

And can you provide an example document with which we can reproduce?
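
One way to gather the details requested above is to wrap the failing call and record the exception type, its arguments, and the full traceback. This is only a sketch: `capture_failure` is a hypothetical helper, not part of unstructured, and it takes the partitioning function as a parameter so it makes no assumptions about how the document is being processed.

```python
import platform
import traceback
from typing import Any, Callable


def capture_failure(partition_fn: Callable[[str], Any], filename: str) -> dict:
    """Hypothetical helper: run partition_fn(filename) and record any failure."""
    report = {
        "filename": filename,
        "python": platform.python_version(),
        "ok": True,
        "error_type": None,
        "error_args": None,
        "traceback": None,
    }
    try:
        partition_fn(filename)
    except Exception as exc:
        # Keep the raw args so the status code and message are both preserved.
        report.update(
            ok=False,
            error_type=type(exc).__name__,
            error_args=exc.args,
            traceback=traceback.format_exc(),
        )
    return report
```

With unstructured installed, this could be called as `capture_failure(lambda f: partition_pdf(filename=f), "example.pdf")`, where `partition_pdf` is the function at the top of the stack trace and `example.pdf` is a placeholder for the document that fails. The Tesseract and unstructured versions would still need to be reported separately.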