Bug: PDF file upload failed - Could not initialize tesseract

azaylamba commented 6 days ago

I am getting the following error when uploading certain PDF files. It is reproducible every time with those files.

Uploads work fine for most other PDF files.

Starting file converter batch job
Workspace ID: d951f6fb-f8c0-4fa6-ad64-3d3a243154df
Document ID: 31c07ab2-434d-4bb1-b156-a90ee161010c
Input bucket name: devchatbotstack-ragenginesdataimportupload-6qhws4pdvker
Input object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/Introducing NitroX.pdf
Output bucket name: devchatbotstack-ragenginesdataimportproces-ptrkl9g0s1v7
Output object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/31c07ab2-434d-4bb1-b156-a90ee161010c/content.txt
loader: <langchain_community.document_loaders.s3_file.S3FileLoader object at 0x7fce8b60a110>
(1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
Traceback (most recent call last):
  File "/app/main.py", line 81, in <module>
    main()
  File "/app/main.py", line 64, in main
    raise error
  File "/app/main.py", line 49, in main
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/s3_file.py", line 135, in _get_elements
    return partition(filename=file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/auto.py", line 341, in partition
    elements = partition_pdf(
               ^^^^^^^^^^^^^^
  File "/app/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 210, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 346, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 899, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 933, in _partition_pdf_or_image_with_ocr_from_image
    ocr_data = ocr_agent.get_layout_elements_from_image(image=image)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 217, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 96, in get_layout_elements_from_image
    ocr_regions = self.get_layout_from_image(image)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 50, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 596, in image_to_data
    return {
           ^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 598, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
                              ^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 573, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Sample file to reproduce the issue: FileUploadErrorSample.pdf

azaylamba commented 6 days ago

One observation: the issue seems to affect PDF files generated via the print function on Windows. For the files where I get the error, the PDF producer is "Microsoft: Print to PDF".

charles-marion commented 6 days ago

@azaylamba,

It looks like Tesseract needs its English training data (eng.traineddata) to convert these files.

Removing this line might fix the problem, but the Docker image will be bigger (and processing slower). Note that it's not the same folder.

https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5

An alternative solution, listed in https://github.com/Unstructured-IO/unstructured/issues/3290#issue-2371970753, would be to run apk add tesseract-eng in the Dockerfile (but that issue seems resolved, so maybe it's using an older base image?)
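
To confirm whether the training data is actually missing inside the container, a small check like the one below could help. It is only a sketch: the helper name find_traineddata is hypothetical, the default directory is copied from the error message in this issue, and it assumes TESSDATA_PREFIX points directly at the tessdata directory (as the error message suggests).

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default directory taken from the error message in this issue;
# other images may keep the data elsewhere (assumption).
DEFAULT_TESSDATA_DIR = "/usr/share/tessdata"


def find_traineddata(
    lang: str = "eng", env: Optional[Mapping[str, str]] = None
) -> Optional[Path]:
    """Return the path to <lang>.traineddata if the file exists, else None.

    Hypothetical helper: it looks only in TESSDATA_PREFIX (or the default
    directory above) and does not search any other location.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX", DEFAULT_TESSDATA_DIR)
    candidate = Path(prefix) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_traineddata("eng")
    print(f"eng.traineddata: {found if found else 'not found'}")
```

Running this in the same image as the file converter batch job should show whether eng.traineddata is present where Tesseract is looking for it.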

azaylamba commented 5 days ago

@charles-marion I tried the latest version of unstructured (0.16.9), but the issue persisted.

The issue is resolved after adding RUN apk add --no-cache tesseract-eng to https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile

So it seems tesseract-eng is required to process such PDF files.
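
One way to verify the fix after rebuilding the image is to ask the tesseract CLI which languages it can load. The sketch below assumes the tesseract binary is on PATH and that `tesseract --list-langs` prints a header line followed by one language code per line; the function names are hypothetical.

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes the first line is a header ("List of available languages ...")
    and every following non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines[1:] if line]


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Run the tesseract CLI and return the languages it reports."""
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case this tesseract build writes the list there.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("eng available" if "eng" in langs else f"eng missing, found: {langs}")
```

If eng does not appear in the list, the OCR path in the traceback above would be expected to fail in the same way.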

azaylamba commented 5 days ago

@charles-marion Please let me know whether you think this is the correct fix and whether you would like me to raise a PR.

aws-samples / aws-genai-llm-chatbot

Bug: PDF file upload failed - Could not initialize tesseract #614