Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/docker_tesseract_missing #3290

Open neilkumar opened 4 days ago

neilkumar commented 4 days ago

Describe the bug

The Docker images are missing tesseract.

To Reproduce

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured python3

from unstructured.partition.image import partition_image
elements = partition_image(filename="example-docs/DA-1p.png")
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 446, in get_tesseract_version
    output = subprocess.check_output(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.11/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'tesseract'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/image.py", line 103, in partition_image
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 580, in _partition_pdf_or_image_local
    final_document_layout = process_file_with_ocr(
                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf_image/ocr.py", line 174, in process_file_with_ocr
    raise e
  File "/app/unstructured/partition/pdf_image/ocr.py", line 140, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf_image/ocr.py", line 198, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 49, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 585, in image_to_data
    if get_tesseract_version(cached=True) < TESSERACT_MIN_VERSION:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 158, in wrapper
    wrapper._result = func(*args, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 453, in get_tesseract_version
    raise TesseractNotFoundError()
unstructured_pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

I traced this to the tesseract binary being missing from the image. Running:

apk upgrade

upgraded the tesseract package (along with a number of other outdated packages) to the latest version, which fixed the missing tesseract binary.
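A quick way to confirm whether the binary is visible from inside the container, before going through partition_image, is to look it up on PATH directly. This is only a minimal sketch: the helper name tesseract_version is made up for illustration and is not part of unstructured or unstructured_pytesseract.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if it is not on PATH.

    Hypothetical helper for illustration only.
    """
    path = shutil.which(binary)
    if path is None:
        # Same condition that surfaces as TesseractNotFoundError in the traceback.
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tesseract_version()
    print("tesseract not found on PATH" if version is None else version)
```

If this prints "tesseract not found on PATH" inside the container, the failure is at the image level rather than in the Python code.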

Running the partition_image call again produced a different error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/image.py", line 103, in partition_image
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 580, in _partition_pdf_or_image_local
    final_document_layout = process_file_with_ocr(
                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf_image/ocr.py", line 174, in process_file_with_ocr
    raise e
  File "/app/unstructured/partition/pdf_image/ocr.py", line 140, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf_image/ocr.py", line 198, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 49, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
           ^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
                              ^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/local/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

So the next issue was that TESSDATA_PREFIX is set to /usr/local/share/tessdata when it should be /usr/share/tessdata.

I fixed that and ran it again, but got the same error. It turns out there are no language files in the tessdata directory either.

Running

apk add tesseract-eng

fixed it, and partition_image then executed without error.
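The two data-related failures above (TESSDATA_PREFIX pointing at the wrong directory, and no language files in it) can be checked without invoking OCR at all. The sketch below assumes, as the error message in the traceback suggests, that TESSDATA_PREFIX names the directory that directly contains the <lang>.traineddata files. The function name missing_languages is hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def missing_languages(
    langs: Iterable[str], tessdata_dir: Optional[str] = None
) -> List[str]:
    """Return the languages whose <lang>.traineddata file is absent.

    Hypothetical helper for illustration only. If tessdata_dir is not
    given, the TESSDATA_PREFIX environment variable is used instead.
    """
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        raise ValueError("no tessdata directory given and TESSDATA_PREFIX is not set")
    root = Path(directory)
    # A directory that does not exist simply reports every language as missing.
    return [lang for lang in langs if not (root / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # The two candidate directories mentioned in this issue.
    for candidate in ("/usr/local/share/tessdata", "/usr/share/tessdata"):
        print(candidate, "missing:", missing_languages(["eng"], candidate))
```

An empty list for a directory means the English data is present there; a non-empty list for both directories matches the state described above, before tesseract-eng was installed.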

Expected behavior

partition_image should work on the sample image in example-docs.


christinestraub commented 3 days ago

@MthwRobinson This issue seems to be related to the wolfi base image. What do you think?

MthwRobinson commented 3 days ago

@neilkumar - Are you using the arm64 or amd64 image?

MthwRobinson commented 3 days ago

And actually, I didn't realize tesseract was available through the wolfi package manager now. We should switch to using that instead of the APK file we built, regardless.

neilkumar commented 3 days ago

@MthwRobinson Locally in my dev environment I'm using the arm64 builds, but for production we're using amd64.

MthwRobinson commented 3 days ago

Got it, thanks. Do you have the same issue with the amd64 image? The amd64 build runs all of our unit tests during CI, and that includes partition_image.

I should be able to take a look at this before the end of the week.

neilkumar commented 3 days ago

I haven't deployed to production yet, so I'm not sure about the amd64 image.

For my use case, I have a Dockerfile that builds on yours by adding a few internal utilities, and that's where I addressed the issue for now (by uninstalling your tesseract, installing the latest from wolfi, and then installing the language packs).