Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/opencv-python should be `headless` to avoid dependency on Xorg #2503

Open tigerinus opened 9 months ago

tigerinus commented 9 months ago

Describe the bug

Getting the following error when loading PDF files in a container image that will be hosted in the cloud:

  ...
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/auto.py", line 81, in <module>
    from unstructured.partition.pdf import partition_pdf
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 76, in <module>
    from unstructured.partition.ocr import (
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/ocr.py", line 6, in <module>
    import cv2
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

However, libGL.so.1 is part of the Xorg binaries. We could switch to a full Linux distro to resolve this, but a better option is to list opencv-python-headless in the dependency requirements instead of opencv-python.
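To confirm that a given image is affected before installing anything, one can ask the dynamic loader directly whether it can open the library named in the ImportError. This is a minimal sketch using only the standard library; the helper name `can_load_shared_library` is made up for illustration and is not part of unstructured or OpenCV.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library (or one of its own dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libGL.so.1 is the library named in the ImportError above.
    if can_load_shared_library("libGL.so.1"):
        print("libGL.so.1 is loadable on this system")
    else:
        print("libGL.so.1 is missing; `import cv2` from opencv-python is likely to fail")
```

Running this inside the target container shows whether the base image is missing the library, independently of which Python packages are installed.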

mhfarahani commented 9 months ago

Having the same issue when importing partition_pdf:

from unstructured.partition.pdf import partition_pdf

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[8], line 2
      1 import os
----> 2 from unstructured.partition.pdf import partition_pdf
      3 from unstructured.staging.base import elements_to_json

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf.py:77
     64 from unstructured.partition.common import (
     65     convert_to_bytes,
     66     document_to_element_list,
   (...)
     71     spooled_to_bytes_io_if_needed,
     72 )
     73 from unstructured.partition.lang import (
     74     check_language_args,
     75     prepare_languages_for_tesseract,
     76 )
---> 77 from unstructured.partition.pdf_image.pdf_image_utils import (
     78     annotate_layout_elements,
     79     check_element_types_to_extract,
     80     save_elements,
     81 )
     82 from unstructured.partition.pdf_image.pdfminer_processing import (
     83     merge_inferred_with_extracted_layout,
     84 )
     85 from unstructured.partition.pdf_image.pdfminer_utils import (
     86     open_pdfminer_pages_generator,
     87     rect_to_bbox,
     88 )

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py:9
      6 from pathlib import PurePath
      7 from typing import TYPE_CHECKING, BinaryIO, List, Optional, Tuple, Union, cast
----> 9 import cv2
     10 import numpy as np
     11 import pdf2image

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

adi-kmt commented 8 months ago

Is there a workaround, @tigerinus?

micmarty-deepsense commented 8 months ago

@tigerinus, @mhfarahani what base image are you using? I'd like to replicate the described behavior on my side

tigerinus commented 8 months ago

> @tigerinus, @mhfarahani what base image are you using? I'd like to replicate the described behavior on my side

Any distro that doesn't ship the required libGL.so.1 binary should reproduce this issue.

In our case, it's a highly customized embedded Linux (Buildroot-based).

micmarty-deepsense commented 8 months ago

As far as I can tell, the relevant dependency is layoutparser, which relies on opencv-python. There is a request similar to yours in their repo: https://github.com/Layout-Parser/layout-parser/issues/170

We have two options: a) create a PR in their package, or b) let them know that introducing the headless version is important/pressing and wait until it's fixed there.

@tigerinus @adi-kmt @mhfarahani If you need a workaround now, I'd say you should modify your Dockerfiles in the following way:

# install unstructured library as usual

# uninstall the full version, install headless
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76

If opencv-python-headless is not sufficient, try opencv-contrib-python-headless.

Please let me know if that helps 🤝
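After applying the workaround, it can help to verify which OpenCV wheels actually ended up in the environment, since opencv-python, opencv-contrib-python and their -headless counterparts all install the same cv2 module. The following is a sketch using only the standard library; the function names are hypothetical helpers, not part of unstructured or OpenCV.

```python
from importlib import metadata

# The four wheels published by the opencv-python project.
# All of them provide the same top-level `cv2` module.
OPENCV_DISTRIBUTIONS = (
    "opencv-python",
    "opencv-contrib-python",
    "opencv-python-headless",
    "opencv-contrib-python-headless",
)


def installed_opencv_distributions() -> dict[str, str]:
    """Map each installed OpenCV distribution name to its version."""
    found = {}
    for name in OPENCV_DISTRIBUTIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


def has_gui_build(found: dict[str, str]) -> bool:
    """Return True if any non-headless OpenCV distribution is listed."""
    return any(not name.endswith("-headless") for name in found)


if __name__ == "__main__":
    found = installed_opencv_distributions()
    print(found)
    if has_gui_build(found):
        print("A GUI build of OpenCV is still installed")
```

If the script still reports a non-headless distribution after the Dockerfile change, the uninstall step probably ran before another install step pulled the GUI build back in.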

FilippTrigub commented 8 months ago

Facing the same problem. The workaround works, thank you!

laurazpm commented 7 months ago

I've tried the workaround, but now importing partition_pdf fails with a different error: ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package
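The "'cv2' is not a package" part of that message means Python resolved cv2 to a plain module rather than a package directory, so no cv2.typing submodule can be found under it. Why that happens in a given environment is not established in this thread, but the resolution can be inspected without importing cv2 at all. This is a sketch using only the standard library; `describe_module` is a hypothetical helper name.

```python
import importlib.util
from typing import Optional


def describe_module(name: str) -> Optional[dict]:
    """Report where a top-level module resolves to, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return {
        # File the module would be loaded from.
        "origin": spec.origin,
        # Packages have submodule search locations; plain modules do not.
        "is_package": spec.submodule_search_locations is not None,
    }


if __name__ == "__main__":
    print(describe_module("cv2"))
```

If `is_package` is False, the `origin` path shows which file is being picked up as cv2, which is a starting point for checking what is left in site-packages.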

Robs-Git-Hub commented 6 months ago

Hitting the same issue. Is there any news on whether this could be changed to the headless version?

MthwRobinson commented 6 months ago

Thanks everyone, we're going to take a look at this.

pjaol commented 6 months ago

The workaround works; just make sure you run the uninstall after you've installed your requirements:

RUN pip install -r requirements.txt
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76

*edited as I forgot what project I was looking at