Open tigerinus opened 9 months ago
Having the same issue when importing partition_pdf
from unstructured.partition.pdf import partition_pdf
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[8], line 2
1 import os
----> 2 from unstructured.partition.pdf import partition_pdf
3 from unstructured.staging.base import elements_to_json
File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf.py:77
64 from unstructured.partition.common import (
65 convert_to_bytes,
66 document_to_element_list,
(...)
71 spooled_to_bytes_io_if_needed,
72 )
73 from unstructured.partition.lang import (
74 check_language_args,
75 prepare_languages_for_tesseract,
76 )
---> 77 from unstructured.partition.pdf_image.pdf_image_utils import (
78 annotate_layout_elements,
79 check_element_types_to_extract,
80 save_elements,
81 )
82 from unstructured.partition.pdf_image.pdfminer_processing import (
83 merge_inferred_with_extracted_layout,
84 )
85 from unstructured.partition.pdf_image.pdfminer_utils import (
86 open_pdfminer_pages_generator,
87 rect_to_bbox,
88 )
File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py:9
6 from pathlib import PurePath
7 from typing import TYPE_CHECKING, BinaryIO, List, Optional, Tuple, Union, cast
----> 9 import cv2
10 import numpy as np
11 import pdf2image
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Is there a workaround @tigerinus ?
@tigerinus, @mhfarahani what base image are you using? I'd like to replicate the described behavior on my side
Any distro that doesn't come with the required binary libGL.so.1 should be able to reproduce this issue.
In our case, it's a highly customized embedded Linux (Buildroot-based).
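For anyone trying to reproduce this, a quick way to tell whether a given image is affected is to ask the dynamic loader for the library directly, without installing unstructured first. The sketch below is only an illustration: the helper name can_load_shared_library is hypothetical and not part of unstructured or OpenCV, and it uses nothing beyond the standard-library ctypes module.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`.

    Hypothetical helper for illustration; not part of unstructured or OpenCV.
    """
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the loader cannot find or open the library,
        # which is the same kind of failure reported for libGL.so.1.
        return False
    return True


if __name__ == "__main__":
    print("libGL.so.1 loadable:", can_load_shared_library("libGL.so.1"))
```

If this prints False on your base image, importing the non-headless cv2 there should fail in the way shown in the traceback.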
As far as I can tell, the relevant dependency is layoutparser, which relies on opencv-python.
I've seen that there is a similar request to yours: https://github.com/Layout-Parser/layout-parser/issues/170
We have two options: a) create a PR in their package, or b) let them know that introducing the headless version in their repo is important and pressing, and wait until it's fixed there.
@tigerinus @adi-kmt @mhfarahani If you need a workaround now, I'd say you should modify your Dockerfiles in the following way:
# install unstructured library as usual
# uninstall the full version, install headless
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76
If opencv-python-headless is not sufficient, try opencv-contrib-python-headless.
Please let me know if that helps 🤝
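To confirm that the swap took effect inside the image, it can help to list which OpenCV distributions pip actually has installed. This is a minimal sketch, assuming Python 3.9+ and only the standard-library importlib.metadata; the function name installed_opencv_distributions is made up for this example, and the four names are the OpenCV wheels as published on PyPI.

```python
from importlib import metadata

# The four OpenCV wheel variants published on PyPI.
OPENCV_DISTRIBUTIONS = (
    "opencv-python",
    "opencv-contrib-python",
    "opencv-python-headless",
    "opencv-contrib-python-headless",
)


def installed_opencv_distributions() -> dict[str, str]:
    """Map each installed OpenCV distribution name to its version string.

    Hypothetical helper for illustration; not part of unstructured.
    """
    found = {}
    for name in OPENCV_DISTRIBUTIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This variant is not installed; skip it.
            continue
    return found


if __name__ == "__main__":
    for name, version in installed_opencv_distributions().items():
        print(f"{name}=={version}")
```

After the workaround, the expectation is that only a headless variant shows up. If a non-headless one is still listed, something installed later may have pulled it back in.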
Facing the same problem. The workaround works, thank you!
I've tried the workaround but now the error when importing partition_pdf is: ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package
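The "'cv2' is not a package" part of that message means Python resolved cv2 to a plain module instead of a package directory, so cv2.typing cannot be found under it. Why that happens in a particular environment is not established in this thread. As a rough diagnostic sketch (the helper name describe_module is hypothetical), the following inspects how a name resolves without importing it:

```python
import importlib.util


def describe_module(name: str) -> str:
    """Classify a top-level name as 'missing', 'package', or 'module'.

    Hypothetical helper for illustration. find_spec locates a top-level
    module without executing it, so a broken cv2 will not raise here.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "missing"
    if spec.submodule_search_locations is not None:
        # Packages (including namespace packages) can contain submodules.
        return "package"
    return "module"


if __name__ == "__main__":
    print("cv2 resolves as:", describe_module("cv2"))
    spec = importlib.util.find_spec("cv2")
    if spec is not None:
        print("origin:", spec.origin)
```

If cv2 resolves as a module, or the origin points somewhere unexpected, one thing to try is reinstalling the headless wheel in a clean layer.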
Hitting the same issue. Is there any news on whether this could be changed to the headless version?
Thanks everyone, we're going to take a look at this.
The workaround works; just make sure you run the uninstall after you've installed your requirements:
RUN pip install -r requirements.txt
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76
*edited as I forgot what project I was looking at
Describe the bug
Getting the following error when loading PDF files in a container image to be hosted in the cloud:
However, libGL.so.1 is part of Xorg binaries. We could switch to a full Linux distro to resolve this, but a better option is to have opencv-python-headless in the dependency requirements instead of opencv-python.