Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks

bug/<short-name> #3283

Closed rs-03 closed 3 days ago

rs-03 commented 4 days ago

Describe the bug

    File /opt/conda/lib/python3.12/site-packages/unstructured_inference/models/tables.py:667, in fill_cells(cells)
        650 def fill_cells(cells: List[dict]) -> List[dict]:
        651     """fills the missing cells in the table by adding a cells with empty text
        652     where there are no cells detected by the model.
        653
       (...)
        665
        666     """
    --> 667     table_rows_no = max({row for cell in cells for row in cell["row_nums"]})
        668     table_cols_no = max({col for cell in cells for col in cell["column_nums"]})
        669     filled = np.zeros((table_rows_no + 1, table_cols_no + 1), dtype=bool)

    ValueError: max() iterable argument is empty

To Reproduce

    from unstructured.partition.pdf import partition_pdf

    raw_pdf_elements = partition_pdf(
        filename=path + "/file_name.pdf",
        # Unstructured first finds embedded image blocks
        extract_images_in_pdf=False,
        # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
        # Titles are any sub-section of the document
        infer_table_structure=True,
        # Post processing to aggregate text once we have the title
        chunking_strategy="by_title",
        # Chunking params to aggregate text blocks
        # Attempt to create a new chunk 3800 chars
        # Attempt to keep chunks > 2000 chars
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=path
    )

Expected behavior

Text and Table elements should have been extracted


Environment Info

OS version: Linux-6.1.92-99.174.amzn2023.x86_64-x86_64-with-glibc2.35
Python version: 3.12.3
unstructured version: 0.14.7
unstructured-inference version: 0.7.35
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 version: None
PaddleOCR version: None
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version: LibreOffice 7.3.7.2 30(Build:2)

christinestraub commented 4 days ago

Hi @rs-03, this was addressed in https://github.com/Unstructured-IO/unstructured-inference/pull/359. You'll need to upgrade unstructured-inference to 0.7.36.
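
For context on the traceback in the report: the two `max()` calls shown at lines 667 and 668 build a set from the `row_nums` and `column_nums` of every cell, so an empty `cells` list yields an empty set and Python's built-in `max()` raises `ValueError`. The sketch below isolates that failure mode. The function names and the guard are hypothetical and only illustrate the idea; this is not the code from the linked pull request, whose actual change is not shown in this thread.

```python
from typing import List, Optional, Tuple


def max_row_and_col(cells: List[dict]) -> Tuple[int, int]:
    """Mirror the two max() calls from the traceback (illustrative name)."""
    rows = {row for cell in cells for row in cell["row_nums"]}
    cols = {col for cell in cells for col in cell["column_nums"]}
    # max() on an empty set raises ValueError, as in the report.
    return max(rows), max(cols)


def max_row_and_col_guarded(cells: List[dict]) -> Optional[Tuple[int, int]]:
    """Hypothetical guard: return None when no rows or columns were detected."""
    rows = {row for cell in cells for row in cell["row_nums"]}
    cols = {col for cell in cells for col in cell["column_nums"]}
    if not rows or not cols:
        return None
    return max(rows), max(cols)


if __name__ == "__main__":
    try:
        max_row_and_col([])
    except ValueError as exc:
        print(f"unguarded call failed: {exc}")
    print(max_row_and_col_guarded([]))
    print(max_row_and_col_guarded([{"row_nums": [0, 1], "column_nums": [2]}]))
```

This suggests the error is triggered when the table model detects no cells for a table region, which would explain why it depends on the input PDF.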

rs-03 commented 3 days ago

Thanks @christinestraub. Upgrading unstructured-inference to 0.7.36 fixed my issue.
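
For anyone landing here with the same traceback, a quick way to confirm that the environment actually has the fixed release is to read the installed version with the standard library's `importlib.metadata`. This is a minimal sketch: the helper names are made up for illustration, and the comparison assumes purely numeric dotted version strings such as `0.7.36` (pre-release suffixes would need a proper version parser).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '0.7.36' into (0, 7, 36). Assumes numeric dotted versions only."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: str, minimum: str = "0.7.36") -> bool:
    """True if the installed version is at least the release named in this thread."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str = "unstructured-inference") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("unstructured-inference is not installed")
    else:
        print(f"unstructured-inference {found}, includes fix: {has_fix(found)}")
```

If the check reports an older version, upgrading with `pip install --upgrade unstructured-inference` (or pinning `unstructured-inference>=0.7.36`) should bring in the release mentioned above.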