Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/parsing pdf error - new_cells as str has no "copy" #3119

Closed mpierangeli-q99 closed 1 month ago

mpierangeli-q99 commented 1 month ago

Bug Description

After parsing hundreds of similar PDFs successfully, an AttributeError was raised on one of them (not particularly different, just another brochure for a company product).

Error Output

  File "/usr/local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 570, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 622, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 83, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 221, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 312, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 516, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 85, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 181, in process_file_with_ocr
    raise e
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 169, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 247, in supplement_page_layout_with_ocr
    page_layout.elements[:] = supplement_element_with_table_extraction(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 297, in supplement_element_with_table_extraction
    text_as_html = cells_to_html(tatr_cells)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 675, in cells_to_html
    cells = sorted(fill_cells(cells), key=lambda k: (min(k["row_nums"]), min(k["column_nums"])))
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 663, in fill_cells
    new_cells = cells.copy()
                ^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'

Code Snippet

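from unstructured.partition.pdf import partition_pdf

# pdf_content (a binary file-like object), CHUNK_LENGTH, and temp_path are defined elsewhere.
raw_pdf_elements = partition_pdf(
    file=pdf_content,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=CHUNK_LENGTH,
    new_after_n_chars=CHUNK_LENGTH * 3,
    combine_text_under_n_chars=CHUNK_LENGTH,
    extract_image_block_output_dir=temp_path,
)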

Environment Info

I'm running this on Python 3.11 with:

onnx==1.16.1
pdf2image==1.17.0
pdfplumber==0.11.0
pdfminer.six==20231228
pillow_heif==0.16.0
pikepdf==8.15.1
opencv-python==4.9.0.80
unstructured-client==0.22.0
unstructured-inference==0.7.29
unstructured.pytesseract==0.3.12
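When one file out of hundreds fails like this, it can help to keep the batch running and record which document raised. The following is a minimal sketch, not part of the original report: `partition_all` and its `partition_fn` argument are hypothetical names, and `partition_fn` stands for any callable that takes a path and returns elements (for example, a small wrapper around the `partition_pdf` call above).

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def partition_all(
    paths: Iterable[str],
    partition_fn: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run partition_fn on every path, collecting results and failures separately."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = partition_fn(path)
        except Exception as exc:  # broad on purpose: any parser error should not stop the batch
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

The `failures` mapping then names the exact documents worth attaching to a bug report, together with the exception type and message.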

MthwRobinson commented 1 month ago

Hi @mpierangeli-q99 - Are you able to provide an example document we could use to reproduce the error?

mpierangeli-q99 commented 1 month ago

Hi @MthwRobinson, this is the PDF in question: testing_brochure_1.pdf. Ty (edit: wrong file)

christinestraub commented 1 month ago

Hi @mpierangeli-q99, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get the error with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.pdf import partition_pdf

# filename is the path to the PDF under test.
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
        extract_image_block_output_dir=".",
    )

print("\n\n".join(str(el) for el in elements))

mpierangeli-q99 commented 1 month ago

Hi @christinestraub, I think I mixed up the files, because that one works. This one I'm sure doesn't work: testing_brochure_2.pdf. FYI, I'm on unstructured==0.13.6 and unstructured-inference==0.7.29.
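Since the thread turns on which versions are installed, a quick way to confirm them from inside the running environment is the standard-library `importlib.metadata` module. This is a small sketch; `installed_versions` is a hypothetical helper name, not something provided by unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = installed_versions(["unstructured", "unstructured-inference"])
    for name, ver in report.items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Running this in the same interpreter that calls `partition_pdf` avoids reporting versions from a different virtual environment.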

christinestraub commented 1 month ago

Hi @mpierangeli-q99, I created a PR for a quick fix - https://github.com/Unstructured-IO/unstructured/pull/3130. The error occurred because the table is not recognized in the open-source version. I recommend using the API for improved table extraction performance.
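To illustrate the failure mode: the traceback shows `fill_cells` calling `.copy()` on a `str`, which means the value passed to `cells_to_html` was a string rather than a list of cell dictionaries. The sketch below shows one defensive pattern for that situation. It is not the change made in the PR, whose contents are not shown in this thread; `cells_to_html_or_empty` and its `to_html` parameter are hypothetical names used only for illustration.

```python
from typing import Any, Callable, List


def cells_to_html_or_empty(cells: Any, to_html: Callable[[List[dict]], str]) -> str:
    """Convert table cells to HTML, returning "" when no usable cell list is available.

    Assumption: when no table is recognized, the upstream step may hand back a
    non-list value (a str in the traceback above) instead of a list of cells.
    """
    if not isinstance(cells, list) or not cells:
        return ""
    return to_html(cells)
```

With a guard like this, an unrecognized table yields empty HTML for that element instead of aborting the whole document.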

mpierangeli-q99 commented 1 month ago

Ty @christinestraub !