mpierangeli-q99 closed 1 month ago
Hi @mpierangeli-q99 - Are you able to provide an example document we could use to reproduce the error?
testing_brochure_1.pdf Hi @MthwRobinson this is the pdf in question. Ty (edit: wrong file)
Hi @mpierangeli-q99, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.
$ pip install unstructured -U
$ pip install unstructured-inference -U
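To confirm which versions are actually installed before and after upgrading, a small check like the one below can help. It is only a sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the two packages discussed in this thread in pip-style "name==version" form.
for name in ("unstructured", "unstructured-inference"):
    print(f"{name}=={installed_version(name)}")
```

A result of `None` means the package is not installed in the interpreter that ran the check, which is worth ruling out when several environments are in play.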
from unstructured.partition.pdf import partition_pdf

# filename: path to the PDF under test
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
        extract_image_block_output_dir=".",
    )
print("\n\n".join([str(el) for el in elements]))
Hi @christinestraub, I think I mixed up the files, because that one works. testing_brochure_2.pdf This one I'm sure doesn't work. FYI: unstructured==0.13.6, unstructured-inference==0.7.29
Hi @mpierangeli-q99, I created a PR for a quick fix - https://github.com/Unstructured-IO/unstructured/pull/3130. The error occurred because the table is not recognized in the open-source version. I recommend using the API for improved table extraction performance.
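For readers following along, the failure mode can be sketched in isolation. The traceback in the report ends with `.copy()` being called on a `str`, which suggests the table model returned an empty string rather than a list of cell dicts when no table was recognized. The guard below is a minimal illustration under that assumption. It is not the code from the linked PR, and both `table_html_or_empty` and `fake_cells_to_html` are hypothetical names.

```python
from typing import Any, Callable, List


def fake_cells_to_html(cells: List[dict]) -> str:
    """Stand-in for a cells-to-HTML converter (not the real library function)."""
    copied = cells.copy()  # the same kind of call that fails when `cells` is a str
    return "<table>" + "".join(f"<td>{c['text']}</td>" for c in copied) + "</table>"


def table_html_or_empty(cells: Any, to_html: Callable[[List[dict]], str]) -> str:
    """Convert cells to HTML, or return "" when no table cells were produced."""
    if not isinstance(cells, list) or not cells:
        # Assumption: a non-list (for example "") means no table was recognized.
        return ""
    return to_html(cells)


print(repr(table_html_or_empty("", fake_cells_to_html)))
print(table_html_or_empty([{"text": "a"}], fake_cells_to_html))
```

Checking the type before converting lets the rest of the page continue through partitioning even when one table region yields nothing.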
Ty @christinestraub !
Bug Description
After parsing hundreds of similar PDFs successfully, an AttributeError was raised on one of them (it is not noticeably different from the others, just another company product brochure).
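When a batch of several hundred files hits one bad document like this, it can help to isolate failures per file so that the rest of the run completes and the offending PDF is easy to identify. The sketch below assumes a caller-supplied `partition_fn`; the helper name `partition_many` is hypothetical and not part of unstructured.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def partition_many(
    paths: Iterable[str],
    partition_fn: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run `partition_fn` on each path, collecting results and failures separately."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = partition_fn(path)
        except Exception as exc:  # keep going; record which file failed and why
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

In practice `partition_fn` could be something like `lambda p: partition_pdf(filename=p, infer_table_structure=True)`, and the `failures` mapping then lists exactly the files worth attaching to a bug report.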
Error Output
  File "/usr/local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 570, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 622, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 83, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 221, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 312, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 516, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 85, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 181, in process_file_with_ocr
    raise e
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 169, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 247, in supplement_page_layout_with_ocr
    page_layout.elements[:] = supplement_element_with_table_extraction(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 297, in supplement_element_with_table_extraction
    text_as_html = cells_to_html(tatr_cells)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 675, in cells_to_html
    cells = sorted(fill_cells(cells), key=lambda k: (min(k["row_nums"]), min(k["column_nums"])))
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 663, in fill_cells
    newcells = cells.copy()
               ^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'
Code Snippet
Environment Info
I'm running this on Python 3.11:
onnx==1.16.1
pdf2image==1.17.0
pdfplumber==0.11.0
pdfminer.six==20231228
pillow_heif==0.16.0
pikepdf==8.15.1
opencv-python==4.9.0.80
unstructured-client==0.22.0
unstructured-inference==0.7.29
unstructured.pytesseract==0.3.12