Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

partition_pdf got TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format' #3253

Closed liyang79 closed 2 months ago

liyang79 commented 3 months ago

Describe the bug

partition_pdf raises TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

To Reproduce

from unstructured.partition.pdf import partition_pdf

filename = "./data/salesforce-fy24-annual-report.pdf"
# file downloaded from https://s23.q4cdn.com/574569502/files/doc_financials/2024/ar/salesforce-fy24-annual-report.pdf
elements = partition_pdf(filename=filename, strategy="hi_res", infer_table_structure=True)

Expected behavior

No error.

Screenshots

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 elements = partition_pdf(
      2     filename=file_path,
      3 
      4     # Unstructured Helpers
      5     strategy="hi_res", 
      6     infer_table_structure=True, 
      7 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/documents/elements.py:593, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    591 @functools.wraps(func)
    592 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> list[Element]:
--> 593     elements = func(*args, **kwargs)
    594     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    596     regex_metadata: dict["str", "str"] = call_args.get("regex_metadata", {})

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:626, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    624 @functools.wraps(func)
    625 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 626     elements = func(*args, **kwargs)
    627     params = get_call_args_applying_defaults(func, *args, **kwargs)
    628     include_metadata = params.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:582, in add_metadata.<locals>.wrapper(*args, **kwargs)
    580 @functools.wraps(func)
    581 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 582     elements = func(*args, **kwargs)
    583     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    584     include_metadata = call_args.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/chunking/dispatch.py:74, in add_chunking_strategy.<locals>.wrapper(*args, **kwargs)
     71 """The decorated function is replaced with this one."""
     73 # -- call the partitioning function to get the elements --
---> 74 elements = func(*args, **kwargs)
     76 # -- look for a chunking-strategy argument --
     77 call_args = get_call_args_applying_defaults(func, *args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:192, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    188 exactly_one(filename=filename, file=file)
    190 languages = check_language_args(languages or [], ocr_languages) or ["eng"]
--> 192 return partition_pdf_or_image(
    193     filename=filename,
    194     file=file,
    195     include_page_breaks=include_page_breaks,
    196     strategy=strategy,
    197     infer_table_structure=infer_table_structure,
    198     languages=languages,
    199     metadata_last_modified=metadata_last_modified,
    200     hi_res_model_name=hi_res_model_name,
    201     extract_images_in_pdf=extract_images_in_pdf,
    202     extract_image_block_types=extract_image_block_types,
    203     extract_image_block_output_dir=extract_image_block_output_dir,
    204     extract_image_block_to_payload=extract_image_block_to_payload,
    205     date_from_file_object=date_from_file_object,
    206     starting_page_number=starting_page_number,
    207     extract_forms=extract_forms,
    208     form_extraction_skip_tables=form_extraction_skip_tables,
    209     **kwargs,
    210 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:288, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    286     with warnings.catch_warnings():
    287         warnings.simplefilter("ignore")
--> 288         elements = _partition_pdf_or_image_local(
    289             filename=filename,
    290             file=spooled_to_bytes_io_if_needed(file),
    291             is_image=is_image,
    292             infer_table_structure=infer_table_structure,
    293             include_page_breaks=include_page_breaks,
    294             languages=languages,
    295             metadata_last_modified=metadata_last_modified or last_modification_date,
    296             hi_res_model_name=hi_res_model_name,
    297             pdf_text_extractable=pdf_text_extractable,
    298             extract_images_in_pdf=extract_images_in_pdf,
    299             extract_image_block_types=extract_image_block_types,
    300             extract_image_block_output_dir=extract_image_block_output_dir,
    301             extract_image_block_to_payload=extract_image_block_to_payload,
    302             starting_page_number=starting_page_number,
    303             extract_forms=extract_forms,
    304             form_extraction_skip_tables=form_extraction_skip_tables,
    305             **kwargs,
    306         )
    307         out_elements = _process_uncategorized_text_elements(elements)
    309 elif strategy == PartitionStrategy.FAST:

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:580, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, hi_res_model_name, pdf_image_dpi, metadata_last_modified, pdf_text_extractable, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, analysis, analyzed_image_output_dir_path, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    573         # NOTE(christine): merged_document_layout = extracted_layout + inferred_layout
    574         merged_document_layout = merge_inferred_with_extracted_layout(
    575             inferred_document_layout=inferred_document_layout,
    576             extracted_layout=extracted_layout,
    577             hi_res_model_name=hi_res_model_name,
    578         )
--> 580         final_document_layout = process_file_with_ocr(
    581             filename,
    582             merged_document_layout,
    583             extracted_layout=extracted_layout,
    584             is_image=is_image,
    585             infer_table_structure=infer_table_structure,
    586             ocr_languages=ocr_languages,
    587             ocr_mode=ocr_mode,
    588             pdf_image_dpi=pdf_image_dpi,
    589         )
    590 else:
    591     inferred_document_layout = process_data_with_model(
    592         file,
    593         is_image=is_image,
    594         model_name=hi_res_model_name,
    595         pdf_image_dpi=pdf_image_dpi,
    596     )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:166, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    164 except Exception as e:
    165     if os.path.isdir(filename) or os.path.isfile(filename):
--> 166         raise e
    167     else:
    168         raise FileNotFoundError(f'File "{filename}" not found!') from e

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:154, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    152     extracted_regions = extracted_layout[i] if i < len(extracted_layout) else None
    153     with PILImage.open(image_path) as image:
--> 154         merged_page_layout = supplement_page_layout_with_ocr(
    155             page_layout=out_layout.pages[i],
    156             image=image,
    157             infer_table_structure=infer_table_structure,
    158             ocr_languages=ocr_languages,
    159             ocr_mode=ocr_mode,
    160             extracted_regions=extracted_regions,
    161         )
    162         merged_page_layouts.append(merged_page_layout)
    163 return DocumentLayout.from_pages(merged_page_layouts)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:232, in supplement_page_layout_with_ocr(page_layout, image, infer_table_structure, ocr_languages, ocr_mode, extracted_regions)
    229     if tables.tables_agent is None:
    230         raise RuntimeError("Unable to load table extraction agent.")
--> 232     page_layout.elements[:] = supplement_element_with_table_extraction(
    233         elements=cast(List["LayoutElement"], page_layout.elements),
    234         image=image,
    235         tables_agent=tables.tables_agent,
    236         ocr_languages=ocr_languages,
    237         ocr_agent=ocr_agent,
    238         extracted_regions=extracted_regions,
    239     )
    241 return page_layout

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:279, in supplement_element_with_table_extraction(elements, image, tables_agent, ocr_languages, ocr_agent, extracted_regions)
    264 cropped_image = image.crop(
    265     (
    266         padded_element.bbox.x1,
   (...)
    270     ),
    271 )
    272 table_tokens = get_table_tokens(
    273     table_element_image=cropped_image,
    274     ocr_languages=ocr_languages,
   (...)
    277     table_element=padded_element,
    278 )
--> 279 tatr_cells = tables_agent.predict(
    280     cropped_image, ocr_tokens=table_tokens, result_format="cells"
    281 )
    283 # NOTE(christine): `tatr_cells == ""` means that the table was not recognized
    284 text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)

TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

Environment Info

Output of python scripts/collect_env.py:

/data/projects/collect_env.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-glibc2.28
Python version:  3.11.5
unstructured version:  0.14.6
unstructured-inference version:  0.7.15
pytesseract version:  0.3.10
Torch version:  2.3.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.11
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 5.3.6.1 30(Build:1)

SystemAgent commented 3 months ago

Hi! I have been getting the same error today when trying to use partition_pdf - TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

With infer_table_structure=False the PDF is partitioned without error, but that is not a solution in my case, since the tables are the critical elements I need to extract.
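The traceback shows the caller passing result_format="cells" to a predict() method that does not declare that keyword. A minimal sketch of how to check this for an installed method, using only the standard library, is below. The helper name accepts_keyword and the two stand-in classes are hypothetical and only illustrate the check; they are not the library's code.

```python
import inspect
from typing import Callable


def accepts_keyword(func: Callable, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for a predict() without and with the keyword.
class OldModel:
    def predict(self, x, ocr_tokens=None):
        return ""


class NewModel:
    def predict(self, x, ocr_tokens=None, result_format=None):
        return ""


print(accepts_keyword(OldModel.predict, "result_format"))
print(accepts_keyword(NewModel.predict, "result_format"))
```

To run the same check against the real class, pass its predict method to accepts_keyword. The class name UnstructuredTableTransformerModel comes from the error message; importing it from unstructured_inference.models.tables is an assumption about the package layout.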

IngLP commented 3 months ago

I have the same problem here!

christinestraub commented 3 months ago

Hi @liyang79 @IngLP

I think you're using an old version of the unstructured-inference library (0.7.15). You shouldn't get this error if you upgrade both unstructured-inference and unstructured to their latest versions.
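Before and after upgrading, it can help to confirm which versions the running interpreter actually sees. The sketch below uses the standard library's importlib.metadata; the helper name installed_versions is hypothetical, and the two distribution names are taken from the environment report above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["unstructured", "unstructured-inference"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

If the reported unstructured-inference version is older than expected, upgrading both packages together (for example with pip install --upgrade unstructured unstructured-inference) and restarting the Python process or notebook kernel should bring the two back in sync.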

liyang79 commented 3 months ago

@christinestraub You're right. The problem was solved after upgrading to the latest unstructured-inference library. Thanks.

IngLP commented 3 months ago

@christinestraub I have Python 3.10, unstructured = {extras = ["pdf"], version = "^0.14.8"} in my Poetry config, and unstructured 0.14.8 with unstructured-inference 0.7.36 installed, but I still get the error.

christinestraub commented 3 months ago

@IngLP Can you please provide a PDF document that we could use to reproduce the error?

IngLP commented 2 months ago

Hi @christinestraub, I deleted and recreated the whole Python environment and now everything works. Thank you for your help.
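The thread does not say what was wrong with the old environment. When the reported versions look correct but the error persists, one thing worth checking before rebuilding is whether the running interpreter is the one that was upgraded. The sketch below prints the interpreter path and the file a module would be imported from; the helper name module_location is hypothetical, and the module name unstructured_inference is assumed from the package name.

```python
import sys
from importlib.util import find_spec


def module_location(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name in ("unstructured", "unstructured_inference"):
        print(f"{name}: {module_location(name) or 'not importable'}")
```

If the printed paths point to a different environment than the one that was upgraded (for example, a notebook kernel bound to another interpreter), the old code is still the one being imported.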