Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Issue with partition_pdf #2316

Closed pranavbhat12 closed 4 months ago

pranavbhat12 commented 10 months ago

While trying to read pdf file with partition_pdf function I am getting this error:

RuntimeError                              Traceback (most recent call last)
Cell In[11], line 7
      4 from unstructured.partition.pdf import partition_pdf
      6 # Get elements
----> 7 raw_pdf_elements = partition_pdf(
      8     filename="docs/sample.pdf",
      9     # Unstructured first finds embedded image blocks
     10     extract_images_in_pdf=False,
     11     strategy="hi_res",
     12     # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
     13     # Titles are any sub-section of the document
     14     infer_table_structure=True,
     15     # Post processing to aggregate text once we have the title
     16     chunking_strategy="by_title",
     17     # Chunking params to aggregate text blocks
     18     # Attempt to create a new chunk 3800 chars
     19     # Attempt to keep chunks > 2000 chars
     20     max_characters=1000,
     21     new_after_n_chars=500,
     22     combine_text_under_n_chars=200,
     23     image_output_dir_path="docs/",
     24
     25 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/documents/elements.py:514, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    512 @functools.wraps(func)
    513 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 514     elements = func(*args, **kwargs)
    515     sig = inspect.signature(func)
    516     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/file_utils/filetype.py:591, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    589 @functools.wraps(func)
    590 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 591     elements = func(*args, **kwargs)
    592     sig = inspect.signature(func)
    593     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/file_utils/filetype.py:546, in add_metadata.<locals>.wrapper(*args, **kwargs)
    544 @functools.wraps(func)
    545 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 546     elements = func(*args, **kwargs)
    547     sig = inspect.signature(func)
    548     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/chunking/__init__.py:52, in add_chunking_strategy.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     50 @functools.wraps(func)
     51 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
---> 52     elements = func(*args, **kwargs)
     53     sig = inspect.signature(func)
     54     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:191, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, extract_images_in_pdf, extract_element_types, image_output_dir_path, **kwargs)
    187 exactly_one(filename=filename, file=file)
    189 languages = check_languages(languages, ocr_languages)
--> 191 return partition_pdf_or_image(
    192     filename=filename,
    193     file=file,
    194     include_page_breaks=include_page_breaks,
    195     strategy=strategy,
    196     infer_table_structure=infer_table_structure,
    197     languages=languages,
    198     metadata_last_modified=metadata_last_modified,
    199     extract_images_in_pdf=extract_images_in_pdf,
    200     extract_element_types=extract_element_types,
    201     image_output_dir_path=image_output_dir_path,
    202     **kwargs,
    203 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:505, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, extract_images_in_pdf, extract_element_types, image_output_dir_path, **kwargs)
    503     with warnings.catch_warnings():
    504         warnings.simplefilter("ignore")
--> 505         elements = _partition_pdf_or_image_local(
    506             filename=filename,
    507             file=spooled_to_bytes_io_if_needed(file),
    508             is_image=is_image,
    509             infer_table_structure=infer_table_structure,
    510             include_page_breaks=include_page_breaks,
    511             languages=languages,
    512             metadata_last_modified=metadata_last_modified or last_modification_date,
    513             pdf_text_extractable=pdf_text_extractable,
    514             extract_images_in_pdf=extract_images_in_pdf,
    515             extract_element_types=extract_element_types,
    516             image_output_dir_path=image_output_dir_path,
    517             **kwargs,
    518         )
    519     out_elements = _process_uncategorized_text_elements(elements)
    521 elif strategy == PartitionStrategy.FAST:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/utils.py:214, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    205 if len(missing_deps) > 0:
    206     raise ImportError(
    207         f"Following dependencies are missing: {', '.join(missing_deps)}. "
    208         + (
   (...)
    212         ),
    213     )
--> 214 return func(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf.py:321, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, metadata_last_modified, pdf_text_extractable, extract_images_in_pdf, extract_element_types, image_output_dir_path, pdf_image_dpi, analysis, analyzed_image_output_dir_path, **kwargs)
    319         final_document_layout = merged_document_layout
    320     else:
--> 321         final_document_layout = process_file_with_ocr(
    322             filename,
    323             merged_document_layout,
    324             is_image=is_image,
    325             infer_table_structure=infer_table_structure,
    326             ocr_languages=ocr_languages,
    327             ocr_mode=ocr_mode,
    328             pdf_image_dpi=pdf_image_dpi,
    329         )
    330 else:
    331     inferred_document_layout = process_data_with_model(
    332         file,
    333         is_image=is_image,
    334         model_name=model_name,
    335         pdf_image_dpi=pdf_image_dpi,
    336     )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:171, in process_file_with_ocr(filename, out_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    169 except Exception as e:
    170     if os.path.isdir(filename) or os.path.isfile(filename):
--> 171         raise e
    172     else:
    173         raise FileNotFoundError(f'File "{filename}" not found!') from e

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:160, in process_file_with_ocr(filename, out_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    158 for i, image_path in enumerate(image_paths):
    159     with PILImage.open(image_path) as image:
--> 160         merged_page_layout = supplement_page_layout_with_ocr(
    161             out_layout.pages[i],
    162             image,
    163             infer_table_structure=infer_table_structure,
    164             ocr_languages=ocr_languages,
    165             ocr_mode=ocr_mode,
    166         )
    167         merged_page_layouts.append(merged_page_layout)
    168 return DocumentLayout.from_pages(merged_page_layouts)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:237, in supplement_page_layout_with_ocr(page_layout, image, infer_table_structure, ocr_languages, ocr_mode)
    234 if tables.tables_agent is None:
    235     raise RuntimeError("Unable to load table extraction agent.")
--> 237 page_layout.elements[:] = supplement_element_with_table_extraction(
    238     elements=cast(List[LayoutElement], page_layout.elements),
    239     image=image,
    240     tables_agent=tables.tables_agent,
    241     ocr_languages=ocr_languages,
    242     ocr_agent=ocr_agent,
    243 )
    245 return page_layout

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured/partition/pdf_image/ocr.py:274, in supplement_element_with_table_extraction(elements, image, tables_agent, ocr_languages, ocr_agent)
    263 cropped_image = image.crop(
    264     (
    265         padded_element.bbox.x1,
   (...)
    269     ),
    270 )
    271 table_tokens = get_table_tokens(
    272     image=cropped_image, ocr_languages=ocr_languages, ocr_agent=ocr_agent
    273 )
--> 274 element.text_as_html = tables_agent.predict(cropped_image, ocr_tokens=table_tokens)
    275 return elements

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:53, in UnstructuredTableTransformerModel.predict(self, x, ocr_tokens)
     37 """Predict table structure deferring to run_prediction with ocr tokens
     38
     39 Note:
   (...)
     50 FIXME: refactor token data into a dataclass so we have clear expectations of the fields
     51 """
     52 super().predict(x)
---> 53 return self.run_prediction(x, ocr_tokens=ocr_tokens)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:182, in UnstructuredTableTransformerModel.run_prediction(self, x, pad_for_structure_detection, ocr_tokens, result_format)
    174 def run_prediction(
    175     self,
    176     x: Image,
   (...)
    179     result_format: Optional[str] = "html",
    180 ):
    181     """Predict table structure"""
--> 182     outputs_structure = self.get_structure(x, pad_for_structure_detection)
    183     if ocr_tokens is None:
    184         logger.warning(
    185             "Table OCR from get_tokens method will be deprecated. "
    186             "In the future the OCR tokens are expected to be passed in.",
    187         )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/unstructured_inference/models/tables.py:169, in UnstructuredTableTransformerModel.get_structure(self, x, pad_for_structure_detection)
    164 with torch.no_grad():
    165     logger.info(f"padding image by {pad_for_structure_detection} for structure detection")
    166     encoding = self.feature_extractor(
    167         pad_image_with_background_color(x, pad_for_structure_detection),
    168         return_tensors="pt",
--> 169     ).to(self.device)
    170     outputs_structure = self.model(**encoding)
    171 outputs_structure["pad_for_structure_detection"] = pad_for_structure_detection

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/transformers/feature_extraction_utils.py:231, in BatchFeature.to(self, *args, **kwargs)
    227 for k, v in self.items():
    228     # check if v is a floating point
    229     if torch.is_floating_point(v):
    230         # cast and send to device
--> 231         new_data[k] = v.to(*args, **kwargs)
    232     elif device is not None:
    233         new_data[k] = v.to(device=device)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
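
The error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that actually triggered it. A minimal sketch of doing this from a notebook, assuming the variable is set before CUDA is initialized (restarting the kernel and running this cell first is the safest way to ensure that):

```python
import os

# Must be set before the first CUDA call; in a notebook, restart the kernel
# and run this cell before anything imports torch or unstructured.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import the library only after the variable is set. The import is left
# commented out so the snippet also runs where unstructured is not installed.
# from unstructured.partition.pdf import partition_pdf

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

With synchronous launches the traceback should point closer to the operation that fails, at the cost of slower execution.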

Code:

    from typing import Any

    from pydantic import BaseModel
    from unstructured.partition.pdf import partition_pdf

    # Get elements
    raw_pdf_elements = partition_pdf(
        filename="docs/sample.pdf",
        # Unstructured first finds embedded image blocks
        extract_images_in_pdf=False,
        strategy="hi_res",
        # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
        # Titles are any sub-section of the document
        infer_table_structure=True,
        # Post processing to aggregate text once we have the title
        chunking_strategy="by_title",
        # Chunking params to aggregate text blocks
        # Attempt to create a new chunk 3800 chars
        # Attempt to keep chunks > 2000 chars
        max_characters=1000,
        new_after_n_chars=500,
        combine_text_under_n_chars=200,
        image_output_dir_path="docs/",
    )

How can we solve this error?

christinestraub commented 10 months ago

Hi @pranavbhat12, can you share the versions of unstructured and unstructured-inference you have installed, along with the file you are testing with (sample.pdf)?
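
One way to collect the requested versions is the standard library's importlib.metadata. The sketch below is only an illustration; the helper name installed_version is made up for this example, and torch and transformers are included because they appear in the traceback:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Packages that show up in the traceback above.
for name in ("unstructured", "unstructured-inference", "torch", "transformers"):
    print(f"{name}: {installed_version(name)}")
```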

pranavbhat12 commented 10 months ago

Thank you for reaching out!

I just ran !pip install "unstructured[all-docs]", and the PDF I used is attached. I also tried downgrading to version 0.11.5.

Attachment: HDFC_MF_Factsheet__July_2022.pdf

pranavbhat12 commented 10 months ago

This issue occurs mainly when strategy is set to "hi_res". According to the error, the problem stretches into the unstructured_inference library, in the table transformer code.

HardKothari commented 6 months ago

The issue seems to stem from unstructured > partition > pdf_image > ocr.py, line 274:

element.text_as_html = tables_agent.predict(cropped_image, ocr_tokens=table_tokens)

It seems that computing the text_as_html metadata is what fails.

For now, wrapping the call in a try/except that falls back to an empty string on error works as a temporary fix, but a permanent fix would be advisable.

  # HK:4/30/2024
  try:
      element.text_as_html = tables_agent.predict(cropped_image, ocr_tokens=table_tokens)
  except Exception:
      # Fall back to an empty string if table prediction fails
      element.text_as_html = ""
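
A variant of the workaround above that logs the failure instead of discarding it silently might look like the sketch below. The helper predict_html_or_blank is hypothetical and not part of unstructured; its predict argument stands in for tables_agent.predict, and a stub is used here so the example runs without the library:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def predict_html_or_blank(predict: Callable[..., str], image: Any, ocr_tokens: Any) -> str:
    """Call a table-prediction function and return "" if it raises.

    Hypothetical helper: `predict` stands in for tables_agent.predict.
    """
    try:
        return predict(image, ocr_tokens=ocr_tokens)
    except Exception:  # narrower than a bare `except:`, keeps Ctrl+C working
        logger.exception("Table structure prediction failed; using empty HTML")
        return ""


def failing_predict(image: Any, ocr_tokens: Any = None) -> str:
    """Stub that simulates a failing table model."""
    raise RuntimeError("simulated failure")


def working_predict(image: Any, ocr_tokens: Any = None) -> str:
    """Stub that simulates a successful table model."""
    return "<table></table>"


print(repr(predict_html_or_blank(failing_predict, None, [])))
print(repr(predict_html_or_blank(working_predict, None, [])))
```

Logging the exception keeps the traceback available for a bug report while the rest of the document is still partitioned.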

The actual error seems to occur in

unstructured_inference > models > tables.py, line 190:

prediction = recognize(outputs_structure, x, tokens=ocr_tokens)[0]

The recognize method seems to return an empty array, presumably because no tokens are derived.
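
If that hypothesis holds, indexing the result with [0] would raise an IndexError. A small sketch of guarding such an access follows; first_or_default is a made-up helper for illustration and is not code from unstructured_inference:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def first_or_default(items: Sequence[T], default: T) -> T:
    """Return the first item, or `default` when the sequence is empty."""
    # len() is used rather than truthiness so array-like inputs also work.
    return items[0] if len(items) > 0 else default


print(repr(first_or_default([], "")))
print(repr(first_or_default(["<table></table>"], "")))
```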

Help on fixing this would be appreciated.

Thanks

DeepKariaX commented 4 months ago

> This issue is mainly when setting strategy to "hi_res". As per the error, the problem stretches to the unstructured_inference library with the TableTransformers code.

Just a correction: this issue happens when setting infer_table_structure=True.
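
Since the failure is tied to infer_table_structure=True, one possible stopgap is to retry without table inference when the first attempt fails. The sketch below is an assumption-laden illustration: partition_with_table_fallback is a hypothetical helper, and a stub replaces partition_pdf so the example runs on its own. After a CUDA illegal-memory-access error the process may not be able to recover, so this is more likely to help with ordinary Python exceptions.

```python
from typing import Any, Callable, List


def partition_with_table_fallback(partition_fn: Callable[..., List[Any]], **kwargs: Any) -> List[Any]:
    """Try partitioning with table inference; on failure, retry without it.

    Hypothetical helper: `partition_fn` would be partition_pdf in real use.
    """
    try:
        return partition_fn(**{**kwargs, "infer_table_structure": True})
    except Exception as exc:
        print(f"Table inference failed ({exc!r}); retrying without it")
        return partition_fn(**{**kwargs, "infer_table_structure": False})


def fake_partition(**kwargs: Any) -> List[str]:
    """Stub that fails only when table inference is requested."""
    if kwargs.get("infer_table_structure"):
        raise ValueError("simulated table-inference failure")
    return ["element"]


elements = partition_with_table_fallback(fake_partition, filename="docs/sample.pdf", strategy="hi_res")
print(elements)
```

In real use the call would pass partition_pdf from unstructured.partition.pdf in place of the stub; the fallback result then simply lacks the text_as_html table metadata.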

Aarsh01 commented 4 months ago

After installing tesseract, I get an AttributeError:

AttributeError: 'tuple' object has no attribute 'tb_frame'

If anyone knows how to solve this problem, please reply as soon as possible.

christinestraub commented 4 months ago

Hi @pranavbhat12 @HardKothari @DeepKariaX @Aarsh01

This issue appears to be related to #3119 and should be resolved by the changes in PR #3130. I tried to reproduce it using unstructured version 0.14.5 and unstructured-inference version 0.7.33, but did not encounter any errors while partitioning the provided PDF document (HDFC_MF_Factsheet__July_2022.pdf).

If you are still experiencing errors, please share the specific PDF document that causes the problem, either as an attachment or as a link in your response, so that we can investigate further and identify the root cause.

Thank you for your cooperation and patience. We value your feedback and are committed to ensuring a smooth experience with our library.

DeepKariaX commented 4 months ago

@christinestraub I have updated the version and am now getting:

ValueError: max() arg is an empty sequence

raised in unstructured_inference/models/tables.py, line 667, in fill_cells:

table_rows_no = max({row for cell in cells for row in cell["row_nums"]})

Unfortunately, I cannot share the PDF. When I keep infer_table_structure=True I get this error; after removing the parameter, it works perfectly.
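
The ValueError is what Python's built-in max() raises for an empty iterable, which here would mean that no cells (or no row numbers) were detected. The sketch below reproduces that behavior and shows the default= guard; max_row_index is a hypothetical helper and not the library's fill_cells code:

```python
from typing import Any, Dict, List, Optional


def max_row_index(cells: List[Dict[str, Any]]) -> Optional[int]:
    """Return the largest row number across cells, or None if there are none."""
    return max((row for cell in cells for row in cell["row_nums"]), default=None)


# An empty list of cells reproduces the reported error with the unguarded call.
try:
    max({row for cell in [] for row in cell["row_nums"]})
except ValueError as exc:
    print("unguarded:", exc)

print("guarded, no cells:", max_row_index([]))
print("guarded, with cells:", max_row_index([{"row_nums": [0, 1]}, {"row_nums": [2]}]))
```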

christinestraub commented 4 months ago

Similar to https://github.com/Unstructured-IO/unstructured/issues/3252, closing this since it's assumed to be resolved, but feel free to reopen if you're still having this issue.