Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/partition-pdf-with-infer_table_structure #3252

Closed DeepKariaX closed 3 months ago

DeepKariaX commented 3 months ago

Describe the bug: `partition_pdf` raises `ValueError: max() arg is an empty sequence` when partitioning a PDF. When I keep the `infer_table_structure=True` parameter it gives this error, and after removing the parameter it works perfectly.

The failing code (traceback excerpt):

```
  File "unstructured_inference/models/tables.py", line 667, in fill_cells
    table_rows_no = max({row for cell in cells for row in cell["row_nums"]})
```

Expected behavior: even with `infer_table_structure=True`, it should be able to partition the PDF without errors. (Maybe add error handling for the case where no table cells are found.)
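For reference, `max()` raises exactly this error on any empty iterable. A minimal sketch of the failure mode and one possible guard, using Python's `default=` keyword (not necessarily how the library should fix it):

```python
cells = []  # table inference can return no detected cells for some PDFs

# Reproduces the reported failure:
#   max({row for cell in cells for row in cell["row_nums"]})
#   ValueError: max() arg is an empty sequence

# One defensive option: supply a default for the empty case.
table_rows_no = max(
    (row for cell in cells for row in cell.get("row_nums", [])),
    default=0,
)
print(table_rows_no)  # prints 0 when no cells were detected
```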

vav1lo commented 3 months ago

We're hitting the same issue too. Is there any solution?

DeepKariaX commented 3 months ago

@vav1lo Currently, I have switched to another reader. Also, can you attach the PDF you are testing? Mine is a bit confidential to share, and with a sample PDF it would be easier for them to diagnose the error.

christinestraub commented 3 months ago

Hi @vav1lo, can you please attach the PDF that you are testing?

hackpointt commented 3 months ago

Same problem with this file: uber_10q_march_2022.pdf

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

filename = "uber_10q_march_2022.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)
```
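For completeness, the elements can then be serialized with the imported `elements_to_json` helper (the output filename here is just an example):

```python
# Write the parsed elements to JSON for inspection (example output path).
elements_to_json(elements, filename="uber_10q_march_2022.json")
```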

hackpointt commented 3 months ago

@christinestraub

vav1lo commented 3 months ago

@christinestraub Here is the PDF I am testing: 1b4c03d6-f6f5-462d-8bd6-0b9e411bc33d.pdf

Nidhi2497 commented 3 months ago

I am also getting the error while partitioning a PDF, and it occurs specifically with the `infer_table_structure=True` argument:

```
      9 import torch
     10 import transformers
---> 11 from cv2.typing import MatLike
     12 from PIL.Image import Image
     13 from transformers import DonutProcessor, VisionEncoderDecoderModel

ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package
```

vav1lo commented 3 months ago

> I am also getting the error while partitioning a PDF, and it occurs specifically with the `infer_table_structure=True` argument: `ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package`

I think this has to do with the opencv installation.
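One quick way to check what `cv2` actually resolves to; the "'cv2' is not a package" part of the error usually means Python found a bare module (e.g. a stray cv2.py on the path) instead of the opencv-python package. A diagnostic sketch:

```python
import cv2

# Where is cv2 coming from? A healthy install points into site-packages/cv2/.
print(cv2.__file__)

# cv2.typing only ships with newer opencv-python releases (4.8+, as far as I know).
print(cv2.__version__)
```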

nikklavzar commented 3 months ago

> We're hitting the same issue too. Is there any solution?

This started happening to me when I upgraded from 0.12.6 to 0.14.6.

Nidhi2497 commented 3 months ago

> I think this has to do with the opencv installation

I installed it as well, but what is being imported there actually needs to be changed.

christinestraub commented 3 months ago

Hi @DeepKariaX, @vav1lo, @hackpointt, @Nidhi2497, @nikklavzar

Addressed in https://github.com/Unstructured-IO/unstructured-inference/pull/359. You'll need to upgrade unstructured-inference to 0.7.36. I tested your code with the provided PDF documents and it worked as expected.
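To confirm the upgrade took effect, a quick check (assuming a pip-managed environment):

```python
# After upgrading, e.g. with: pip install -U "unstructured-inference>=0.7.36"
from importlib.metadata import version

print(version("unstructured-inference"))  # expect 0.7.36 or newer
```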

christinestraub commented 3 months ago

Closing this since it's assumed to be resolved, but feel free to reopen if you're still having this issue.

DeepKariaX commented 2 months ago

@christinestraub This is resolved, thanks!