Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

(partition_pdf) TypeError: unsupported operand type(s) for -: 'int' and 'NoneType' #1651

Closed sentry-io[bot] closed 1 year ago

sentry-io[bot] commented 1 year ago

API users are hitting the following error. The cause is that a word's bbox can contain None values (see the example input below), so the subtraction in map_bbox_and_index fails.

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

  File "prepline_general/api/general.py", line 396, in pipeline_api
    elements = partition(
  File "unstructured/partition/auto.py", line 316, in partition
    elements = _partition_pdf(
  File "unstructured/documents/elements.py", line 276, in wrapper
    elements = func(*args, **kwargs)
  File "unstructured/file_utils/filetype.py", line 551, in wrapper
    elements = func(*args, **kwargs)
  File "unstructured/chunking/title.py", line 211, in wrapper
    elements = func(*args, **kwargs)
  File "unstructured/partition/pdf.py", line 148, in partition_pdf
    return partition_pdf_or_image(
  File "unstructured/partition/pdf.py", line 245, in partition_pdf_or_image
    extracted_elements = extractable_elements(
  File "unstructured/partition/pdf.py", line 171, in extractable_elements
    return _partition_pdf_with_pdfminer(
  File "unstructured/utils.py", line 159, in wrapper
    return func(*args, **kwargs)
  File "unstructured/partition/pdf.py", line 433, in _partition_pdf_with_pdfminer
    elements = _process_pdfminer_pages(
  File "unstructured/partition/pdf.py", line 509, in _process_pdfminer_pages
    urls_metadata.append(map_bbox_and_index(words, annot))
  File "unstructured/partition/pdf.py", line 1033, in map_bbox_and_index
    (annot["bbox"][0] - np.array([word["bbox"][0] for word in words])) ** 2
[
  {bbox: [None, None, None, None], start_index: 0, text: ''},
  {bbox: [None, None, None, None], start_index: 0, text: ''},
  {bbox: [85.68359081985024, 123.67545612685615, 105.5351524327018, 135.6754556268562], start_index: 4, text: 'wfp'},
  ...

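For anyone trying to reproduce this outside the API, here is a minimal sketch of one possible guard. It is not the fix that landed in the library. `has_complete_bbox` and `closest_word_index` are hypothetical helpers, the annotation values are made up, and the squared-distance comparison on the bbox corner is only assumed from the last traceback frame. It uses plain Python rather than NumPy to stay self-contained.

```python
from typing import Optional


def has_complete_bbox(item: dict) -> bool:
    """Return True only if the item has a 4-value bbox with no None entries."""
    bbox = item.get("bbox")
    return bbox is not None and len(bbox) == 4 and all(v is not None for v in bbox)


def closest_word_index(words: list, annot: dict) -> Optional[int]:
    """Hypothetical helper: index into `words` of the word whose bbox corner
    (x0, y0) is nearest to the annotation's, skipping incomplete bboxes."""
    if not has_complete_bbox(annot):
        return None
    ax, ay = annot["bbox"][0], annot["bbox"][1]
    best_index, best_dist = None, None
    for i, word in enumerate(words):
        if not has_complete_bbox(word):
            # Without this check, `ax - None` raises the reported TypeError.
            continue
        dist = (ax - word["bbox"][0]) ** 2 + (ay - word["bbox"][1]) ** 2
        if best_dist is None or dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


# Words copied from the example input above.
words = [
    {"bbox": [None, None, None, None], "start_index": 0, "text": ""},
    {"bbox": [None, None, None, None], "start_index": 0, "text": ""},
    {
        "bbox": [85.68359081985024, 123.67545612685615,
                 105.5351524327018, 135.6754556268562],
        "start_index": 4,
        "text": "wfp",
    },
]
annot = {"bbox": [85, 123, 106, 136]}  # made-up annotation, for illustration only

print(closest_word_index(words, annot))
```

Skipping the empty words is only one option. Whether such entries should be dropped earlier, when the words list is built, is a separate design question this sketch does not settle.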
yuming-long commented 1 year ago

Also found this error while debugging: https://github.com/Unstructured-IO/unstructured/issues/1663