Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

UnboundLocalError: local variable 'uri' referenced before assignment #1686

Closed sentry-io[bot] closed 1 year ago

sentry-io[bot] commented 1 year ago

Some users are hitting the following error in partition_pdf.

UnboundLocalError: local variable 'uri' referenced before assignment

  File "unstructured/partition/pdf.py", line 155, in partition_pdf
    return partition_pdf_or_image(
  File "unstructured/partition/pdf.py", line 252, in partition_pdf_or_image
    extracted_elements = extractable_elements(
  File "unstructured/partition/pdf.py", line 178, in extractable_elements
    return _partition_pdf_with_pdfminer(
  File "unstructured/utils.py", line 159, in wrapper
    return func(*args, **kwargs)
  File "unstructured/partition/pdf.py", line 440, in _partition_pdf_with_pdfminer
    elements = _process_pdfminer_pages(
  File "unstructured/partition/pdf.py", line 500, in _process_pdfminer_pages
    annotation_list = get_uris(page.annots, height, coordinate_system, i + 1)
  File "unstructured/partition/pdf.py", line 892, in get_uris
    return get_uris_from_annots(annots, height, coordinate_system, page_number)
  File "unstructured/partition/pdf.py", line 946, in get_uris_from_annots
    "uri": uri,

awalker4 commented 1 year ago

Here's the bad annotation:

{
    A: {
        D: b'M9.72144.TitleChapter.Chapter.Title',
        F: {},
        S: /'GoToR'
    },
    Border: [0, 0, 0],
    Rect: [240.24, 678.36, 294.96, 691.86],
    Subtype: /'Link',
    Type: /'Annot'
}

We need to initialize uri in get_uris_from_annots so that it is defined even when we don't see one of the expected URI types. Separately, what is GoToR, and should we add a case for it?
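To illustrate why the error appears, here is a minimal sketch of the failure mode. The function resolve_uri_buggy is a hypothetical stand-in, not the library's code; it only assumes that uri is assigned inside type-specific branches and in the except handler, and nowhere else.

```python
def resolve_uri_buggy(uri_type, uri_dict):
    """Hypothetical stand-in for the branch logic; not the library's code."""
    try:
        if uri_type == "/'URI'":
            uri = uri_dict["URI"].decode("utf-8")
        if uri_type == "/'GoTo'":
            uri = uri_dict["D"].decode("utf-8")
    except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
        uri = None
    # If no branch matched and nothing raised, uri was never bound.
    return {"uri": uri}


# The action dictionary from the bad annotation, with names as plain strings.
bad_action = {"D": b"M9.72144.TitleChapter.Chapter.Title", "F": {}, "S": "/'GoToR'"}

error_name = None
try:
    resolve_uri_buggy(bad_action["S"], bad_action)
except UnboundLocalError as exc:
    error_name = type(exc).__name__
```

With the GoToR action, neither branch runs and no exception is raised inside the try, so the return statement reads a name that was never assigned.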

Klaijan commented 1 year ago

Agree on the initialization.

I just checked the PDF docs: GoToR (a remote go-to action) points to a destination in another PDF file. In that case, I assume we could include the metadata, but we would not be able to follow the link into the other PDF.
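As a rough idea of what "include the metadata" could look like, here is a sketch that records the destination and file specification of a GoToR action without trying to resolve the link. The function describe_gotor and its output keys are hypothetical, and the action is assumed to be a plain dict with the D and F keys seen in the annotation above.

```python
def describe_gotor(action):
    """Hypothetical helper: summarize a GoToR action without following it."""
    dest = action.get("D")
    if isinstance(dest, bytes):
        # Destination names arrive as bytes; replace undecodable bytes.
        dest = dest.decode("utf-8", errors="replace")
    return {
        "uri_type": "GoToR",
        "destination": dest,
        # An empty file specification is reported as None.
        "file": action.get("F") or None,
    }


bad_action = {"D": b"M9.72144.TitleChapter.Chapter.Title", "F": {}, "S": "/'GoToR'"}
summary = describe_gotor(bad_action)
```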

Klaijan commented 1 year ago

I do see that uri falls back to None if it does not match any of the types (here https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/pdf.py#L968). I believe that should be enough to avoid the error. I suspect we are not catching all the exceptions here https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/pdf.py#L967. What do you think?

awalker4 commented 1 year ago

I think I would do something like this to be safe. If uri_type is not one of these, we'll fall through without hitting the except, so uri never gets assigned.

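# Default so that unrecognized action types (e.g. GoToR) still leave uri defined.
uri = None
try:
    if uri_type == "/'URI'":
        uri = try_resolve(try_resolve(uri_dict["URI"])).decode("utf-8")
    elif uri_type == "/'GoTo'":
        uri = try_resolve(try_resolve(uri_dict["D"])).decode("utf-8")
except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
    # Malformed or missing entries keep the None default.
    pass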
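
For anyone who wants to check the behavior in isolation, here is a self-contained sketch of the same idea. The wrapper function resolve_uri is hypothetical, and try_resolve is replaced by a pass-through stub, since the real helper resolves indirect PDF objects.

```python
def try_resolve(value):
    """Pass-through stub (assumption); the real helper resolves PDF object references."""
    return value


def resolve_uri(uri_type, uri_dict):
    """Hypothetical wrapper around the proposed fix."""
    uri = None  # defined up front, so every code path returns a bound name
    try:
        if uri_type == "/'URI'":
            uri = try_resolve(try_resolve(uri_dict["URI"])).decode("utf-8")
        elif uri_type == "/'GoTo'":
            uri = try_resolve(try_resolve(uri_dict["D"])).decode("utf-8")
    except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
        pass
    return uri


bad_action = {"D": b"M9.72144.TitleChapter.Chapter.Title", "F": {}, "S": "/'GoToR'"}
gotor_result = resolve_uri(bad_action["S"], bad_action)
goto_result = resolve_uri("/'GoTo'", {"D": b"chapter-1"})
```

Here gotor_result is None instead of an UnboundLocalError, while the recognized GoTo case still decodes its destination.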