Here's the bad annotation:
{
    A: {
        D: b'M9.72144.TitleChapter.Chapter.Title',
        F: {},
        S: /'GoToR'
    },
    Border: [
        0,
        0,
        0
    ],
    Rect: [
        240.24,
        678.36,
        294.96,
        691.86
    ],
    Subtype: /'Link',
    Type: /'Annot'
}
We need to initialize `uri` here if we don't see one of the expected uri types. Separately, what is `GoToR` and should we add a case for it?
Agree on the initialization.
I just checked the docs: `GoToR` points to a destination in another PDF file. In that case, I assume we could include the metadata, but we would not be able to link to the other PDF.
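To make the "include the metadata, but no link" idea concrete, here is a rough sketch. It is not the library's code: `describe_link_action` and the `remote_destination` key are hypothetical names, and the action is modeled as a plain dict with string type names instead of resolved pdfminer objects.

```python
from typing import Any, Dict


def describe_link_action(action: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: summarize a link action without ever raising."""
    action_type = str(action.get("S"))
    result = {"type": action_type, "uri": None, "remote_destination": None}

    def decode(value: Any):
        # Only bytes can be decoded; anything else is treated as "no value".
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
        return None

    if action_type == "/'URI'":
        result["uri"] = decode(action.get("URI"))
    elif action_type == "/'GoTo'":
        result["uri"] = decode(action.get("D"))
    elif action_type == "/'GoToR'":
        # Remote go-to: the target lives in another file, so keep the
        # destination name as metadata only and leave "uri" unset.
        result["remote_destination"] = decode(action.get("D"))
    return result


# The action from the bad annotation above, with "S" written as a plain string.
bad_action = {"D": b"M9.72144.TitleChapter.Chapter.Title", "F": {}, "S": "/'GoToR'"}
summary = describe_link_action(bad_action)
```

With this shape, an unknown action type still yields a result with `uri` set to `None` rather than an exception.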
I do see that `uri` falls back to `None` if it doesn't match any of the types (here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/pdf.py#L968). I believe that should be enough to avoid the error. I suspect we are not catching all the exceptions here: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/pdf.py#L967. What do you think?
I think I would do something like this to be safe. If `uri_type` is not one of these, we'll fall through without hitting the `except`.
# Bind uri up front so it exists even for unhandled types such as GoToR.
uri = None
try:
    if uri_type == "/'URI'":
        uri = try_resolve(try_resolve(uri_dict["URI"])).decode("utf-8")
    if uri_type == "/'GoTo'":
        uri = try_resolve(try_resolve(uri_dict["D"])).decode("utf-8")
except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
    pass
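For anyone who wants to try the suggestion in isolation, here is a self-contained sketch of it. The `extract_uri` wrapper is a hypothetical name, and `try_resolve` is replaced with an identity stand-in, on the assumption that the values passed in are already resolved.

```python
def try_resolve(obj):
    """Stand-in for the real helper; assumed to return resolved objects unchanged."""
    return obj


def extract_uri(uri_type, uri_dict):
    """Hypothetical wrapper around the suggested snippet."""
    uri = None  # always bound, whatever the action type turns out to be
    try:
        if uri_type == "/'URI'":
            uri = try_resolve(try_resolve(uri_dict["URI"])).decode("utf-8")
        elif uri_type == "/'GoTo'":
            uri = try_resolve(try_resolve(uri_dict["D"])).decode("utf-8")
    except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
        pass
    return uri


# An unhandled type (GoToR) skips both branches and returns the default.
gotor_result = extract_uri("/'GoToR'", {"D": b"M9.72144.TitleChapter.Chapter.Title"})
# A handled type with a missing key hits the except and also returns the default.
missing_key_result = extract_uri("/'URI'", {})
# A handled type with valid bytes is decoded.
uri_result = extract_uri("/'URI'", {"URI": b"https://example.com"})
```

The early `uri = None` is what covers the unhandled-type case; the `except` only covers failures inside a branch that was actually entered.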
Some users are hitting the following error in partition_pdf.
UnboundLocalError: local variable 'uri' referenced before assignment
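A minimal reproduction of this class of error, as a sketch only. It assumes the failing code assigned `uri` solely inside the type branches and the `except` clause; `resolve_uri_buggy` is a hypothetical name, not the function in `pdf.py`.

```python
def resolve_uri_buggy(uri_type, uri_dict):
    """Hypothetical reduction of the failing pattern: uri has no default."""
    try:
        if uri_type == "/'URI'":
            uri = uri_dict["URI"].decode("utf-8")
        if uri_type == "/'GoTo'":
            uri = uri_dict["D"].decode("utf-8")
    except (KeyError, AttributeError, TypeError, UnicodeDecodeError):
        uri = None
    # For an unhandled type, neither branch nor the except ran, so uri is unbound.
    return uri


try:
    resolve_uri_buggy("/'GoToR'", {"D": b"M9.72144.TitleChapter.Chapter.Title"})
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
```

No exception is raised inside the `try` for a `GoToR` action, so the `except` never assigns `uri`, and the failure only surfaces at the first read.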