Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/misclassification of Title Element when using "fast" for PDF parsing #550

Closed cragwolfe closed 9 months ago

cragwolfe commented 1 year ago

Describe the bug

From community Slack:

One major issue I am facing is in many cases last sentence of a paragraph is getting classified as Title instead of NarrativeText or Text.

Examples:

Paragraph: So I think your understanding is fine, it is not going to have an impact either the price rise or price drop in any of the intermediate chemicals.

is_possible_title("drop in any of the intermediate chemicals.") # returns True

Paragraph: With respect to the guarantees have the guarantees given to Inox Group basically been fixed or are we in the process on that?

is_possible_title("the process on that?") # returns True

Maybe a check like text should not start with lowercase letter for it to be classified as a Title would be helpful

Additional context

Reported in Slack
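
The lowercase check suggested in the report could be sketched roughly as follows. This is only an illustration, not the library's implementation: `starts_with_lowercase` and `with_lowercase_guard` are hypothetical names, and `base_check` stands in for whatever title heuristic is already in use (for example `is_possible_title`).

```python
from typing import Callable


def starts_with_lowercase(text: str) -> bool:
    """Return True if the first alphabetic character in text is lowercase."""
    for char in text:
        if char.isalpha():
            return char.islower()
    # No letters at all (e.g. "1.2.3"): nothing to reject on this basis.
    return False


def with_lowercase_guard(base_check: Callable[[str], bool]) -> Callable[[str], bool]:
    """Wrap an existing title heuristic so lowercase-initial text is never a title.

    base_check is an assumption: any callable that takes the text and
    returns True when it looks like a title.
    """

    def guarded(text: str) -> bool:
        if starts_with_lowercase(text):
            return False
        return base_check(text)

    return guarded


if __name__ == "__main__":
    # Stand-in heuristic that accepts everything, for demonstration only.
    check = with_lowercase_guard(lambda text: True)
    for sample in (
        "drop in any of the intermediate chemicals.",
        "the process on that?",
        "Describe the bug",
    ):
        print(repr(sample), check(sample))
```

One trade-off of a guard like this is that legitimate headings beginning with a lowercase letter (some brand or product names, for instance) would also be rejected, so it may need to be configurable.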

cragwolfe commented 1 year ago

There are plenty of occurrences of this in the outputs from this PDF.

jq '.[] | select(.type == "Title") | .text' PLAW-107publ56.pdf.json | grep -P '^"[a-z]+'
"of Investigation."
"gencies."
"to terrorism."
"to computer fraud and abuse offenses."
"agents of a foreign power."
"limb."
"lance Act."
"trace devices."
"transactions of primary money laundering concern."
"counts."
"banks."
"crimes, and the finances of terrorist groups."
"ment references."
"vestment company study."
"business."
"ports of entry and overseas consular posts."
"view."
"safety officers."
"systems."
"rorism."
"sponse to Government requests."
"ligence under National Security Act of 1947."
"ligence and intelligence"
"related matters."
"eign intelligence."
"for bioterrorism preparedness and response."
...
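
For anyone without jq at hand, the same filter can be sketched in Python. It assumes, as the jq expression implies, that the JSON output is a top-level array of objects with `type` and `text` keys; the function name `lowercase_titles` is hypothetical.

```python
import json
import re
from typing import Any, Dict, Iterable, List

# Mirrors grep -P '^"[a-z]+': text that begins with an ASCII lowercase letter.
LOWERCASE_START = re.compile(r"[a-z]")


def lowercase_titles(elements: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the text of every Title element that starts with a lowercase letter."""
    matches = []
    for element in elements:
        text = element.get("text")
        if element.get("type") != "Title" or not isinstance(text, str):
            continue
        if LOWERCASE_START.match(text):
            matches.append(text)
    return matches


if __name__ == "__main__":
    # Small inline sample; the first two texts come from the output above,
    # the remaining entries are made up for illustration.
    sample = json.loads(
        """
        [
          {"type": "Title", "text": "of Investigation."},
          {"type": "Title", "text": "gencies."},
          {"type": "Title", "text": "An Uppercase Heading"},
          {"type": "NarrativeText", "text": "lowercase but not a Title"}
        ]
        """
    )
    for text in lowercase_titles(sample):
        print(json.dumps(text))
```

To run it against a real output file, load the file with `json.load` and pass the resulting list to `lowercase_titles`.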