Upgrade document extraction

flooie commented 5 months ago

@mlissner

This PR is meant to improve the extraction of text from PDFs by using a few additional simple rules to decide if text is extracted appropriately.

Those rules check for:

Widget or free-text annotations (previously these could lead to an incorrect representation of the document).
Images larger than 10% of the page. The threshold is meant to exclude tiny images and images of lines.
Gibberish text caused by unusual text embedding or missing fonts.
Documents with fewer than 10 words per page on average.

Error strings were added for each of these reasons, and they should be used on the CL side to check whether OCR is needed. Previously we could identify a document as needing OCR but still return the text nonetheless, so pages could be missed and CL wouldn't be aware that it might want to OCR the document.
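To make the rules concrete, here is a minimal sketch of how such checks might map to error strings. It is not the code in this PR: `PageStats`, `looks_like_gibberish`, `ocr_reason`, the error strings themselves, and the gibberish heuristic are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a type from doctor."""
    width: float
    height: float
    text: str = ""
    image_areas: list = field(default_factory=list)  # area of each image, in points^2
    has_annotations: bool = False  # widget / free-text annotations present


def looks_like_gibberish(text: str, min_ratio: float = 0.6) -> bool:
    """Crude stand-in heuristic (an assumption): flag text where too few
    non-space characters are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    alnum = sum(c.isalnum() for c in chars)
    return alnum / len(chars) < min_ratio


def ocr_reason(pages: list) -> str:
    """Return an error string for the first rule that fires, or "" if none does."""
    for page in pages:
        if page.has_annotations:
            return "annotations present"
        page_area = page.width * page.height
        # Only images covering more than 10% of the page count.
        if any(area > 0.10 * page_area for area in page.image_areas):
            return "large image on page"
        if looks_like_gibberish(page.text):
            return "gibberish text"
    total_words = sum(len(p.text.split()) for p in pages)
    # An empty document is treated as having zero words per page.
    if not pages or total_words / len(pages) < 10:
        return "too few words per page"
    return ""
```

A caller on the CL side could then treat any non-empty string as a signal to OCR the document, rather than trusting the returned text.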

Additionally, an optional boolean flag, strip_margins, has been added. It can be used to crop out the 1-inch margins that are required for court opinions, as well as the skewed stamp text we see in some courts. The goal is for the extracted text to represent only the text of the opinion.
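As a rough illustration of the cropping idea, the sketch below computes the box that remains after removing a 1-inch margin, assuming the default PDF unit of 72 points per inch. The helper names `margin_bbox` and `inside` are hypothetical and not taken from this PR. With pdfplumber, a box in this `(x0, top, x1, bottom)` form could be passed to `page.crop(...)`.

```python
POINTS_PER_INCH = 72  # PDF user space defaults to 72 units per inch


def margin_bbox(width: float, height: float, margin_in: float = 1.0):
    """Return (x0, top, x1, bottom) for the page area inside the margins."""
    m = margin_in * POINTS_PER_INCH
    if width <= 2 * m or height <= 2 * m:
        raise ValueError("page is too small for the requested margin")
    return (m, m, width - m, height - m)


def inside(bbox, x: float, y: float) -> bool:
    """Check whether a point (for example, a word's origin) falls inside the box."""
    x0, top, x1, bottom = bbox
    return x0 <= x <= x1 and top <= y <= bottom


# US Letter page: 8.5 x 11 inches = 612 x 792 points.
letter_box = margin_bbox(612, 792)
```

Text whose position falls outside `letter_box`, such as a stamp in the margin, would be dropped from the extracted output.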

Tests were updated for the PDF changes, and a variety of difficult PDFs were added.

Finally, a change in lxml's HTML cleaning was addressed by adding `lxml_html_clean`.
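For context, recent lxml releases moved `lxml.html.clean` into the separate `lxml_html_clean` package. One way code can tolerate both layouts is an import fallback like the sketch below. The helper `load_cleaner_class` is a hypothetical name, and this is not necessarily how the PR handles it.

```python
import importlib


def load_cleaner_class(module_names=("lxml_html_clean", "lxml.html.clean")):
    """Return the first importable `Cleaner` class, or None if none is available."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module is missing, or lxml is new enough that the old path is gone.
            continue
        cleaner = getattr(module, "Cleaner", None)
        if cleaner is not None:
            return cleaner
    return None
```

Adding `lxml_html_clean` as an explicit dependency, as described above, avoids the need for a fallback at all.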

flooie commented 5 months ago

Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?

I also worry that if we move to stripping margins by default, it will cause trouble down the road when we remove more than we want, for example on a scanned document where the scan is off center. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?

I have to go back through the rest of your comments, and I will provide some sample output, but I wanted to address a few things first.

strip_margins is set to false by default.
strip_margins only applies to good PDFs whose text can be extracted without OCR. As you rightly point out, we don't want to strip or crop out the margins in a scan, because the margins could include actual content in the image. It only works for content extracted with pdfplumber.

mlissner commented 4 months ago

I'm chatting with a customer now that values doctor for its high-speed text extraction. Could we keep pdftotext in this PR, and have a v2 text extractor that has all your improvements?

flooie commented 4 months ago

I heavily simplified the code and created (or am creating) a new PR for it, so I'm closing this PR.

freelawproject / doctor

Upgrade document extraction #187