freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

Improvements to text extraction needed #186

Open flooie opened 5 months ago

flooie commented 5 months ago

The "needs OCR" function needs to be improved. Currently we do the following to determine whether an OCR-eligible document should be OCRed.

The Situation

# OCR when pdftotext extracted no text at all, or the PDF contains images.
if content.strip() == "" or pdf_has_images(path):
    return True

The content is generated by pdftotext, using this code:

import subprocess

# Run pdftotext, preserving layout and emitting UTF-8 on stdout.
process = subprocess.Popen(
    ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
    shell=False,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
)
content, err = process.communicate()
return content.decode(), err, process.returncode

Later, downstream on CL, we take the content and ask: are we sure we didn't need to OCR this? We do this:

for line in content.splitlines():
    line = line.strip()
    if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
        continue
    elif line:
        # We found a line with good content. No OCR needed.
        return False

# We arrive here if no line was found containing good content.
return True

Here we look for any line that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we are good to go.

Unfortunately, this leads to some seriously garbled plain text in RECAP, and potentially in our opinion DB.

Examples

I don't want to rag on pdftotext; it has done an admirable job for the most part, but I do not think it is the best way to approach what we are dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF objects, such as /Widget, /FreeText, /Stamp, and /Popup annotations. Although this is not an exhaustive list, we also see links and signatures, and I'm sure more types.
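
For a sense of scale, here is a minimal sketch of how one could inventory those annotation subtypes; pypdf and the helper name are illustrative assumptions, not what doctor runs today:

from collections import Counter

from pypdf import PdfReader

def annotation_subtypes(path: str) -> Counter:
    # Tally annotation subtypes (e.g. /Widget, /FreeText, /Stamp, /Popup)
    # across all pages of a PDF.
    counts: Counter = Counter()
    for page in PdfReader(path).pages:
        for ref in page.get("/Annots") or []:
            subtype = ref.get_object().get("/Subtype")
            if subtype:
                counts[str(subtype)] += 1
    return counts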

In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or, even worse, the first or last page being a rasterized page while the middle 30-odd pages are vector PDFs.

In that case, our checks fail and have no way to catch the problem, because once we iterate beyond the Bates stamp on page 2 we get good text. See: gov.uscourts.nysd.411264.100.0.pdf

The check also fails when, for example, a free-text widget that crosses out content or adds content is placed on an image-based PDF page.

Here is an example of a non-image PDF page containing a free-text widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.

Screenshot 2024-04-19 at 11 29 59 AM

This is not the perfect example, because the underlying content appears to contain text, but it is corrupted and looks like this:

Screenshot 2024-04-19 at 11 32 02 AM

In fact, see williams-v-t-mobile.

Side-by-side comparison of Williams v. T-Mobile:

Note that PROPOSED is incorrectly added to the text here, frustrating the adjustment made by the court, which is noted in the document itself.

Screenshot 2024-04-19 at 11 35 25 AM
Screenshot 2024-04-19 at 11 36 36 AM

Angled, Circular, and Sideways Text

Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.

For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One is stamped into an image PDF; the other is in a regular PDF, where it garbles the extracted text.

Screenshot 2024-04-19 at 11 48 19 AM
Screenshot 2024-04-19 at 11 47 37 AM

In both cases, the content that is generated makes the OCR test fail to identify that OCR is needed.

Sideways Text

We also run into a problem where pdftotext does an amazing job of figuring out the sideways text and writing it into the output. This is just a fancy thing some courts, and some firms, like to do.

Screenshot 2024-04-19 at 11 55 32 AM

But look at the result. It unnaturally expands the plain text and certainly frustrates plain-text searches.

Screenshot 2024-04-19 at 11 54 41 AM

This happens in this case and in others; see below.

Margin Text

Occasionally, margin text in a small font produces some weird creations in the text, which again cause extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you may be looking for.

Screenshot 2024-04-19 at 11 57 01 AM
Screenshot 2024-04-19 at 11 59 15 AM

Final complaint (Bates Stamps)

Bates stamps on every page are ingested into the content and don't reflect the document that was generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
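
As a rough illustration of the post-processing this implies, a line filter along these lines could strip those stamps (the pattern is my guess at the common CM/ECF header format, and strip_stamps is a made-up helper, not doctor code):

import re

# Matches header stamps such as:
#   "Case 1:21-cv-00123 Document 45 Filed 01/02/23 Page 3 of 10"
STAMP_RE = re.compile(r"^(Case|Appellate Case|Appeal|USCA).*Page \d+ of \d+")

def strip_stamps(text: str) -> str:
    return "\n".join(
        line
        for line in text.splitlines()
        if not STAMP_RE.match(line.strip())
    )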

What should we do?

If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.

We should (I think) drop pdftotext for, you guessed it, pdfplumber.

pdfplumber can sample PDFs better to determine whether an entire page is likely an image, while correctly recognizing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.
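
To make that concrete, here is a minimal sketch of the per-page check pdfplumber makes possible; the 50% image-coverage threshold and both function names are my assumptions, not a settled design:

import pdfplumber

def page_needs_ocr(page) -> bool:
    # No extractable text is the strongest OCR signal.
    if (page.extract_text() or "").strip():
        return False
    # Estimate how much of the page the embedded images cover.
    page_area = float(page.width) * float(page.height)
    image_area = sum(
        float(img["x1"] - img["x0"]) * float(img["bottom"] - img["top"])
        for img in page.images
    )
    return image_area / page_area > 0.5  # assumed threshold

def pdf_needs_ocr(path: str) -> bool:
    with pdfplumber.open(path) as pdf:
        return any(page_needs_ocr(page) for page in pdf.pages)

Because the check runs per page, the rasterized-first-page case described above would be caught even when the middle 30 pages extract cleanly.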

We should also drop the check in CL and make all of these assessments here in doctor.

Solutions coming in the next post.

mlissner commented 5 months ago

Solutions coming in the next post.

I think this sounds really hard, and before we build something with pdfplumber, we should look into doing "structural PDF extraction," which I think has come a long way since I grabbed pdftotext out of the "apt store."

Can you survey what kinds of tools are already out there and see if there are ones that would already work for us before we go down what I think is going to be a very scary road? I'm pretty terrified that the corner cases on this issue could bog you down for months and still not be nailed down.

halfprice06 commented 4 months ago

@flooie @mlissner

I think some inspiration can maybe be taken from the following repo:

https://github.com/VikParuchuri/surya

Vik has been doing some awesome work on OCR and document structure recognition, reading order, etc.

His twitter:

https://twitter.com/VikParuchuri

flooie commented 4 months ago

Here is a list of improvements we should immediately see in the text we produce. I want to document the types of issues we will hopefully improve upon and which our users will ultimately appreciate.

Example 1: Screenshot 2024-05-14 at 4 50 00 PM

We've greatly improved the indentation and spacing of the text we get from Tesseract. This example from Vermont illustrates that we can more or less retain the basic structure of the first page. This is useful for numerous reasons:

  1. We keep the ")" line from interfering with the text and can instantly parse out the plaintiff and defendant.
  2. The case number doesn't get smushed into the "v." or other text on the page, which will allow for better parsing after the fact.
  3. The stamp in the upper-right-hand corner won't be left-aligned into the content of the page. This is often a problem in our documents, and fixing it will also improve post-import parsing for us and our users.
  4. Our use of confidence scores to review the OCR removes artifacts like "nenenenene" from the page, which in this case is a line. (See the sketch after this list.)
  5. The page correctly spaces and indents new paragraphs. (More on that in the next posts.)
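
For what it's worth, here is a minimal sketch of that confidence filter, assuming pytesseract; the threshold of 40 and the function name are illustrative, not doctor's actual values:

from PIL import Image
import pytesseract
from pytesseract import Output

def extract_confident_text(image_path: str, min_conf: float = 40.0) -> str:
    # Ask Tesseract for per-word confidences alongside the text.
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=Output.DICT
    )
    # Keep only words Tesseract is reasonably sure about; low-confidence
    # "words" are where artifacts like "nenenenene" tend to come from.
    return " ".join(
        word
        for word, conf in zip(data["text"], data["conf"])
        if word.strip() and float(conf) >= min_conf
    )
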
flooie commented 4 months ago

Pages with line numbers have historically been a huge challenge for Tesseract (or for us).

Screenshot 2024-05-14 at 4 56 14 PM

Here is a random document, from Wisconsin I think; look how we have now properly identified and linked content across the page. Not only is this correct, it also reduces the overall length of the document by nearly 50%.

flooie commented 4 months ago

Focusing on newlines and paragraphs turns this once unreadable, or at least difficult to parse, text into a beauty.

We can now clearly see (and hopefully parse out) the block text and long quotes more easily. You can really see how the lack of consistent vertical spacing would make this hard to digest for a user or a computer.

Screenshot 2024-05-14 at 4 59 30 PM
flooie commented 4 months ago

Furthermore, we should now be able to reprocess these and identify footnotes with reasonable confidence, based on line length: smaller-font footnotes stand out because their lines end up longer than the rest of the content. (A toy version of that heuristic follows the screenshot below.)

Screenshot 2024-05-14 at 5 07 32 PM
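
Something like this, assuming the OCR output is already split into lines; the 1.15 ratio and the function name are invented for illustration:

import statistics

def flag_footnote_lines(lines: list[str], ratio: float = 1.15) -> list[bool]:
    # Compare each line's character count to the page median; text set in
    # a smaller font fits noticeably more characters per line.
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return [False] * len(lines)
    median = statistics.median(lengths)
    return [len(line) > ratio * median for line in lines]
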
flooie commented 4 months ago

Lastly, we implemented square white boxes, used where we think something is not an artifact but we can't get above 10% confidence about what it is. In this case we failed because the stamp is fuzzy, skewed, and partially covered in handwriting.

Screenshot 2024-05-14 at 5 11 34 PM

This can and should be used as an indicator of overall quality.

flooie commented 4 months ago

@mlissner here are some of the improvements for you.

mlissner commented 4 months ago

Those are some very nice improvements!

flooie commented 4 months ago

One more push coming momentarily with the last few changes.

flooie commented 4 months ago

Lastly, I was able to finish implementing a smoothing-out of the case-caption lines on the first page, which I think produces some professional-looking OCR. It only works for a few symbols common in the bigger courts, and it doesn't handle horizontal lines, but I think we can eventually reprocess those.

Screenshot 2024-05-15 at 2 00 49 PM
Screenshot 2024-05-15 at 2 00 55 PM
Screenshot 2024-05-15 at 2 01 01 PM
Screenshot 2024-05-15 at 2 01 22 PM
Screenshot 2024-05-15 at 2 01 17 PM
Screenshot 2024-05-15 at 2 01 13 PM
Screenshot 2024-05-15 at 2 01 06 PM
mlissner commented 5 days ago

@flooie, this is still open. Can you please figure out what's left to do here and make a comment or new issue if anything (and close otherwise)?