freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

PACER OCR test regex needs expansion (and OCR needs to be completed for excluded docs) #598

Open mlissner opened 8 years ago

mlissner commented 8 years ago

We have a super simple OCR system right now for PACER docs. Basically, we extract the text using pdftotext, and then remove any line that starts with "Case", which removes the headers on most documents:

Case 2:06-cv-00376-SRW Document 1-2 Filed 04/25/2006 Page 1 of 1

If there's still text remaining, we assume OCR isn't needed.

Alas, some documents (mostly in bankruptcy courts) have two-line headers that look more like:

Case 2:16-bk-24364-NB Doc 1 Filed 10/31/16 Entered 10/31/16 11:52:49 Desc
Main Document Page 1 of 9

About 50,000 of these items need OCR but didn't get it, so we need to make the OCR testing algorithm a little better, and we need to run OCR on these items.
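
To make the failure concrete, here's a minimal sketch of the check as described above. It is not the project's actual code: the name `needs_ocr_simple` is hypothetical, and the sample text assumes pdftotext emits the bankruptcy header as two lines, with the second one starting with "Main Document".

```python
def needs_ocr_simple(content: str) -> bool:
    """Hypothetical sketch of the current check: drop lines that start with
    "Case", then see whether any text is left."""
    for line in content.splitlines():
        line = line.strip()
        if line and not line.startswith("Case"):
            # Some non-header text survived, so assume a text layer exists.
            return False
    # Nothing but "Case ..." headers (or blank lines) was found.
    return True


one_line_header = "Case 2:06-cv-00376-SRW Document 1-2 Filed 04/25/2006 Page 1 of 1\n"
two_line_header = (
    "Case 2:16-bk-24364-NB Doc 1 Filed 10/31/16 Entered 10/31/16 11:52:49 Desc\n"
    "Main Document Page 1 of 9\n"
)

print(needs_ocr_simple(one_line_header))  # True: only the header was extracted
print(needs_ocr_simple(two_line_header))  # False: the second header line looks like content
```

The second result is the bug: a scanned bankruptcy document with no real text still passes the test because the second header line doesn't start with "Case".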

mlissner commented 7 years ago

When the time comes to fix this, the code is here:

https://github.com/freelawproject/courtlistener/blob/252ff7f4a1e1627c9ee75e5e82de8361bdff9492/cl/lib/recap_utils.py#L76

mlissner commented 7 years ago

Another example failing the OCR test: https://www.courtlistener.com/docket/4538240/1/3/bison-resources-corporation-v-antero-resources-corporation/

mlissner commented 6 years ago

And another example: https://ia800304.us.archive.org/4/items/gov.uscourts.cacb.1466705/gov.uscourts.cacb.1466705.1.0.pdf

mlissner commented 6 months ago

When I filed this about eight years ago, it affected something like 50k items. Now the same search query returns about 1.3M. We should probably do something about this.

I spent some time today looking at some of these. I think the following is a really sloppy way to improve the needs_ocr function, but it definitely helps:

import re


def needs_ocr(content):
    """Determines if OCR is needed for a PACER PDF.

    Every document in PACER (pretty much) has the case number written on the
    top of every page. This is a great practice, but it means that to test if
    OCR is needed, we need to remove this text and see if anything is left. The
    line usually looks something like:

        Case 2:06-cv-00376-SRW Document 1-2 Filed 04/25/2006 Page 1 of 1
        Appeal: 15-1504 Doc: 6 Filed: 05/12/2015 Pg: 1 of 4
        Appellate Case: 14-3253 Page: 1 Date Filed: 01/14/2015 Entry ID: 4234486
        USCA Case #16-1062 Document #1600692 Filed: 02/24/2016 Page 1 of 3
        USCA11 Case: 21-12355 Date Filed: 07/13/202 Page: 1 of 2

    Some bankruptcy cases also have two-line headers due to the document
    description being in the header. That means the second line often looks
    like:

        Page 1 of 90
        Main Document Page 1 of 16
        Document     Page 1 of 12
        Invoices Page 1 of 57
        A - RLF Invoices Page 1 of 83
        Final Distribution Report Page 1 of 5

    This function removes these lines so that if no text remains, we can be sure
    that the PDF needs OCR.

    :param content: The content of a PDF.
    :return: boolean indicating if OCR is needed.
    """
    bad_starters = ("Appellate", "Appeal", "Case", "Page", "USCA")
    pagination_re = re.compile(r"Page\s+\d+\s+of\s+\d+")
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(bad_starters):
            continue
        elif pagination_re.search(line):
            continue
        elif line:
            # We found a line with good content. No OCR needed.
            return False

    # We arrive here if no line was found containing good content.
    return True

It'd be good to do a little more work on this. Maybe pull 100 random items and see what low-hanging fruit we can find. Doesn't have to be perfect, but better would be nice.
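
One way to go about that sampling, as a rough sketch: take a random sample of extracted texts, find the first line in each that the check treats as real content, and tally those lines with digits masked so that similar headers group together. Everything here is hypothetical: the helper names, the `texts` input, and the made-up sample headers. The header rules just mirror the snippet above, and loading the texts is left out.

```python
import random
import re
from collections import Counter

# Same header rules as the needs_ocr snippet above.
BAD_STARTERS = ("Appellate", "Appeal", "Case", "Page", "USCA")
PAGINATION_RE = re.compile(r"Page\s+\d+\s+of\s+\d+")


def first_content_line(content):
    """Return the first line the check would treat as content, or None."""
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith(BAD_STARTERS) or PAGINATION_RE.search(line):
            continue
        return line
    return None


def tally_leftover_lines(texts, sample_size=100, seed=0):
    """Sample extracted texts and count the lines that made them skip OCR."""
    rng = random.Random(seed)
    sample = rng.sample(texts, min(sample_size, len(texts)))
    counts = Counter()
    for text in sample:
        line = first_content_line(text)
        if line is not None:
            # Mask digits so headers that differ only in numbers group together.
            counts[re.sub(r"\d+", "#", line)] += 1
    return counts


# Made-up inputs, only to show the shape of the output.
texts = [
    "Case 1:20-cv-00001 Document 1 Filed 01/01/2020 Page 1 of 2\n",
    "Doc 12 Filed 03/04/19\nPage 1 of 3\n",
    "Doc 7 Filed 05/06/18\nPage 2 of 3\n",
]
for pattern, count in tally_leftover_lines(texts).most_common(10):
    print(count, pattern)
```

Patterns that show up often in the tally and are clearly headers are the candidates for new rules. Real content will mostly show up as one-off lines.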

mlissner commented 3 months ago

When we get to this, let's make sure to consult @flooie, since he's been working in this area.