mlissner opened this issue 8 years ago
When the time comes to fix this, the code is here:
Another example failing the OCR test: https://www.courtlistener.com/docket/4538240/1/3/bison-resources-corporation-v-antero-resources-corporation/
When I filed this about eight years ago, it affected something like 50k cases. Now it's about 1.3M that are in the same search query. We should probably do something about this.
I spent some time today looking at some of these. I think the following is a really sloppy way to improve the `needs_ocr` function, but it definitely helps:
```python
import re

def needs_ocr(content):
    """Determine if OCR is needed for a PACER PDF.

    Every document in PACER (pretty much) has the case number written on
    the top of every page. This is a great practice, but it means that to
    test if OCR is needed, we need to remove this text and see if anything
    is left. The line usually looks something like:

        Case 2:06-cv-00376-SRW Document 1-2 Filed 04/25/2006 Page 1 of 1
        Appeal: 15-1504 Doc: 6 Filed: 05/12/2015 Pg: 1 of 4
        Appellate Case: 14-3253 Page: 1 Date Filed: 01/14/2015 Entry ID: 4234486
        USCA Case #16-1062 Document #1600692 Filed: 02/24/2016 Page 1 of 3
        USCA11 Case: 21-12355 Date Filed: 07/13/202 Page: 1 of 2

    Some bankruptcy cases also have two-line headers due to the document
    description being in the header. That means the second line often
    looks like:

        Page 1 of 90
        Main Document Page 1 of 16
        Document Page 1 of 12
        Invoices Page 1 of 57
        A - RLF Invoices Page 1 of 83
        Final Distribution Report Page 1 of 5

    This function removes these lines so that if no text remains, we can
    be sure that the PDF needs OCR.

    :param content: The content of a PDF.
    :return: boolean indicating if OCR is needed.
    """
    bad_starters = ("Appellate", "Appeal", "Case", "Page", "USCA")
    pagination_re = re.compile(r"Page\s+\d+\s+of\s+\d+")
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(bad_starters):
            continue
        elif pagination_re.search(line):
            continue
        elif line:
            # We found a line with good content. No OCR needed.
            return False
    # We arrive here if no line was found containing good content.
    return True
```
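For what it's worth, the pagination pattern is what catches all of the two-line bankruptcy headers listed in the docstring, since each of them ends in "Page X of Y". A quick check, assuming the regex as written above:

```python
import re

pagination_re = re.compile(r"Page\s+\d+\s+of\s+\d+")

# Second-line bankruptcy headers from the examples above:
headers = [
    "Page 1 of 90",
    "Main Document Page 1 of 16",
    "Document Page 1 of 12",
    "Invoices Page 1 of 57",
    "A - RLF Invoices Page 1 of 83",
    "Final Distribution Report Page 1 of 5",
]
print(all(pagination_re.search(h) for h in headers))  # → True
```

The flip side is that any body line that happens to contain "Page X of Y" is also skipped, which is part of why this is sloppy.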
It'd be good to do a little more work on this. Maybe pull 100 random items and see what low-hanging fruit we can find. Doesn't have to be perfect, but better would be nice.
When we get to this, let's make sure to consult @flooie, since he's been working in this area.
We have a super simple OCR system right now for PACER docs. Basically, we extract the text using `pdftotext`, and then remove any line that starts with "Case", which removes the headers on most documents. If there's still text remaining, we assume OCR isn't needed.
Alas, some documents (mostly in bankruptcy courts) have two lines for their header that look more like:
About 50,000 of these items need OCR but didn't get it, so we need to make the OCR testing algorithm a little better, and we need to run OCR on these items.