freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Came across a non-searchable pdf #116

Closed: freelawbot closed this issue 10 years ago

freelawbot commented 10 years ago

I've not seen this before, but it turns out that the Arista Records 2nd Circuit opinion does not contain searchable text, so it doesn't parse. See: http://www.ca2.uscourts.gov/decisions/isysquery/1373c2ba-7ccb-4426-a68f-a82861dbd4ed/1/doc/09-0905-cv_opn.pdf

We should add an extra check at the end of the parser run that verifies whether any text was put into the document text field. If not, the opinion page should show a notice along the lines of:

"We were unable to parse this pdf from the court. Perhaps it is not searchable text. You may download the pdf [from the court|hyperlink] or from [our backup|hyperlink]."

I wouldn't want to put this message into the document text field of the database, since any words we use in the message would then show up in searches. Instead, there should be some sort of flag in the template that we can set to display this message when needed.
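A minimal sketch of the check described above. The function and field names here (`extraction_succeeded`, `plain_text`, `extraction_failed`) are hypothetical, not taken from the CourtListener codebase; the point is only that the flag lives in the template context, never in the searchable text field:

```python
def extraction_succeeded(plain_text: str) -> bool:
    """Return True if the parser appears to have extracted real text.

    A purely image-based PDF typically yields an empty or
    whitespace-only string after text extraction.
    """
    return bool(plain_text and plain_text.strip())


def template_context(document):
    """Build the context for the opinion page template.

    ``extraction_failed`` is the flag the template can check to show
    the "We were unable to parse this pdf" notice, so the message
    never enters the searchable document text field.
    """
    return {
        "document": document,
        "extraction_failed": not extraction_succeeded(document.plain_text),
    }
```

The template would then test `extraction_failed` and render the notice with the two download links.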


freelawbot commented 10 years ago

This will be resolved as soon as the system can run the OCR over the corpus.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

When the OCR script is ready, it would probably be a good idea to implement it using celery: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html

As I understand it, this will allow the processing to be moved to the background as a low priority task, which will probably be useful if we get to the point where a lot of OCR is being done.
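Since celery may not be available everywhere, here is a stdlib-only sketch of the same idea: failed extractions go onto a queue and a background worker OCRs them at its own pace. A real implementation would replace the queue and thread with a low-priority celery task, and `run_ocr` is a stand-in, not a real OCR call:

```python
import queue
import threading

ocr_queue: queue.Queue = queue.Queue()
results: dict = {}


def run_ocr(path: str) -> str:
    # Stand-in for the real OCR step (e.g. running an OCR engine
    # over page images rendered from the PDF).
    return f"OCR text for {path}"


def worker() -> None:
    # Drains the queue in the background, analogous to a celery
    # worker consuming a low-priority task queue.
    while True:
        path = ocr_queue.get()
        if path is None:  # sentinel: shut the worker down
            ocr_queue.task_done()
            break
        results[path] = run_ocr(path)
        ocr_queue.task_done()


def enqueue_for_ocr(path: str) -> None:
    # Called when the parser finds no extractable text in a PDF.
    ocr_queue.put(path)
```

Usage: start a `threading.Thread(target=worker, daemon=True)`, call `enqueue_for_ocr()` for each unparseable PDF, and the page request never blocks on OCR.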


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Some good cleanup code is here: https://github.com/documentcloud/docsplit/blob/master/lib/docsplit/text_cleaner.rb


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

These folks have an API that should be investigated: http://blog.documentcloud.org/blog/2010/11/improving-the-quality-of-ocr/

and the note that "The OCR cleanup code is available as part of the latest release of our Docsplit project." suggests there may also be code worth reviewing. Their links to academic papers on OCR are also worth remembering.


Original Comment By: Brian Carver

freelawbot commented 10 years ago

Working on this bug a bit. It looks like the Google solution above will only import the doc into Google Docs, so that's not useful.

Google does, however, sponsor the OCRopus project, but it looks more complicated than we need: https://code.google.com/p/ocropus/

There are some other solutions I'll post later that involve us doing OCR, but none of them that I have found so far are very good.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Saw this today: Google is offering an OCR API. It uploads the text and the doc to Google Docs, and it has a quota, but given the number of documents that have this problem, it could be a decent solution.

http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#OCR


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Altlaw seems to have come across this same problem. Their approach is detailed here: http://lawcommons.org/trac/ticket/7


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

http://courtlistener.com/ca2/Arista-Records-LLC-v.-Doe-3/


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

It looks like in this case there is at least some text coming from the document (although very little).

We'll need some threshold under which we assume the document is just images. The 'length' template filter will make this easy, though.
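A sketch of such a threshold check. The cutoff value and names are illustrative guesses, not from the codebase; in a Django template the equivalent test would use the `length` filter mentioned above (e.g. `{% if document.plain_text|length < 100 %}`):

```python
# Below this many characters of extracted text we assume the PDF is
# scanned images and needs OCR. 100 is an illustrative cutoff, not a
# value from the CourtListener codebase.
OCR_THRESHOLD = 100


def needs_ocr(plain_text: str) -> bool:
    """True when so little text was extracted that the PDF is
    probably image-only, like the Arista Records opinion above
    (which yielded some text, but very little)."""
    return len(plain_text.strip()) < OCR_THRESHOLD
```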


Original Comment By: Mike Lissner