Closed freelawbot closed 10 years ago
This will be resolved as soon as the system can run the OCR over the corpus.
Original Comment By: Mike Lissner
When the OCR script is ready, it would probably be a good idea to implement it using celery: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html
As I understand it, this will allow the processing to be moved to the background as a low priority task, which will probably be useful if we get to the point where a lot of OCR is being done.
Original Comment By: Mike Lissner
Some good cleanup code is here: https://github.com/documentcloud/docsplit/blob/master/lib/docsplit/text_cleaner.rb
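That cleaner is Ruby; a rough Python analogue of its core heuristic might look like the following (the ratio threshold is my guess, not docsplit's actual constant): drop lines where OCR noise characters dominate.

```python
import re

# Characters that are neither alphanumeric nor whitespace count as "crud".
CRUD_RE = re.compile(r"[^a-zA-Z0-9\s]")


def clean_ocr_text(text, max_crud_ratio=0.5):
    """Drop lines whose ratio of noise characters exceeds the threshold.

    A simplified take on docsplit's text_cleaner heuristic; the 0.5
    threshold is an assumption to tune against real OCR output.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append(line)
            continue
        crud = len(CRUD_RE.findall(stripped))
        if crud / len(stripped) <= max_crud_ratio:
            kept.append(line)
    return "\n".join(kept)
```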
Original Comment By: Mike Lissner
These folks have an API that should be investigated: http://blog.documentcloud.org/blog/2010/11/improving-the-quality-of-ocr/
and "The OCR cleanup code is available as part of the latest release of our Docsplit project." suggests that there may also be code worth reviewing. Their links to academic papers re OCR are also worth remembering.
Original Comment By: Brian Carver
Working on this bug a bit. It looks like the Google solution above will only import the doc into Google Docs, so that's not useful.
Google does, however, sponsor the OCRopus project, but it looks more complicated than we need: https://code.google.com/p/ocropus/
There are some other solutions I'll post later that involve us doing OCR, but none of them that I have found so far are very good.
Original Comment By: Mike Lissner
Saw this today: Google is offering an OCR API. It uploads the doc and the extracted text to Google Docs, and has a quota, but given the number of documents that have this problem, it could be a decent solution.
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#OCR
Original Comment By: Mike Lissner
Altlaw seems to have come across this same problem. Their approach is detailed here: http://lawcommons.org/trac/ticket/7
Original Comment By: Mike Lissner
It looks like in this case there is at least some text coming from the document (although very little).
We'll need some threshold below which we assume the document is just images. The 'length' template tag will make this easy, though.
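The threshold check itself is trivial; something like this (the 100-character cutoff and the function name are placeholders to be tuned against real documents):

```python
def appears_scanned(extracted_text, min_chars=100):
    """Heuristic: if the parser pulled out almost no text, the PDF is
    probably page images and should be queued for OCR.

    min_chars=100 is a guess; whitespace-only output counts as empty.
    """
    return len((extracted_text or "").strip()) < min_chars
```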
Original Comment By: Mike Lissner
I've not seen this before, but it turns out that the Arista Records 2nd Circuit opinion is not searchable text and so it doesn't parse. See: http://www.ca2.uscourts.gov/decisions/isysquery/1373c2ba-7ccb-4426-a68f-a82861dbd4ed/1/doc/09-0905-cv_opn.pdf
We should have an extra check on the end of the parser run that checks to see if it put any text into the document text field. If not, it should produce a notice on that opinion page along the lines of:
"We were unable to parse this pdf from the court. Perhaps it is not searchable text. You may download the pdf [from the court|hyperlink] or from [our backup|hyperlink]."
I wouldn't want to put this message into the document text field of the database, as any words we use in our message would then show up in searches. Instead, there should be some sort of flag in the template that we can set to display this message when needed.
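One way to carry that flag without polluting the search index, sketched with plain Python stand-ins (the class and field names are hypothetical; in the real Django schema the flag would be a BooleanField the template can test directly):

```python
from dataclasses import dataclass


@dataclass
class Opinion:
    """Hypothetical stand-in for the Django model."""
    plain_text: str = ""
    extraction_failed: bool = False


FALLBACK_NOTICE = (
    "We were unable to parse this PDF from the court. Perhaps it is not "
    "searchable text. You may download the PDF from the court or from "
    "our backup."
)


def notice_for(opinion):
    """Return the user-facing notice only when the flag is set.

    Because the message never enters the indexed text field, its words
    can't show up in search results.
    """
    return FALLBACK_NOTICE if opinion.extraction_failed else None
```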