Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Gale local ocr code doesn't handle json decode error #692

Open rlskoeser opened 2 days ago

rlskoeser commented 2 days ago

Apparently some of the collated OCR JSON files were corrupt, and the page indexing code doesn't handle the resulting JSON decode error. For some reason the problem didn't occur when testing in staging, likely because the contents of production and staging are somewhat different (although even after attempting to replicate the data I could not reproduce the error).
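For illustration, here is a minimal sketch of the kind of handling that seems to be missing. It is not the actual ppa-django code: the function name `load_local_ocr` and the idea of returning `None` so the caller can fall back to Gale OCR are assumptions made for this example.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_local_ocr(ocr_path):
    """Return parsed OCR data, or None if the file is missing or unreadable.

    Hypothetical helper: the name and the None-as-fallback convention are
    assumptions for this sketch, not part of the real indexing code.
    """
    try:
        with Path(ocr_path).open(encoding="utf-8") as ocr_file:
            return json.load(ocr_file)
    except FileNotFoundError:
        # No local OCR for this volume; the caller decides what to use instead.
        return None
    except (json.JSONDecodeError, UnicodeDecodeError) as err:
        # Corrupt file: log the path so the affected volume can be identified,
        # then let indexing continue instead of crashing.
        logger.warning("Skipping unreadable OCR file %s: %s", ocr_path, err)
        return None
```

With something like this, a caller could treat a `None` result as "no usable local OCR" and index the volume with Gale OCR, while the warning log records which files need to be regenerated.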

I did a quick edit in production to catch the error and report the volumes that are causing problems. These are the ids:

I've deleted the bad JSON files from the directories on tigerdata, which allowed indexing to proceed using Gale OCR. Remaining work: