Apparently some of the collated ocr json files were corrupt, and the page indexing code doesn't handle that error. For some reason the problem didn't occur when testing in staging, likely because the contents of production and staging are somewhat different (although even after attempting to replicate data I could not reproduce the error).
I did a quick edit in production to catch the error and report the volumes that are causing problems, these are the ids:
CB0127495592
CW0110723387
CW0117319378
I've deleted the bad JSON files from the directories on tigerdata, which allowed indexing to proceed using Gale OCR. Remaining work:
[x] regenerate collated ocr files for these three volumes and add to the correct space on tigerdata
[x] check in and test the code change for catching and reporting on json decode errors
[ ] update tests to confirm expected logging behavior
Apparently some of the collated ocr json files were corrupt, and the page indexing code doesn't handle that error. For some reason the problem didn't occur when testing in staging, likely because the contents of production and staging are somewhat different (although even after attempting to replicate data I could not reproduce the error).
I did a quick edit in production to catch the error and report the volumes that are causing problems, these are the ids:
I've deleted the bad JSON files from the directories on tigerdata, which allowed indexing to proceed using Gale OCR. Remaining work: