I've been doing some testing in staging to see how the revised OCR file structure impacts the time to complete indexing all of our Gale content.
The first time I ran it, it took a little over an hour to complete, a substantial improvement over the ~5 hours I saw most recently in production. I was curious whether tigerdata was slowing this down substantially, so I copied the OCR JSON files to the local filesystem and ran it again. The time was basically the same as when indexing with OCR files on tigerdata. I suspected that filesystem reads and waiting on the Gale API (which can be slow for larger volumes) cause enough slowdowns that CPU computation isn't the bottleneck, so I tested indexing with 4 processes (even though we only have 2 CPUs on this VM), and indexing completed in 30 minutes.
That seems like a reasonable enough improvement to me. When we set up the cron job we should use -p 4 to index with 4 processes.
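As a rough illustration of why more workers than CPUs can help when most of the time is spent waiting, here is a minimal Python sketch. It is not the project's indexing code: `index_volume` and `run_indexing` are hypothetical names, the sleep stands in for filesystem reads and Gale API waits, and a thread pool stands in for the indexer's worker processes to keep the example self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def index_volume(volume_id, wait_seconds=0.05):
    """Hypothetical stand-in for indexing one volume.

    The sleep models time spent waiting on I/O (file reads, API responses),
    not CPU work, so the worker is idle for almost the whole call.
    """
    time.sleep(wait_seconds)
    return volume_id


def run_indexing(volume_ids, workers):
    """Index all volumes with `workers` concurrent workers.

    Returns (results in input order, elapsed wall-clock seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(index_volume, volume_ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    volumes = list(range(16))
    # Compare wall-clock time as the worker count grows.
    for workers in (1, 2, 4):
        _, elapsed = run_indexing(volumes, workers)
        print(f"{workers} worker(s): {elapsed:.2f}s")
```

Because each simulated task is almost entirely wait time, elapsed time should keep dropping as workers are added, regardless of how many CPUs the machine has. Real indexing also does some computation, so the gain there would be expected to level off sooner.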
Originally posted by @laurejt in #668