Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Improve Gale indexing speed for local OCR #673

jerielizabeth closed this issue 2 days ago

jerielizabeth commented 1 month ago

Estimating time difference in indexing Gale volumes with current fix:

Test case: index 10 Gale volumes

  • Volumes: CB0127060085 CB0126086107 CB0127060264 CB0128005380 CB0128002412 CW0115524470 CW0111606033 CW0114828122 CW0125924161 CW0110904426

  • Pages: 1,762 pages

Timing:

  • Production (original): 11 seconds

  • Test (new): 1 minute 14 seconds

The new indexing is roughly 7 times slower (74 seconds versus 11 seconds for the same 10 volumes) and should be optimized in the future.

Originally posted by @laurejt in #668
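
For reference, the timings in the test case above can be turned into a slowdown factor and a rough pages-per-second rate. This is only a back-of-the-envelope sketch using the numbers reported in the comment; the variable names are illustrative and not part of the ppa-django codebase.

```python
# Numbers taken from the test case above: 10 Gale volumes, 1,762 pages.
pages = 1762
original_seconds = 11           # production (original indexing)
new_seconds = 1 * 60 + 14       # test (new indexing): 1 minute 14 seconds

# How many times slower the new indexing is on this sample.
slowdown = new_seconds / original_seconds

# Approximate throughput in pages per second for each version.
original_rate = pages / original_seconds
new_rate = pages / new_seconds

print(f"slowdown: {slowdown:.1f}x")
print(f"original: {original_rate:.1f} pages/s, new: {new_rate:.1f} pages/s")
```

A 10-volume sample is small, so these rates should be read as rough indicators rather than benchmarks.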

rlskoeser commented 1 week ago

I've been doing some testing in staging to see how the revised OCR file structure impacts the time to complete indexing all of our Gale content.

The first time I ran it, it took a little over an hour to complete, a substantial improvement over the ~5 hours I saw most recently in production. I was curious whether tigerdata was slowing this down substantially, so I copied the OCR JSON files to the local filesystem and ran it again; the time was basically the same as when indexing with OCR files on tigerdata. I suspected there are enough slowdowns between filesystem reads and waiting on the Gale API (which can be slow for larger volumes) that CPU computation isn't the bottleneck, so I tested indexing with 4 processes (even though we only have 2 CPUs on this VM), and indexing completed in 30 minutes.

That seems like a reasonable enough improvement to me. When we set up the cron job we should use -p 4 to index with 4 processes.
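
As a rough summary of the staging timings in this comment, the speedups can be worked out as below. This is a sketch under stated assumptions: "~5 hours" is treated as 300 minutes, "a little over an hour" as 60 minutes, and the first staging run is assumed to have used a single process, which the comment does not state explicitly. The variable names are illustrative only.

```python
# Approximate wall-clock times in minutes, rounded from the comment above.
production_minutes = 5 * 60      # "~5 hours" most recently seen in production
staging_minutes = 60             # "a little over an hour", rounded down to 60
staging_4proc_minutes = 30       # staging run with 4 processes
processes = 4

# Gain attributed to the revised OCR file structure alone.
structure_speedup = production_minutes / staging_minutes

# Additional gain from running 4 processes (assumes the first run used 1).
parallel_speedup = staging_minutes / staging_4proc_minutes

# Combined improvement over the production run.
overall_speedup = production_minutes / staging_4proc_minutes

# Fraction of the ideal 4x speedup actually achieved.
parallel_efficiency = parallel_speedup / processes

print(f"file structure: {structure_speedup:.0f}x")
print(f"4 processes:    {parallel_speedup:.0f}x")
print(f"overall:        {overall_speedup:.0f}x")
print(f"efficiency:     {parallel_efficiency:.0%}")
```

Under these assumptions, four processes give about half of the ideal 4x speedup, so further gains from adding processes may be limited on a 2-CPU VM.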