FLVC: Check book and newspaper issues on sites that had OCR turned off - Githubissues

FLVC / flvc

FLVC-specific Islandora Hooks


FLVC: Check book and newspaper issues on sites that had OCR turned off #94

Closed wrandtkeflvc closed 6 months ago

wrandtkeflvc commented 5 years ago

This is a follow-up to CAS-71623-C1N6.

30 test and production sites have been configured not to run OCR on pages loaded through the zip loader. This affects Book Content Model and Newspaper Issue Content Model objects.

For each site, we need to determine whether any books or newspaper issues were loaded with the zip loader while OCR was off, and then run OCR on those. Most sites are newer or hold mostly migrated Digitool materials, so probably only a small number of objects are affected.

Here's a list of the production sites affected:
https://fgcu.digital.flvc.org
https://hccfl.digital.flvc.org
https://fiu.digital.flvc.org
https://gcsc.digital.flvc.org
https://irsc.digital.flvc.org
https://lssc.digital.flvc.org
https://nwfsc.digital.flvc.org
https://scf.digital.flvc.org
https://spc.digital.flvc.org (German was the only extra language checked off, and actually OCR works with German, English, French, and Italian as options)
https://uwf.digital.flvc.org
https://palmm.digital.flvc.org (probably doesn't matter, since this one is just pulling objects from other sites)

Here's a list of the test sites affected (I've left test sites on so that, if an OCR datastream turns out to be missing later, it's possible to know that this is the cause, rather than the datastream having been deleted or not showing up in search results):
https://fgcu-test.digital.flvc.org
https://hccfl-test.digital.flvc.org
https://broward-test.digital.flvc.org
https://famu-test.digital.flvc.org
https://fau-test.digital.flvc.org
https://fiu-test.digital.flvc.org
https://fscj-test.digital.flvc.org
https://gcsc-test.digital.flvc.org
https://gcsc.digital.flvc.org
https://irsc-test.digital.flvc.org
https://lssc-test.digital.flvc.org
https://nwfsc-test.digital.flvc.org
https://scf-test.digital.flvc.org
https://spc-test.digital.flvc.org (German was the only extra language checked off, and actually OCR works with German, English, French, and Italian as options)
https://uf-test.digital.flvc.org
https://ucf-test.digital.flvc.org
https://unf-test.digital.flvc.org
https://usf-test.digital.flvc.org
https://uwf-test.digital.flvc.org
https://palmm-test.digital.flvc.org (probably doesn't matter, since this one is just pulling objects from other sites)

Recommended step to fix: This needs programmer assistance. One option is to query the OCR datastream on page objects and flag any run of 5 consecutive PIDs whose OCR datastream is 0 MB.
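A minimal sketch of the flagging step, assuming the page PIDs and their OCR datastream sizes have already been gathered into an ordered list by some separate query (not shown here). The function name `flag_empty_ocr_runs`, the `(pid, size_in_bytes)` input shape, the `demo:` PIDs, and the choice to treat a missing datastream (`None`) the same as an empty one are all assumptions made for illustration, not part of any Islandora or FLVC code.

```python
from typing import Iterable, List, Optional, Tuple


def flag_empty_ocr_runs(
    pages: Iterable[Tuple[str, Optional[int]]],
    min_run: int = 5,
) -> List[List[str]]:
    """Return runs of at least `min_run` consecutive PIDs with an empty OCR datastream.

    `pages` is assumed to be in the order the pages should be checked
    (for example, page sequence order). A size of 0 or None counts as empty.
    """
    runs: List[List[str]] = []
    current: List[str] = []
    for pid, size in pages:
        if not size:  # 0 bytes, or None for a missing datastream (assumption)
            current.append(pid)
        else:
            # A non-empty OCR datastream ends the current run.
            if len(current) >= min_run:
                runs.append(current)
            current = []
    # Handle a run that extends to the end of the input.
    if len(current) >= min_run:
        runs.append(current)
    return runs


# Hypothetical sample data: five empty pages in a row, then two more at the end.
sample = [
    ("demo:1", 2048),
    ("demo:2", 0),
    ("demo:3", 0),
    ("demo:4", 0),
    ("demo:5", 0),
    ("demo:6", 0),
    ("demo:7", 512),
    ("demo:8", 0),
    ("demo:9", 0),
]

flagged = flag_empty_ocr_runs(sample)
for run in flagged:
    print(f"{len(run)} consecutive empty OCR datastreams: {run[0]} .. {run[-1]}")
```

Requiring a run rather than flagging single pages avoids false positives from pages that are legitimately blank, such as covers or blank versos, which can have an empty OCR datastream even when OCR ran.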

(Was CRM no. CAS-71642-C9G0; https://flvc.crm.dynamics.com:443/main.aspx?etc=112&id=12acd0bb-1745-e611-80fd-3863bb34eb30&histKey=698582926&newWindow=true&pagetype=entityrecord)