Closed rlskoeser closed 23 hours ago
@laurejt determined as part of reviewing the draft Google Vision API script that we don't need to resize the Gale images for OCR purposes. We will still need to convert from TIFF to a supported web format for display in the Prodigy annotation interface, and we may want to resize the HT images (or download a different size going forward) because the ones we have now are larger than we need, but that's less urgent.
I think we can keep this issue open - it's not a blocker for OCR, but it will be needed for annotation.
Dependent on Wouter's findings on the usefulness of image annotations; especially important for round 3.
Page images need to be converted for annotation in Prodigy.
For Gale images, this is a blocker (most browsers won't display TIFFs); for HathiTrust images, this is an efficiency / data transfer size concern.
We can use ImageMagick / mogrify to batch convert; example command for Gale TIFFs:
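A minimal Python sketch of this kind of batch conversion, assuming ImageMagick's `mogrify` is installed and on the PATH. The function names, directory arguments, and optional width are hypothetical and are not taken from the command used for this project; `-format`, `-path`, and `-resize` are standard mogrify options.

```python
import subprocess
from pathlib import Path


def build_mogrify_command(tiff_paths, out_dir, width=None):
    """Build a mogrify argument list that writes JPEG copies into out_dir.

    Using -path keeps the source TIFFs untouched; without it, mogrify
    writes its output next to the input files.
    """
    cmd = ["mogrify", "-format", "jpg", "-path", str(out_dir)]
    if width is not None:
        # "500x" style geometry: fix the width, keep the aspect ratio
        cmd += ["-resize", f"{width}x"]
    cmd += [str(p) for p in tiff_paths]
    return cmd


def convert_batch(src_dir, out_dir, width=None):
    """Convert every *.tif directly under src_dir (hypothetical layout)."""
    tiffs = sorted(Path(src_dir).glob("*.tif"))
    if not tiffs:
        return []
    # mogrify does not create the -path directory itself
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_mogrify_command(tiffs, out_dir, width), check=True)
    return tiffs
```

Separating the command construction from the `subprocess.run` call makes it possible to inspect or log the exact command before running it on a large batch.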
We want to do a one-time conversion (in batches, as needed) and store the converted images on TigerData so we can display them in Prodigy. We'll need to come up with a naming convention (e.g. image_w500.jpg) or possibly a parallel directory structure so we can reference the appropriate image, and then we'll need to update the jsonl data we pass to Prodigy to use the correct path.
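One way the naming convention and the jsonl update could fit together is sketched below in Python. This is only an illustration: the `_w500.jpg` suffix follows the example above, while the function names and the assumption that each record stores its image path under an `image` key are hypothetical and would need to match the actual data.

```python
import json
from pathlib import PurePosixPath


def converted_name(source_path, width=500):
    """Map e.g. 'vol1/0001.tif' to 'vol1/0001_w500.jpg' (same directory)."""
    p = PurePosixPath(source_path)
    return str(p.with_name(f"{p.stem}_w{width}.jpg"))


def update_jsonl_lines(lines, width=500, key="image"):
    """Yield jsonl lines with the image path rewritten to the converted file.

    `key` is an assumed field name; records without it pass through unchanged.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            record[key] = converted_name(record[key], width)
        yield json.dumps(record)
```

A parallel directory structure would work the same way, with `converted_name` swapping a leading directory instead of appending a suffix.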
Unclear where / how we want to handle this conversion. It could be implemented as a step in the cdh-ansible prodigy deploy, or it could be managed by a script in corppa.