Princeton-CDH / ppa-nlp

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

automate page image conversion and resizing #46

Open rlskoeser opened 2 months ago

rlskoeser commented 2 months ago

Page images need to be converted for annotation in Prodigy.

For Gale images, this is a blocker (most browsers won't display TIFFs); for HathiTrust images, this is an efficiency / data transfer size concern.

We can use ImageMagick / mogrify to batch convert; example command for Gale TIFFs:

 mogrify -format jpg -resize 500 */*.TIF

We want to do a one-time conversion (in batches, as needed) and store the converted images on TigerData so we can display them in Prodigy. We'll need to come up with a naming convention (e.g. image_w500.jpg) or possibly a parallel directory structure so we can reference the appropriate image, and then we'll need to update the JSONL data we pass to Prodigy to use the correct path.
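A rough sketch of what the naming convention and the JSONL update could look like, to make the idea concrete. Everything here is an assumption rather than a decision: the `_w500` suffix follows the example name above, the helper names (`derivative_path`, `update_record`, `update_jsonl`) are hypothetical, and it assumes each JSONL record stores a local, relative image path under an `image` key.

```python
import json
from pathlib import Path


def derivative_path(source: Path, width: int = 500) -> Path:
    """Return the path of the converted JPEG for a source page image.

    Hypothetical convention: same directory, original stem plus a
    `_w<width>` suffix, e.g. `0001.TIF` -> `0001_w500.jpg`.
    """
    return source.with_name(f"{source.stem}_w{width}.jpg")


def update_record(record: dict, width: int = 500, field: str = "image") -> dict:
    """Return a copy of one task record that points at the converted image.

    Assumes `field` holds a local relative path, not a URL.
    """
    updated = dict(record)
    updated[field] = derivative_path(Path(record[field]), width).as_posix()
    return updated


def update_jsonl(src: Path, dest: Path, width: int = 500) -> int:
    """Rewrite every record in a JSONL file; return the number of records written."""
    count = 0
    with src.open(encoding="utf-8") as infile, dest.open("w", encoding="utf-8") as outfile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            record = update_record(json.loads(line), width)
            outfile.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A parallel directory structure would only change `derivative_path`; the JSONL rewrite would stay the same.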

Unclear where / how we want to handle this conversion. It could be implemented as a step in the cdh-ansible prodigy deploy, or it could be managed by a script in corppa.

rlskoeser commented 4 weeks ago

@laurejt determined as part of reviewing the draft Google Vision API script that we don't need to resize the Gale images for OCR purposes. We will still need to convert from TIFF to a supported web format for display in the Prodigy annotation interface, and we may want to resize the HT images (or download a different size going forward) because the ones we have now are larger than we need, but that's less urgent.
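Since resizing is now optional, here is a minimal sketch of how a script could build the `mogrify` call with the resize step switched on or off. The function name `mogrify_command` and its parameters are hypothetical; the `-format`, `-path`, and `-resize` options are standard ImageMagick ones. Note that `-path` writes all output into a single existing directory without preserving subdirectories, so this assumes one invocation per volume directory.

```python
from typing import List, Optional, Sequence


def mogrify_command(
    sources: Sequence[str],
    out_dir: Optional[str] = None,
    width: Optional[int] = None,
    fmt: str = "jpg",
) -> List[str]:
    """Build an ImageMagick mogrify argument list for batch conversion.

    - `fmt`: output format; with -format, originals are left in place.
    - `out_dir`: if given, converted files are written there (-path);
      the directory must already exist.
    - `width`: if given, resize to this width (-resize); omit to convert
      at full size, e.g. Gale TIFFs that only need a web format.
    """
    cmd = ["mogrify", "-format", fmt]
    if out_dir is not None:
        cmd += ["-path", out_dir]
    if width is not None:
        cmd += ["-resize", str(width)]
    cmd += list(sources)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it separately makes it easy to log or dry-run a batch before converting anything.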

I think we can keep this issue open: it's not a blocker for OCR, but it will be needed for annotation.