Open rlskoeser opened 5 days ago
progress so far:
still to do:
blockers:
Here's the `mogrify` command I was using for bulk convert + resize of the Gale images:

```sh
mogrify -format jpg -resize 500 */*.TIF
```
We probably want to resize the HathiTrust images as well; the versions we have now are quite a bit larger than we need for this interface.
I've updated this so that the tasks we're putting off and/or that need discussion and decisions are now separate tasks. The code for this is all in branches that I think should be reviewed, so I'm moving this to review.
Used command-line tools to make a test set of pages for annotation (or at least for training), with 176 pages from HathiTrust known to contain poetry and 200 other pages from the same volume.
I used the new filter script options to split the poetry test set into pages known to contain poetry and other pages. I then used `shuf -n 200` to select a random 200 lines from the non-/unknown-poetry set, `cat` to combine the 176 poetry pages and 200 other pages into a single JSONL file, and `shuf` again to randomize the order of the combined file (in case order matters for Prodigy; I'm actually not sure it does).
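The steps above can be sketched end to end. The input file names here are placeholders I've made up for illustration; the real inputs come from the filter script:

```shell
# Generate tiny dummy inputs so this sketch is self-contained;
# in practice these files come from the corppa filter script.
seq 176 | sed 's/.*/{"id": "poetry-&"}/' > poetry_pages.jsonl
seq 500 | sed 's/.*/{"id": "other-&"}/'  > other_pages.jsonl

# select a random 200 lines from the non-/unknown-poetry set
shuf -n 200 other_pages.jsonl > other_200.jsonl

# combine the 176 poetry pages and the 200 sampled pages into one JSONL file
cat poetry_pages.jsonl other_200.jsonl > combined.jsonl

# shuffle the combined file so poetry and other pages are interleaved
shuf combined.jsonl > poem_testset_300mixed_forprodigy_shuf.jsonl

wc -l < poem_testset_300mixed_forprodigy_shuf.jsonl   # 376
```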
I've put this file on tigerdata as `poem_testset_300mixed_forprodigy_shuf.jsonl` in `poetry-detection/poem-focused-testset/`, and have updated the test Prodigy instance to use this data file as input.
Adapt from Wouter's recipe and prep code shared in Google Drive; code will be added to ppa-nlp/corppa
separate but related tasks