Princeton-CDH / ppa-nlp

Text mining the stories of poetic forms

Code and Prodigy configuration for poem annotation task #39

Open rlskoeser opened 5 days ago

rlskoeser commented 5 days ago

Adapt from Wouter's recipe and prep code shared in Google Drive; code will be added to ppa-nlp/corppa

separate but related tasks

rlskoeser commented 4 days ago

progress so far:

still to do:

blockers:

rlskoeser commented 4 days ago

Here's the `mogrify` command I was using for bulk convert + resize of the Gale images:

```shell
mogrify -format jpg -resize 500 */*.TIF
```

We probably want to resize the HT images as well; I think the versions we have now are quite a bit larger than we need for this interface.

rlskoeser commented 3 days ago

I've updated this so that the tasks we're putting off and/or need to discuss and make decisions about are now separate tasks. The code for this is all in branches that I think should be reviewed, so I'm putting it under review.

rlskoeser commented 3 days ago

Using command line tools, I made a test set of pages for annotation (or at least for training) with 176 pages from HT known to contain poetry and 200 other pages from the same volume.

I used the new filter script options to split the poetry test set into pages known to contain poetry and other pages. I used `shuf -n 200` to select a random 200 lines from the non/unknown-poetry set, `cat` to combine the 176 poetry pages and 200 other pages into a single JSONL file, and then `shuf` again to randomize the order of the combined file (in case the order matters for Prodigy, although I don't actually know whether it does).
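For reference, the sampling and shuffling steps above look roughly like this; the input filenames (`poetry_pages.jsonl`, `other_pages.jsonl`) are hypothetical stand-ins for the filter script's output:

```shell
# Hypothetical filenames: poetry_pages.jsonl holds the 176 pages known
# to contain poetry; other_pages.jsonl holds the non/unknown-poetry pages.

# Randomly sample 200 lines from the non/unknown-poetry set
shuf -n 200 other_pages.jsonl > other_sample.jsonl

# Combine the poetry pages with the sampled pages, then shuffle the
# combined file so the two groups are interleaved rather than stacked
cat poetry_pages.jsonl other_sample.jsonl | shuf > poem_testset_shuf.jsonl
```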

I've put this file on tigerdata as `poem_testset_300mixed_forprodigy_shuf.jsonl` in `poetry-detection/poem-focused-testset/` and have updated the test Prodigy instance to use this datafile as input.