Princeton-CDH / ppa-nlp

Text mining the stories of poetic forms

Code and Prodigy configuration for poem annotation task #39

Open rlskoeser opened 5 days ago

rlskoeser commented 5 days ago

Adapt from Wouter's recipe and prep code shared in Google Drive; code will be added to ppa-nlp/corppa

separate but related tasks

rlskoeser commented 4 days ago

progress so far:

still to do:

blockers:

rlskoeser commented 4 days ago

Here's the `mogrify` command I was using for bulk convert + resize of the Gale images:

```shell
mogrify -format jpg -resize 500 */*.TIF
```

We probably want to resize the HT images as well; I think the versions we have now are quite a bit larger than we need for this interface.

rlskoeser commented 3 days ago

I've updated this so that the tasks we're putting off and/or need to discuss and make decisions about are now separate tasks. The code for this is all in branches that I think should be reviewed, so I'm putting it under review.

rlskoeser commented 3 days ago

Using command line tools, I made a test set of pages for annotation (or at least for training) with 176 pages from HT known to contain poetry and 200 other pages from the same volume.

I used the new filter script options to split the poetry test set into pages known to contain poetry and other pages. I used `shuf -n 200` to select a random 200 lines from the non/unknown-poetry set, `cat` to combine the 176 poetry pages and 200 other pages into a single JSONL file, and then `shuf` again to randomize the order of the combined file (in case the order matters for Prodigy, although I don't actually know whether it does).
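For reference, the sampling and shuffling steps above look roughly like this; the input filenames (`poetry_pages.jsonl`, `other_pages.jsonl`) are hypothetical stand-ins for the filter script's output:

```shell
# Hypothetical filenames: poetry_pages.jsonl holds the 176 pages known
# to contain poetry; other_pages.jsonl holds the non/unknown-poetry pages.

# Randomly sample 200 lines from the non/unknown-poetry set
shuf -n 200 other_pages.jsonl > other_sample.jsonl

# Combine the poetry pages with the sampled pages, then shuffle the
# combined file so the two groups are interleaved rather than stacked
cat poetry_pages.jsonl other_sample.jsonl | shuf > poem_testset_shuf.jsonl
```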

I've put this file on tigerdata as `poem_testset_300mixed_forprodigy_shuf.jsonl` in `poetry-detection/poem-focused-testset/` and have updated the test Prodigy instance to use this datafile as input.