Princeton-CDH / ppa-nlp

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

Write scripts to run basic text reuse pipeline #67

Closed mnaydan closed 1 month ago

mnaydan commented 2 months ago

I/O

Passim-specific

Evaluation

mnaydan commented 1 month ago

@laurejt I am testing the jsonl. Could you please reply with a comment here specifying the acceptance criteria? Are new lines important? What else am I looking for?

laurejt commented 1 month ago

@mnaydan The immediate goal of testing the jsonl is to examine the "text" field and confirm that it corresponds to the "full text" of the poem. I'm not sure how best to compare this beyond finding an external copy of the poem and checking that it "looks" right.

mnaydan commented 1 month ago

@laurejt thank you, that is helpful! I used Visual Studio Code and json-lines-viewer.preview to read the jsonl, and spot-checked a dozen or so poems. The text field does correspond to the full text of the poem as I would expect it, so I would consider this "tested" and "accepted."
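For anyone repeating this kind of spot check without an editor plugin, here is a minimal Python sketch of the same idea: pull a random handful of records from the jsonl and print the start of each "text" field for comparison against an external copy. It is an illustration only, not the script used in this issue. The only field it relies on is "text", which is named above; the function names, the sample size of 12, and the preview width are assumptions made for the example.

```python
import json
import random


def sample_records(lines, k=12, seed=None):
    """Parse JSONL lines and return up to k random (line_number, record) pairs."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        records.append((number, json.loads(line)))
    rng = random.Random(seed)  # pass a seed to make the sample repeatable
    return rng.sample(records, min(k, len(records)))


def preview(lines, k=12, seed=None, width=300):
    """Print the first `width` characters of the "text" field of each sampled record."""
    for number, record in sample_records(lines, k, seed):
        text = record.get("text", "")  # "text" is the field discussed in this thread
        print(f"--- line {number} ({len(text)} characters) ---")
        print(text[:width])
```

To use it, open the jsonl file with `encoding="utf-8"` and pass the file object to `preview`. Line breaks inside a poem are stored as escaped `\n` within the JSON string, so they print as real line breaks and can be compared by eye.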

jerielizabeth commented 1 month ago

Issues split out from this task