Princeton-CDH / ppa-nlp

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

As an NLP expert, I want to review the text corpus of the PPA test set, the existing preprocessing code, the decision log, and existing literature on OCR quality so that I can offer a recommendation on how the team should preprocess the text corpus moving forward. #14

jerielizabeth closed 4 months ago

jerielizabeth commented 4 months ago

Dependent on info from :

~Outcome - Answers to the following questions:~

~- [ ] do we need to revise any previous decisions about preprocessing~
~- [ ] do we need to modify / create code for preprocessing~

Actual outcomes

mnaydan commented 4 months ago

Here is the decision log, which I pulled together from past meeting notes.

For reference, here is the existing preprocessing code in the cleaning.py script in the repo, which was copied and pasted from Wouter's Colab notebook.

And here is the updated (new Hathi OCR) text corpus filtered for the test set.

laurejt commented 4 months ago

Literature (see group Zotero library):

Code Repositories: