Princeton-CDH/ppa-nlp
Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus
1 star, 0 forks
issues
deploy prodigy to production environment
#49
jerielizabeth
closed
2 months ago
2
Review guidelines for annotation task
#48
mnaydan
closed
2 months ago
0
configure prodigy for SSO
#47
jerielizabeth
opened
2 months ago
6
automate page image conversion and resizing
#46
rlskoeser
opened
3 months ago
2
Gather images and text for 35 overlap items for CHR paper
#44
jerielizabeth
closed
2 months ago
0
Prodigy poetry annotation recipes and data prep
#43
rlskoeser
closed
2 months ago
0
Revise filter script with options to include/exclude based on page attributes
#42
rlskoeser
closed
2 months ago
1
move util functions in scripts/helper.py to corppa utils
#41
rlskoeser
closed
1 month ago
0
Added helper function for stub directory logic
#40
laurejt
closed
3 months ago
0
Code and Prodigy configuration for poem annotation task
#39
rlskoeser
closed
2 months ago
6
review Hale script to prepare to re-OCR Gale material using Google Vision API
#38
rlskoeser
closed
2 months ago
0
mark single author/single poems in PPA metadata
#37
jerielizabeth
closed
1 month ago
1
transfer ECCO data from hard drives to SSD
#36
jerielizabeth
closed
1 month ago
1
Get Laure the new data export with ESTCs from ECCO-TCP
#35
jerielizabeth
closed
3 months ago
1
Outline CHR paper for OCR
#34
jerielizabeth
closed
3 months ago
1
Sketch out the NLP experimental steps for PPA
#33
jerielizabeth
closed
3 months ago
0
Release/0.1
#32
laurejt
closed
4 months ago
0
Feature/scripts update
#31
laurejt
closed
4 months ago
2
Scripts update & cleanup
#30
laurejt
closed
4 months ago
1
Gather new poem-focused PPA test set in a spreadsheet
#29
mnaydan
closed
4 months ago
1
Write a script to aggregate the results
#27
mnaydan
closed
2 months ago
0
Get an image and text aligned dataset for the new test set
#26
mnaydan
closed
4 months ago
1
Initial commit. Note that shared utility methods have been moved.
#25
laurejt
closed
4 months ago
0
Augment PPA Character Stats with document frequency
#24
mnaydan
closed
4 months ago
0
Create PPA page-level OCR quality evaluation
#23
mnaydan
closed
4 months ago
1
Take John Foley's code out for a spin
#22
mnaydan
closed
2 months ago
0
As an NLP expert, I want to assess the OCR quality of the pages in the test set so that I can offer a data-based recommendation on whether to re-OCR certain volumes.
#21
mnaydan
opened
5 months ago
0
I want a character normalization strategy
#20
mnaydan
opened
5 months ago
0
I want a list of UTF-8 characters in the corpus and their frequencies
#19
mnaydan
closed
5 months ago
1
Investigate line length/word counts to see how far this method gets in detecting poetry
#18
mnaydan
closed
2 months ago
0
revise the corppa filter script to work with the new stabilized unique work IDs for excerpts and articles
#17
mnaydan
closed
1 week ago
1
As an NLP expert, I want to review the text corpus of the PPA test set, the existing preprocessing code, the decision log, and existing literature on OCR quality so that I can offer a recommendation on how the team should preprocess the text corpus moving forward.
#14
jerielizabeth
closed
5 months ago
2
As a researcher, I want the option of applying OCR cleanup rules to the corpus so that my computational analysis will yield more accurate results.
#13
mnaydan
closed
2 months ago
0
As a researcher, I want a list of possible NLP tools to try so that I can investigate their utility for answering our research questions.
#12
mnaydan
closed
5 months ago
6
As a researcher, I want a test set of 20-25 works with definitions so that I can investigate methods for identifying direct and indirect citation.
#11
mnaydan
closed
5 months ago
1
Write a script to get page images for the PPA dataset off the ECCO hard drives and into TigerData
#10
mnaydan
closed
4 months ago
1
Investigate utility of DINOv2 on page images from the test set
#9
mnaydan
closed
5 months ago
2
Functionality to filter/subset PPA full-text corpus by source id
#8
rlskoeser
closed
6 months ago
5
initial setup for corppa python package
#7
rlskoeser
closed
6 months ago
0
As a researcher, I want a set of 20-25 representative texts so that I can conduct controlled experiments on a sample corpus.
#6
mnaydan
closed
5 months ago
6
As a researcher, I want to be able to get the text corpus for a subset of record IDs so that I can conduct textual analysis within particular groups of texts.
#5
mnaydan
closed
6 months ago
5
Reading corpus from jsonl file exported by ppa-django + misc. more
#3
quadrismegistus
closed
1 month ago
1
Develop
#1
quadrismegistus
closed
6 months ago
4
pull page level genre metadata for HT volumes in the for-use-in-schools dataset
#4
mnaydan
closed
2 months ago
2