Princeton-CDH ppa-nlp issues

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

1 star, 0 forks

issues

deploy prodigy to production environment

#49 jerielizabeth closed 2 months ago
2
Review guidelines for annotation task

#48 mnaydan closed 2 months ago
0
configure prodigy for SSO

#47 jerielizabeth opened 2 months ago
6
automate page image conversion and resizing

#46 rlskoeser opened 3 months ago
2
Gather images and text for 35 overlap items for CHR paper

#44 jerielizabeth closed 2 months ago
0
Prodigy poetry annotation recipes and data prep

#43 rlskoeser closed 2 months ago
0
Revise filter script with options to include/exclude based on page attributes

#42 rlskoeser closed 2 months ago
1
move util functions in scripts/helper.py to corppa utils

#41 rlskoeser closed 1 month ago
0
Added helper function for stub directory logic

#40 laurejt closed 3 months ago
0
Code and Prodigy configuration for poem annotation task

#39 rlskoeser closed 2 months ago
6
review Hale script to prepare to re-OCR Gale material using Google Vision API

#38 rlskoeser closed 2 months ago
0
mark single author/single poems in PPA metadata

#37 jerielizabeth closed 1 month ago
1
transfer ECCO data from hard drives to SSD

#36 jerielizabeth closed 1 month ago
1
Get Laure the new data export with ESTCs from ECCO-TCP

#35 jerielizabeth closed 3 months ago
1
Outline CHR paper for OCR

#34 jerielizabeth closed 3 months ago
1
Sketch out the NLP experimental steps for PPA

#33 jerielizabeth closed 3 months ago
0
Release/0.1

#32 laurejt closed 4 months ago
0
Feature/scripts update

#31 laurejt closed 4 months ago
2
Scripts update & cleanup

#30 laurejt closed 4 months ago
1
Gather new poem-focused PPA test set in a spreadsheet

#29 mnaydan closed 4 months ago
1
Write a script to aggregate the results

#27 mnaydan closed 2 months ago
0
Get an image and text aligned dataset for the new test set

#26 mnaydan closed 4 months ago
1
Initial commit. Note that shared utility methods have been moved.

#25 laurejt closed 4 months ago
0
Augment PPA Character Stats with document frequency

#24 mnaydan closed 4 months ago
0
Create PPA page-level OCR quality evaluation

#23 mnaydan closed 4 months ago
1
Take John Foley's code out for a spin

#22 mnaydan closed 2 months ago
0
As an NLP expert, I want to assess the OCR quality of the pages in the test set so that I can offer a data-based recommendation on whether to re-OCR certain volumes.

#21 mnaydan opened 5 months ago
0
I want a character normalization strategy

#20 mnaydan opened 5 months ago
0
I want a list of UTF-8 characters in the corpus and their frequencies

#19 mnaydan closed 5 months ago
1
Investigate line length/word counts to see how far this method gets in detecting poetry

#18 mnaydan closed 2 months ago
0
revise the corppa filter script to work with the new stabilized unique work IDs for excerpts and articles

#17 mnaydan closed 1 week ago
1
As an NLP expert, I want to review the text corpus of the PPA test set, the existing preprocessing code, the decision log, and existing literature on OCR quality so that I can offer a recommendation on how the team should preprocess the text corpus moving forward.

#14 jerielizabeth closed 5 months ago
2
As a researcher, I want the option of applying OCR cleanup rules to the corpus so that my computational analysis will yield more accurate results.

#13 mnaydan closed 2 months ago
0
As a researcher, I want a list of possible NLP tools to try so that I can investigate their utility for answering our research questions.

#12 mnaydan closed 5 months ago
6
As a researcher, I want a test set of 20-25 works with definitions so that I can investigate methods for identifying direct and indirect citation.

#11 mnaydan closed 5 months ago
1
Write a script to get page images for the PPA dataset off the ECCO hard drives and into TigerData

#10 mnaydan closed 4 months ago
1
Investigate utility of DINOv2 on page images from the test set

#9 mnaydan closed 5 months ago
2
Functionality to filter/subset PPA full-text corpus by source id

#8 rlskoeser closed 6 months ago
5
initial setup for corppa python package

#7 rlskoeser closed 6 months ago
0
As a researcher, I want a set of 20-25 representative texts so that I can conduct controlled experiments on a sample corpus.

#6 mnaydan closed 5 months ago
6
As a researcher, I want to be able to get the text corpus for a subset of record IDs so that I can conduct textual analysis within particular groups of texts.

#5 mnaydan closed 6 months ago
5
Reading corpus from jsonl file exported by ppa-django + misc. more

#3 quadrismegistus closed 1 month ago
1
Develop

#1 quadrismegistus closed 6 months ago
4
pull page level genre metadata for HT volumes in the for-use-in-schools dataset

#4 mnaydan closed 2 months ago
2