Princeton-CDH / ppa-nlp

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

update baseline Google Vision OCR script with new parameters #63

Closed: mnaydan closed this 1 month ago

mnaydan commented 2 months ago

For consistency, use argparse for the parameters; directory/images can use pathlib objects instead of strings.

For input image selection: I think this script should be agnostic about PPA ids; it should just be run on a directory and create the same structure in the specified output directory. My inclination is to make the path input an nargs option so we can provide a number of directories at once, and then use glob to find images nested anywhere under the specified paths. The current version of the script looks for .jpg; I suggest for now we just look for whatever extension the Gale TIFF images use, and make this configurable later on if we decide to use it for other content.
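A rough sketch of what the argument handling and image discovery described above could look like. The names `build_parser`, `find_images`, `--output-dir`, and `IMAGE_EXT` are hypothetical, and `.tif` is only a placeholder for whatever extension the Gale images actually use; matching the extension case-insensitively is also an assumption.

```python
import argparse
from pathlib import Path

# Assumption: placeholder for whatever extension the Gale TIFF images use.
IMAGE_EXT = ".tif"


def build_parser():
    """Hypothetical parser: one or more input directories, one output directory."""
    parser = argparse.ArgumentParser(
        description="Run OCR on images found under the given directories."
    )
    # nargs="+" accepts one or more directories; type=Path yields pathlib objects
    parser.add_argument(
        "input_dirs", nargs="+", type=Path, help="directories to search for images"
    )
    parser.add_argument(
        "--output-dir", type=Path, required=True, help="where OCR output is written"
    )
    return parser


def find_images(input_dirs, ext=IMAGE_EXT):
    """Yield image files nested anywhere under the given directories."""
    for input_dir in input_dirs:
        # rglob("*") walks the whole tree; the suffix check ignores case
        matches = (
            path
            for path in input_dir.rglob("*")
            if path.is_file() and path.suffix.lower() == ext.lower()
        )
        yield from sorted(matches)
```

With this shape the script never needs to know about PPA ids; it only sees directories and the files under them.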

Script should be set up like the filter script, so you can run it from the command line locally or from the main method via installed package.
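One common way to get both behaviours is a `main` function that takes an optional argument list. This is a sketch of that general pattern, not a copy of the filter script, which is not shown here; the argument names are hypothetical and the OCR step is left out.

```python
import argparse
import sys
from pathlib import Path


def main(argv=None):
    """Hypothetical entry point; returns an exit code instead of exiting."""
    parser = argparse.ArgumentParser(description="Run OCR on image directories.")
    parser.add_argument("input_dirs", nargs="+", type=Path)
    parser.add_argument("--output-dir", type=Path, required=True)
    # argv=None makes argparse fall back to sys.argv[1:], so the same function
    # works from the command line and when called from other code or tests
    args = parser.parse_args(argv)

    missing = [d for d in args.input_dirs if not d.is_dir()]
    if missing:
        names = ", ".join(str(d) for d in missing)
        print(f"Not a directory: {names}", file=sys.stderr)
        return 1

    # The OCR work itself would go here; omitted in this sketch.
    return 0
```

When the module is run directly, the usual guard `if __name__ == "__main__": raise SystemExit(main())` would call this function, and an installed package could point a console-script entry at the same `main`.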

The current script logic is not to regenerate OCR if the output files are detected at the expected location; that seems reasonable to me to keep. We might want an override/regenerate option later on, but let's not add it until we know we need it.
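A minimal sketch of the skip-if-present check combined with mirroring the input structure in the output directory. The helper names and the `.txt`/`.json` output suffixes are assumptions for illustration; the actual script's output files may differ.

```python
from pathlib import Path

# Assumption: one text file and one JSON file per image; the real script may differ.
OUTPUT_SUFFIXES = (".txt", ".json")


def expected_outputs(image, input_root, output_root, suffixes=OUTPUT_SUFFIXES):
    """Return output paths that mirror the image's location under input_root."""
    relative = image.relative_to(input_root)
    return [(output_root / relative).with_suffix(suffix) for suffix in suffixes]


def needs_ocr(image, input_root, output_root):
    """True if any expected output file is missing, so OCR should be run."""
    outputs = expected_outputs(image, input_root, output_root)
    return not all(path.exists() for path in outputs)
```

A later override/regenerate option would only need to bypass `needs_ocr`, which is one reason to keep the check in its own function.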

mnaydan commented 1 month ago

Decision: tidy up the tests/code and write one additional test for the logic Laure was concerned about.