[x] update script to add and use input parameters for:
[x] path / selection for images to process *
[x] output base directory
[x] optional max number of images to process
[x] add configuration in pyproject.toml to configure the script to be installed when the package is installed
[x] add google cloud python package to pyproject dependencies (maybe a new optional dependency group?)
[x] identify if there are any methods in the script that can be usefully unit tested and write tests
[x] document how to run the script
For consistency, use argparse for the parameters; directory/images can use pathlib objects instead of strings.
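A minimal sketch of such a parser, covering the three parameters in the checklist. The option names (`input_dirs`, `--out`, `--max-images`) and the `build_parser` helper are hypothetical and not taken from the actual script.

```python
import argparse
import pathlib


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser (all option names are illustrative)."""
    parser = argparse.ArgumentParser(description="Run OCR on page images.")
    # One or more input directories; argparse converts each to a pathlib.Path.
    parser.add_argument(
        "input_dirs",
        nargs="+",
        type=pathlib.Path,
        help="directories to search for images",
    )
    # Base directory where output is written.
    parser.add_argument(
        "--out",
        type=pathlib.Path,
        required=True,
        help="output base directory",
    )
    # Optional cap on the number of images; None means no limit.
    parser.add_argument(
        "--max-images",
        type=int,
        default=None,
        help="maximum number of images to process",
    )
    return parser
```

Passing `type=pathlib.Path` lets the rest of the script work with path objects directly instead of converting strings after parsing.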
For input image selection: I think this script should be agnostic about ppa ids and should just be run from a directory and create the same structure in the specified output directory. My inclination is to make the path input an n-args option so we can provide a number of directories at once, and then we use glob to find images nested anywhere under the specified paths. The current version of the script looks for .jpg; I suggest for now we just look for whatever extension the Gale TIFF images use, and make this configurable later on if we decide to use it for other content.
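The directory-walking idea above could be sketched roughly as follows. The `find_images` helper is hypothetical, the `.tif` extension is a placeholder assumption (the extension the Gale images actually use is not confirmed here), and mirroring paths relative to each input directory is one possible reading of "the same structure".

```python
import itertools
import pathlib
from typing import Iterable, Iterator, Optional, Tuple

# ASSUMPTION: placeholder extension; swap in whatever the Gale images actually use.
IMAGE_EXT = ".tif"


def find_images(
    input_dirs: Iterable[pathlib.Path],
    out_dir: pathlib.Path,
    ext: str = IMAGE_EXT,
    max_images: Optional[int] = None,
) -> Iterator[Tuple[pathlib.Path, pathlib.Path]]:
    """Yield (image, mirrored output path) pairs, stopping after max_images if set."""

    def pairs() -> Iterator[Tuple[pathlib.Path, pathlib.Path]]:
        for input_dir in input_dirs:
            # rglob matches at any depth below input_dir; sort for a stable order.
            for image in sorted(input_dir.rglob(f"*{ext}")):
                # Recreate the image's path relative to its input directory under out_dir.
                yield image, out_dir / image.relative_to(input_dir)

    # islice with a stop of None yields everything.
    return itertools.islice(pairs(), max_images)
```

The mirrored path keeps the image file name; the caller would replace the suffix with whatever the OCR step writes. Because the function is a pure path computation, it is also a natural candidate for a unit test against a temporary directory.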
Script should be set up like the filter script, so you can run it from the command line locally or from the main method via installed package.
The current script logic is not to regenerate OCR if the output files are detected at the expected location; that seems reasonable to me to keep. We might want an override/regenerate option later on, but let's not add it until we know we need it.
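A rough sketch of that skip check, assuming each image produces a fixed set of output files next to a common base path. The `.txt` and `.json` suffixes and both helper names are hypothetical; the real script's output files may differ.

```python
import pathlib
from typing import List, Tuple

# ASSUMPTION: hypothetical output suffixes; the real script may write other files.
OUTPUT_SUFFIXES: Tuple[str, ...] = (".txt", ".json")


def expected_outputs(output_base: pathlib.Path) -> List[pathlib.Path]:
    """Return the output files expected for one image, given its base path."""
    # Append rather than use with_suffix, so dots in the base name are preserved.
    return [output_base.parent / (output_base.name + suffix) for suffix in OUTPUT_SUFFIXES]


def needs_ocr(output_base: pathlib.Path) -> bool:
    """Return True unless every expected output file already exists."""
    return not all(path.exists() for path in expected_outputs(output_base))
```

If a regenerate option is added later, it could simply bypass `needs_ocr`, which keeps the default behavior unchanged.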