Release the krakens (RTK) is a task management scripting library focused on retrieving data from online
repositories and on applying a series of annotation steps (segmentation with YALTAi, clean-up, Kraken) while keeping disk
space usage low (through dedicated clean-up functions).
It provides a few main classes which can be used together (see example.py).
This is currently not perfectly optimized: technically, CPU- or network-bound tasks such as downloading could use callbacks to start GPU tasks before they are completely done.
If you want to run the script locally, run pip install -r requirements.txt.
If you want to run the demo files, run quickyaltaiinstall.sh. Models are in the early alpha release.
See HowTo for a nice decision tree on how to build your own script.
See example.py, which uses manifests, keeps the XML and produces TEI files.
It takes a file with a list of manifests to download from IIIF (see manifests.txt) and passes it through a series of commands:
The batch size should be kept small if you want to keep disk usage low, especially if you use DownloadIIIFManifest.
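As a rough illustration of why smaller batches keep disk usage down, the sketch below splits a list of manifest URLs into fixed-size batches, so that only the files of one batch need to exist on disk at a time. The `batchify` helper and the example URLs are hypothetical and are not taken from RTK's code.

```python
from typing import Iterator, List


def batchify(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical manifest URLs, standing in for the lines of manifests.txt
manifests = [f"https://example.org/iiif/{i}/manifest.json" for i in range(5)]

batches = list(batchify(manifests, batch_size=2))
for batch in batches:
    # Download, annotate and clean up this batch before moving on to the next
    print(len(batch), "manifest(s) in this batch")
```

With a batch size of 2, at most two manifests' worth of images would be on disk before a clean-up step runs; a larger value trades disk space for fewer passes.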
Task
A Task is defined by three main functions and one main property. See Task in rtk.task.py.
- ._checked_files is a private property used to pass information about the items that were processed. Its keys are the input values of the Task; the associated value is a boolean indicating whether that input was processed. It should not be accessed externally.
- .check() returns a boolean indicating whether everything was processed. .check() is responsible for filling in the boolean values of ._checked_files.
- ._process(inputs) processes the input files (for example downloading, parsing, annotating).
- @property .output_files provides a list of all items that need to be passed to the next Task.

Task can have custom parameters; check rtk.task.py.
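The contract described in this section can be sketched as a small standalone class. This is not RTK's actual Task implementation: `SketchTask`, `UppercaseTask`, the `process()` wrapper and the `.upper` file suffix are hypothetical names chosen for illustration. Only `_checked_files`, `check()`, `_process(inputs)` and `output_files` come from the description.

```python
import os
import tempfile
from typing import Dict, List


class SketchTask:
    """Hypothetical minimal stand-in for the Task interface described here."""

    def __init__(self, input_files: List[str]):
        self.input_files = list(input_files)
        # Keys are the inputs, values tell whether each one was processed
        self._checked_files: Dict[str, bool] = {}

    def check(self) -> bool:
        raise NotImplementedError

    def _process(self, inputs: List[str]) -> None:
        raise NotImplementedError

    @property
    def output_files(self) -> List[str]:
        raise NotImplementedError

    def process(self) -> bool:
        """Run _process() on the inputs that check() reports as pending."""
        if not self.check():
            pending = [f for f, done in self._checked_files.items() if not done]
            self._process(pending)
        return self.check()


class UppercaseTask(SketchTask):
    """Toy task: writes an upper-cased copy of each input next to it."""

    def _target(self, path: str) -> str:
        return path + ".upper"

    def check(self) -> bool:
        # Fill the boolean values of _checked_files, then summarize them
        self._checked_files = {
            f: os.path.exists(self._target(f)) for f in self.input_files
        }
        return all(self._checked_files.values())

    def _process(self, inputs: List[str]) -> None:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                text = src.read()
            with open(self._target(path), "w", encoding="utf-8") as dst:
                dst.write(text.upper())

    @property
    def output_files(self) -> List[str]:
        # Only processed inputs contribute something for the next task
        return [self._target(f) for f in self.input_files if self._checked_files.get(f)]


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "page.txt")
with open(source, "w", encoding="utf-8") as handle:
    handle.write("kraken")

task = UppercaseTask([source])
done_before = task.check()    # nothing processed yet
done_after = task.process()   # processes pending inputs, then re-checks
outputs = task.output_files   # what a following task would receive
```

In this sketch `check()` is the only place that writes to `_checked_files`, which mirrors the rule that the property should not be touched from outside the task.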