This repository provides tooling for processing VOC inventories to
For both cases, two scripts exist respectively:
TANAP category, as defined in the Tanap class in label.py.
There are two separate tasks defined in this repository:
For each task, there is a script to train a model:
See below for instructions on installing prerequisites and running the scripts.
Both produce a model file; run either script with the --help argument for its specific arguments.
In order to apply a model as produced by the respective training script, call
As above, run any of the scripts with the --help argument to get the specific usage.
curl -sSL https://install.python-poetry.org | python3 -
Or:
pipx install poetry
Also see the Poetry documentation.
poetry install
To train a model, run the scripts/train_model.py script.
It downloads the necessary data from the HUC server into the local temporary directory.
Set your HUC credentials in the HUC_USER and HUC_PASSWORD environment variables or in settings.py, and run the script.
HUC_USER=... HUC_PASSWORD=... poetry run python scripts/train_model.py
Without the credentials, the script is not able to download the inventories, but can proceed with previously downloaded ones.
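The credential handling described above can be pictured with a small sketch. This is not the repository's code: the function name huc_credentials is hypothetical, and the sketch only covers the environment variables, not the settings.py fallback.

```python
import os
from typing import Mapping, Optional, Tuple


def huc_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (user, password) if both variables are set and non-empty, else None.

    Hypothetical helper for illustration; the actual scripts may resolve
    credentials differently (for example via settings.py).
    """
    user = env.get("HUC_USER")
    password = env.get("HUC_PASSWORD")
    if user and password:
        return user, password
    # Without credentials, only previously downloaded inventories can be used.
    return None
```

A caller would check for None and skip the download step in that case, continuing with whatever inventories are already present locally.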
Add the --help flag to see all available options.
To extract the documents of one or more inventories using a previously trained model, use the scripts/predict_inventories.py script, for instance:
poetry run python scripts/predict_inventories.py --model model.pt --inventory 1547,1548 --output 1547_1548.csv
Missing inventories are downloaded from the HUC server if the HUC_USER and HUC_PASSWORD environment variables are provided.
Add the --help flag to see all available options.
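As an illustration of how a comma-separated value such as 1547,1548 can be turned into a list of inventories, here is a minimal argparse sketch. It mirrors the three options used in the example command, but the parser itself and the inventory_list helper are assumptions, not the actual implementation of scripts/predict_inventories.py; run that script with --help for the authoritative options.

```python
import argparse
from typing import List


def inventory_list(value: str) -> List[str]:
    """Split a comma-separated argument such as "1547,1548" into a list.

    Identifiers are kept as strings, on the assumption that inventory
    numbers need not be purely numeric.
    """
    items = [item.strip() for item in value.split(",")]
    if not all(items):
        raise argparse.ArgumentTypeError(f"empty inventory number in {value!r}")
    return items


# Hypothetical parser covering only the options shown in the example command.
parser = argparse.ArgumentParser(description="Sketch of a prediction CLI")
parser.add_argument("--model", required=True)
parser.add_argument("--inventory", type=inventory_list, required=True)
parser.add_argument("--output", required=True)

args = parser.parse_args(
    ["--model", "model.pt", "--inventory", "1547,1548", "--output", "1547_1548.csv"]
)
```

Using a type function keeps validation in the parser, so a malformed value is reported as a usage error before any model is loaded.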
This project uses
poetry install --with=dev
poetry run pre-commit install
poetry run pytest
Both document segmentation and classification are based on page embeddings (defined in the PageEmbedding class) and region embeddings (defined in the RegionEmbedding class).
The models are implemented in the PageSequenceTagger and DocumentClassifier classes, which are used for document boundary detection and document type classification respectively.
Both are subclasses of the AbstractPageLearner class (see diagram below).
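The relationship between the two models can be sketched as follows. Only the three class names come from this repository; the predict method, its signature, and the label strings are placeholders invented for illustration, and the real classes are trained models rather than fixed rules.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

Embedding = Sequence[float]  # stand-in for a page embedding vector


class AbstractPageLearner(ABC):
    """Shared interface: consume a sequence of page embeddings, return labels."""

    @abstractmethod
    def predict(self, pages: Sequence[Embedding]) -> List[str]:
        ...


class PageSequenceTagger(AbstractPageLearner):
    """Boundary detection: one label per page."""

    def predict(self, pages: Sequence[Embedding]) -> List[str]:
        # Placeholder rule standing in for a trained model:
        # the first page starts a document, the rest continue it.
        return ["BEGIN" if i == 0 else "IN" for i in range(len(pages))]


class DocumentClassifier(AbstractPageLearner):
    """Type classification: one label for the whole page sequence."""

    def predict(self, pages: Sequence[Embedding]) -> List[str]:
        # Placeholder: a real model would derive the category from the pages.
        return ["UNKNOWN"]
```

The point of the sketch is the difference in output shape: the tagger labels every page, whereas the classifier assigns a single label to a sequence of pages.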
The Inventory class is the main data class.
It holds sequences of pages and labels. The Document class inherits from it in order to use different labels.
The Sheet class and its subclasses are used for reading and processing the annotated data from CSV/Excel sheets as stored in the annotations directory.
(Hyper-)parameters like layer sizes and language model are defined in settings.py.
Run this command to update the class diagram:
poetry run pyreverse --output svg --colorized document_segmentation