Closed cboulanger closed 2 years ago
Alternatively, instead of accepting non-existent model ids and implicitly creating new directories, a separate command `create_model` could be used that explicitly creates a new directory. It is probably better to raise an error if a non-existent id is used.
Rewrote the proposal according to my last comment so that it no longer does any implicit magic. Instead, creation and downloading need to be done explicitly, and errors should be thrown if model ids do not exist.
In order to be able to use specialized models for different kinds of scholarly citation patterns, we should make the directory containing model data (now `EXparser/Utils`) configurable. The idea is to give such a specialized model a unique name which serves both as a well-known id and as the name of the directory in which the models are stored. Since the model data is directly dependent on the training code, it needs to be versioned. This also makes it possible to run tests comparing the performance of a particular model with the same id but different versions (for example, by running an evaluation comparing the performance of different git branches).

- Model data is stored in `EXparser/Models/<version>/<model_id>`.
- `EXparser/Utils/` is renamed to `EXparser/Models/<version>/default`. The version number is hardcoded in `configs.py` and manually incremented whenever a change is made in the EXparser code that renders the model data backward-incompatible with previous code versions.
- `EXparser/Dataset` needs to be renamed to `EXparser/Datasets/default`. The training material folders do not need to be versioned.
- `docker run ... excite_toolchain create_model <model_id>` is added, which creates a directory `EXparser/Models/<version>/<model_id>` and copies over the non-reproducible model files (if there are any left). It returns a message saying that the user needs to add training material to `EXparser/Datasets/<model_id>` and to run training.
- `docker run ... excite_toolchain exparser <model_id>`: if no model name is supplied, "default" is used as the model id. If the model id does not exist, an error is raised saying that the command `create_model` must be run first.
- `docker run ... excite_toolchain (segmentation|extraction)_training <model_id>` computes the models from the training material in `EXparser/Datasets/<model_id>` and places them into `EXparser/Models/<version>/<model_id>`.
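
The directory layout and the "no implicit creation" rule could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the actual implementation: `VERSION` stands in for the value hardcoded in `configs.py`, and the names `model_dir`, `dataset_dir`, `create_model`, `require_model` and the `root` parameter are hypothetical. Refusing to overwrite an existing model is also an assumption, and copying the non-reproducible model files is left out.

```python
from pathlib import Path

# Assumption: stands in for the version number hardcoded in configs.py.
VERSION = "1"
DEFAULT_MODEL_ID = "default"


def model_dir(root, model_id=DEFAULT_MODEL_ID, version=VERSION):
    """Return EXparser/Models/<version>/<model_id> below root."""
    return Path(root) / "EXparser" / "Models" / version / model_id


def dataset_dir(root, model_id=DEFAULT_MODEL_ID):
    """Return EXparser/Datasets/<model_id> below root (not versioned)."""
    return Path(root) / "EXparser" / "Datasets" / model_id


def create_model(root, model_id):
    """Explicitly create the model directory and return a hint for the user."""
    target = model_dir(root, model_id)
    if target.exists():
        # Assumption: an existing model is never overwritten silently.
        raise FileExistsError(f"Model '{model_id}' already exists at {target}")
    target.mkdir(parents=True)
    return (f"Created {target}. Add training material to "
            f"{dataset_dir(root, model_id)} and run the training commands.")


def require_model(root, model_id=DEFAULT_MODEL_ID):
    """Return the model directory, or fail if it was never created."""
    target = model_dir(root, model_id)
    if not target.is_dir():
        raise FileNotFoundError(
            f"Model '{model_id}' does not exist; run create_model first.")
    return target
```

In this sketch, the `exparser` and training commands would call `require_model` before doing any work, so a mistyped model id fails early instead of producing an empty directory.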
When we have this system in place, an optional storage system can be built upon it. It works with packages that are a ZIP of the training material and model data, stored in a configurable location.
- `docker run ... excite_toolchain download_model <model_id>` is added, which tries to download `/excite-docker/<version>/<model_id>.zip` from a WebDAV server (URL and credentials are supplied as environment variables). If that is successful, the ZIP is extracted and placed into the training and model directories corresponding to the version and model id.
- `docker run ... excite_toolchain upload_model <model_id>` is added, which uploads the training and model data as a ZIP to the WebDAV folder.
- `docker run ... excite_toolchain list_models` is added, which returns a list of the models stored in the given repository that are compatible with the current version.
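
The packaging half of these commands might look like the sketch below. It only covers building and unpacking the ZIP locally and deriving the remote path; the WebDAV transfer itself is left out. The internal archive layout (paths relative to the repository root) is an assumption, `VERSION` stands in for the value hardcoded in `configs.py`, and the names `remote_path`, `package_dirs`, `pack_model` and `unpack_model` are hypothetical.

```python
import zipfile
from pathlib import Path

# Assumption: stands in for the version number hardcoded in configs.py.
VERSION = "1"


def remote_path(model_id, version=VERSION):
    """Path of the package on the WebDAV server, as named in the proposal."""
    return f"/excite-docker/{version}/{model_id}.zip"


def package_dirs(root, model_id, version=VERSION):
    """Directories that make up one package: training material and model data."""
    base = Path(root) / "EXparser"
    return [base / "Datasets" / model_id, base / "Models" / version / model_id]


def pack_model(root, model_id, zip_path, version=VERSION):
    """Write training and model data into one ZIP, with paths relative to root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory in package_dirs(root, model_id, version):
            if not directory.is_dir():
                raise FileNotFoundError(f"Missing directory: {directory}")
            # Only files are stored; empty directories are not preserved.
            for path in sorted(directory.rglob("*")):
                if path.is_file():
                    archive.write(path, path.relative_to(root).as_posix())


def unpack_model(zip_path, root):
    """Extract a package so the files land in the matching directories."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(root)
```

Because the version is part of both the remote path and the archive layout in this sketch, a package built for one code version cannot end up in the model directory of another.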