Allow switching of models with optional remote model repository

cboulanger commented 2 years ago

To be able to use specialized models for different kinds of scholarly citation patterns, we should make the directory containing model data (now EXparser/Utils) configurable. The idea is to give each specialized model a unique name that serves both as a well-known id and as the name of the directory in which its model data is stored. Since the model data depends directly on the training code, it needs to be versioned. Versioning also makes it possible to run tests comparing the performance of a model with the same id but different versions (for example, by running an evaluation that compares different git branches).

[x] Models are stored in EXparser/Models/<version>/<model_id>. EXparser/Utils/ is renamed to EXparser/Models/<version>/default. The version number is hardcoded in configs.py and manually incremented whenever a change to the EXparser code makes the model data incompatible with previous code versions.
[x] Since different models will have different training material (the whole point of having separate models), EXparser/Dataset needs to be renamed to EXparser/Datasets/default. The training material folders do not need to be versioned.
[x] A new command docker run ... excite_toolchain create_model <model_id> is added, which creates a directory EXparser/Models/<version>/<model_id> and copies over the non-reproducible model files (if there are any left). It returns a message saying that the user needs to add training material to EXparser/Datasets/<model_id> and run training.
[x] The model is selected when running the docker commands, such as docker run ... excite_toolchain exparser <model_id>. If no model name is supplied, "default" is used as the model id. If the model id does not exist, an error is raised saying that the command create_model must be run first.
[x] docker run ... excite_toolchain (segmentation|extraction)_training <model_id> computes the models from the training material in EXparser/Datasets/<model_id> and places them into EXparser/Models/<version>/<model_id>.
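The directory layout and the explicit create/select behaviour described in the list above could look roughly like the sketch below. This is only an illustration, not the code in the branch: the function names, the `ModelNotFoundError` class, the `MODEL_VERSION` value and the choice to fail when a model already exists are all assumptions. Copying the non-reproducible model files is left out.

```python
from pathlib import Path

# Assumption: the real version constant lives in configs.py; "1" is a placeholder.
MODEL_VERSION = "1"


class ModelNotFoundError(Exception):
    """Raised when a model id has no directory for the current version."""


def model_dir(root: Path, model_id: str = "default",
              version: str = MODEL_VERSION) -> Path:
    # Model data is versioned: EXparser/Models/<version>/<model_id>
    return root / "EXparser" / "Models" / version / model_id


def dataset_dir(root: Path, model_id: str = "default") -> Path:
    # Training material is not versioned: EXparser/Datasets/<model_id>
    return root / "EXparser" / "Datasets" / model_id


def create_model(root: Path, model_id: str,
                 version: str = MODEL_VERSION) -> str:
    """Explicitly create the model directory and return a hint for the user."""
    target = model_dir(root, model_id, version)
    if target.exists():
        # Assumption: creating an existing model is treated as an error.
        raise FileExistsError(f"Model '{model_id}' already exists at {target}")
    target.mkdir(parents=True)
    return (f"Model '{model_id}' created. Add training material to "
            f"{dataset_dir(root, model_id)} and run training.")


def resolve_model(root: Path, model_id: str = "default",
                  version: str = MODEL_VERSION) -> Path:
    """Return the model directory, or fail without creating anything."""
    target = model_dir(root, model_id, version)
    if not target.is_dir():
        raise ModelNotFoundError(
            f"Model '{model_id}' does not exist for version {version}. "
            "Run create_model first.")
    return target
```

Keeping `resolve_model` free of side effects means a mistyped model id produces an error rather than a new, empty model directory.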

Once this system is in place, an optional storage system can be built on top of it. It works with packages, each a ZIP of the training material and model data, stored in a configurable location.

[x] A new command docker run ... excite_toolchain download_model <model_id> is added, which tries to download /excite-docker/<version>/<model_id>.zip from a WebDAV server (URL and credentials are supplied as environment variables). If that is successful, the ZIP is extracted and its contents are placed into the training and model directories corresponding to the version and model id.
[x] A new command docker run ... excite_toolchain upload_model <model_id> is added, which uploads the training and model data as a ZIP to the WebDAV folder.
[x] A new command docker run ... excite_toolchain list_models is added, which returns a list of models stored at the given repository that are compatible with the current version.
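The packaging side of the storage commands above might be sketched as follows, using only the standard library. The function names and the layout inside the ZIP (paths relative to the project root) are assumptions made for illustration. The actual WebDAV transfer and the names of the environment variables are deliberately not shown, since the proposal does not specify them.

```python
import zipfile
from pathlib import Path


def remote_package_path(version: str, model_id: str) -> str:
    # Path of the package on the server, as given in the proposal.
    return f"/excite-docker/{version}/{model_id}.zip"


def package_url(base_url: str, version: str, model_id: str) -> str:
    # base_url would come from an environment variable (name not specified).
    return base_url.rstrip("/") + remote_package_path(version, model_id)


def pack_model(root: Path, version: str, model_id: str, zip_path: Path) -> None:
    """Zip training material and model data for one model id."""
    sources = [
        root / "EXparser" / "Datasets" / model_id,
        root / "EXparser" / "Models" / version / model_id,
    ]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src in sources:
            if not src.is_dir():
                continue  # e.g. no training material has been added yet
            for file in sorted(src.rglob("*")):
                if file.is_file():
                    # Assumption: archive names are relative to the project root,
                    # so extracting at the root restores both directories.
                    zf.write(file, file.relative_to(root).as_posix())


def unpack_model(zip_path: Path, root: Path) -> None:
    """Extract a downloaded package into the training and model directories."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)
```

Because the version is part of the remote path, list_models only needs to list the contents of /excite-docker/<version>/ to find the packages compatible with the current code.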

cboulanger commented 2 years ago

Alternatively, instead of allowing non-existent model ids to be used and implicitly creating new directories, a separate create_model command could explicitly create the new directory. It is probably better to raise an error if a non-existent id is used.

cboulanger commented 2 years ago

Rewrote the proposal according to my last comment so that it does not do any implicit magic. Instead, creation and downloading need to be done explicitly, and errors should be thrown if model ids do not exist.

cboulanger commented 2 years ago

Done in https://github.com/cboulanger/excite-docker/tree/add_model_storage

cboulanger / excite-docker

Allow switching of models with optional remote model repository #4