Goal

Have three preprocessing stages:
- the first stage contains only chebi.obo (raw)
- the second stage contains the data without splits, but with labels attached (processed 1)
- the third stage contains the encoded data, again without splits (processed 2)
Splits are created "on the fly"
- Test that they can be reproduced with some seed (compare hashes)
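
A minimal sketch of how such a reproducibility test could look. It assumes a plain seeded shuffle as a stand-in for the real splitting logic, and the names `make_split` and `split_hash` are hypothetical, not part of the existing code:

```python
import hashlib
import random


def make_split(ids, seed, test_frac=0.15, val_frac=0.15):
    """Split ids into train/validation/test, determined only by the seed.

    Assumption: a simple random split; the real splitter may differ.
    """
    ordered = sorted(ids)      # fix the input order so only the seed matters
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    rng.shuffle(ordered)
    n_test = int(len(ordered) * test_frac)
    n_val = int(len(ordered) * val_frac)
    return {
        "test": ordered[:n_test],
        "validation": ordered[n_test:n_test + n_val],
        "train": ordered[n_test + n_val:],
    }


def split_hash(split):
    """Hash the membership of each split so two runs can be compared cheaply."""
    digest = hashlib.sha256()
    for name in ("train", "validation", "test"):
        digest.update(name.encode())
        for ident in sorted(split[name]):
            digest.update(str(ident).encode())
            digest.update(b",")
    return digest.hexdigest()


# Two runs with the same seed should give identical hashes.
first = split_hash(make_split(range(100), seed=42))
second = split_hash(make_split(range(100), seed=42))
```

Comparing hashes instead of the splits themselves keeps the test cheap and means no split files have to be stored.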
The file structure should reflect this:
- Current file paths are data/ChEBIX/chebi_version/raw and data/ChEBIX/chebi_version/processed/encoding
- Instead, each path should only include the parameters that matter for that step:
  - raw: data/chebi_version/raw
  - processed 1: data/chebi_version/ChEBIX/processed
  - processed 2: data/chebi_version/ChEBIX/processed/encoding
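
The proposed layout could be derived roughly as follows. This is only a sketch: the function `stage_dirs`, its parameters, and the `chebi_v<version>` directory naming are assumptions for illustration, not the existing API:

```python
from pathlib import Path


def stage_dirs(base, chebi_version, dataset_name, encoding):
    """Return one directory per preprocessing stage.

    Each path only depends on the parameters relevant for that stage.
    Assumption: the version directory is named "chebi_v<version>".
    """
    version_dir = Path(base) / f"chebi_v{chebi_version}"
    processed = version_dir / dataset_name / "processed"
    return {
        "raw": version_dir / "raw",         # shared by all datasets of this version
        "processed": processed,             # labels attached, no split
        "encoded": processed / encoding,    # encoded data, no split
    }


dirs = stage_dirs("data", 231, "ChEBI50", "smiles")
```

With this layout, two datasets built from the same ChEBI version share a single raw directory instead of each downloading chebi.obo again.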

A special case for the data splits is the chebi_version_train parameter:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

[x] if chebi_version_train is set, create and process two datasets (one for the chebi_version, one for chebi_version_train)
[x] when creating splits, build the training and validation splits based on the chebi_version_train data, but using the test set from chebi_version
[x] build the test set as an adaptation of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
[x] test the implementation: ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels); the latter should also pass the test for no overlaps
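
The label restriction and the overlap check from the tasks above can be sketched on toy data. The helper names and the dict-of-lists data layout are assumptions for illustration; the actual datasets store more than ids and labels:

```python
def restrict_labels(test_entries, train_classes):
    """Keep every test entry, but drop labels unknown to the training version."""
    allowed = set(train_classes)
    return {
        ident: [label for label in labels if label in allowed]
        for ident, labels in test_entries.items()
    }


def remove_test_ids(entries, test_ids):
    """Drop all entries whose id appears in the test set."""
    test_ids = set(test_ids)
    return {
        ident: labels for ident, labels in entries.items() if ident not in test_ids
    }


# Toy stand-ins: test set of chebi_version, classes and data of chebi_version_train
test_new = {"CHEBI:1": ["A", "B"], "CHEBI:2": ["C"]}
classes_old = ["A", "C"]  # hypothetical content of classes.txt
entries_old = {"CHEBI:1": ["A"], "CHEBI:3": ["C"], "CHEBI:4": ["A", "C"]}

test_for_old = restrict_labels(test_new, classes_old)
train_val_pool = remove_test_ids(entries_old, test_for_old)
```

Here `test_for_old` has the same ids as `test_new` with label "B" removed, and `train_val_pool` no longer contains "CHEBI:1", so training and validation splits drawn from it cannot overlap with the test set.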

Most of the functionality for this is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.