In issue #10 we introduced a new file structure for the ChEBI datasets. To help users transfer their data into this new structure, we need a migration script that automates this step.
For the most part, this should be relatively easy: files are taken from one directory and copied to another. The splits are more difficult. If we want users to be able to keep using their current splits, we need a new feature: setting data splits based on a list of IDs.
This feature would also let us circumvent the performance issue (#32) by saving the configuration of the current split as a list of IDs and reloading the split from that list. (This might look like a step back, but importantly, we do not save the splits as separate files. The standard method of creating splits via a seed stays intact.)
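A minimal sketch of the idea, not the actual implementation: the function names, the `id`/`split` column names and the 80/10/10 fractions below are assumptions made for illustration. It shows a seed-based split being created, written out as a single list of IDs, and read back without recomputing anything.

```python
import csv
import random


def create_split(ids, seed=42, fractions=(0.8, 0.1, 0.1)):
    """Seed-based split: shuffle the IDs deterministically, then cut into three parts.

    The fractions are hypothetical defaults, not the project's real ones.
    """
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    assignment = {}
    for position, chebi_id in enumerate(shuffled):
        if position < n_train:
            assignment[chebi_id] = "train"
        elif position < n_train + n_val:
            assignment[chebi_id] = "validation"
        else:
            assignment[chebi_id] = "test"
    return assignment


def save_split(assignment, path):
    """Write the assignment as one CSV file with the columns id and split."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "split"])
        for chebi_id, split in sorted(assignment.items()):
            writer.writerow([chebi_id, split])


def load_split(path):
    """Read an assignment back; assumes integer IDs in the id column."""
    with open(path, newline="") as f:
        return {int(row["id"]): row["split"] for row in csv.DictReader(f)}


def get_split(ids, split_csv=None, save_to=None, seed=42):
    """Use the provided split file if given, otherwise create a split from the seed."""
    if split_csv is not None:
        return load_split(split_csv)
    assignment = create_split(ids, seed=seed)
    if save_to is not None:
        save_split(assignment, save_to)
    return assignment
```

Calling `get_split(ids, save_to="splits.csv")` once and `get_split(ids, split_csv="splits.csv")` afterwards would return the same assignment; the data itself still lives in a single file, and only the ID-to-split mapping is stored.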
Solution
The behaviour in the end should be:
When initialising a dataset, the user has the option to provide the path to a CSV file that contains a list of ChEBI IDs and their assignment to a split (train, validation or test). Instead of creating a new split, the provided split will then be used.
When initialising the dataset without providing such a file, the splits will be created automatically (as before) and the resulting split will be saved as a CSV file.
When running the migration script, the ChEBI data files will be copied into the new structure. The existing split files will be combined into one file, and a CSV file with the split assignment will be created in addition.
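The split-handling part of the migration could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the old split files are taken to be CSV files named `train.csv`, `validation.csv` and `test.csv` with an `id` column, and the output names `data.csv` and `splits.csv` are hypothetical. The real file formats and directory layout from issue #10 may differ.

```python
import csv
from pathlib import Path

SPLITS = ("train", "validation", "test")


def migrate_splits(old_dir, new_dir):
    """Combine per-split files into one data file plus one split-assignment file.

    Assumes (hypothetically) that each old split is a CSV file with an id column.
    Returns the number of migrated rows.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    new_dir.mkdir(parents=True, exist_ok=True)

    rows, assignment, fieldnames = [], [], None
    for split in SPLITS:
        with open(old_dir / f"{split}.csv", newline="") as f:
            reader = csv.DictReader(f)
            # Take the column layout from the first split file that is read.
            fieldnames = fieldnames or reader.fieldnames
            for row in reader:
                rows.append(row)
                assignment.append({"id": row["id"], "split": split})

    # One combined data file instead of three separate split files.
    with open(new_dir / "data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # The previous split is preserved as an ID-to-split mapping.
    with open(new_dir / "splits.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "split"])
        writer.writeheader()
        writer.writerows(assignment)

    return len(rows)
```

The resulting `splits.csv` has the same shape as the file a user would pass when initialising a dataset, so a migrated dataset would continue with its previous split.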