ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Data migration / setting fixed splits #34

Closed sfluegel05 closed 1 month ago

sfluegel05 commented 2 months ago

Problem

In issue #10, we introduced a new file structure for the ChEBI datasets. To help users transfer their data into this new structure, we need a migration script that automates this step. For the most part, this should be relatively easy: taking files from one directory and copying them to another. The splits are a bit more difficult. If we want users to be able to keep their current splits, this requires a new feature: setting data splits based on a list of IDs. This would also have the advantage that we can circumvent the performance issue (#32) by saving the configuration of the current split as a list of IDs and reloading the splits from this list. (This might look like a step back, but importantly, we do not save the splits as separate files. The standard method of creating splits via a seed stays intact.)
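To make the idea concrete, here is a minimal sketch of how seed-based splitting and ID-based reloading could coexist. It is not the chebai implementation: the function names (`split_by_seed`, `save_split_ids`, `load_split_ids`, `apply_split`), the JSON file format, the 80/10/10 fractions, and the assumption that each record is a dict with an `"id"` key are all hypothetical and only serve as illustration.

```python
import json
import random
from pathlib import Path


def split_by_seed(ids, seed, fractions=(0.8, 0.1, 0.1)):
    """Create a train/validation/test split from a seed (hypothetical helper)."""
    shuffled = sorted(ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def save_split_ids(splits, path):
    """Store only the ID lists of the split, not the data itself."""
    Path(path).write_text(json.dumps(splits))


def load_split_ids(path):
    """Reload the ID lists written by save_split_ids."""
    return json.loads(Path(path).read_text())


def apply_split(records, splits):
    """Partition records (dicts with an "id" key) according to the ID lists."""
    split_of = {i: name for name, ids in splits.items() for i in ids}
    parts = {name: [] for name in splits}
    for rec in records:
        name = split_of.get(rec["id"])
        if name is not None:  # records whose ID is in no list are skipped
            parts[name].append(rec)
    return parts
```

In this sketch, a migration script would call `split_by_seed` once (or read the IDs from a user's existing split files), store the result with `save_split_ids`, and later runs would use `load_split_ids` and `apply_split` instead of recomputing the split. Only the ID lists are stored, so the data is not duplicated into separate split files.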

Solution

The behaviour in the end should be: