sfluegel05 commented 6 months ago

Status quo

Data preprocessing is split into "raw" and "processed", following the predetermined Lightning structure:
- "raw" contains chebi.obo, classes.txt and the train/test/val splits (unprocessed SMILES, labels)
- "processed" contains encoded versions of the train/test/val splits (processed SMILES)

Goal

Have 3 preprocessing stages:
- first stage contains only chebi.obo (raw)
- second stage contains the data without splits, but with labels attached (processed 1)
- third stage contains the encoded data, again without splits (processed 2)
Splits are created "on the fly"
- Test that they can be reproduced with some seed (compare hashes)
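
A minimal sketch of what "on the fly" splits with a hash check could look like. It uses only the standard library; `split_ids`, `split_hash` and the split fractions are hypothetical names and values for illustration, not existing chebai functions, and it ignores label stratification.

```python
import hashlib
import random


def split_ids(ids, seed, test_frac=0.1, val_frac=0.1):
    """Split identifiers into train/val/test deterministically for a given seed."""
    ordered = sorted(ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(ordered)
    n_test = int(len(ordered) * test_frac)
    n_val = int(len(ordered) * val_frac)
    return {
        "test": ordered[:n_test],
        "val": ordered[n_test:n_test + n_val],
        "train": ordered[n_test + n_val:],
    }


def split_hash(split):
    """Hash a split so two runs can be compared without storing split files."""
    digest = hashlib.sha256()
    for name in ("train", "val", "test"):
        digest.update(name.encode())
        for ident in sorted(split[name]):
            digest.update(str(ident).encode())
            digest.update(b"\n")
    return digest.hexdigest()
```

A reproducibility test would then generate the split twice with the same seed and assert that both hashes are equal.
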
The file structure should represent this:
- Current file paths are data/ChEBIX/chebi_version/raw and data/ChEBIX/chebi_version/processed/encoding
- Instead, only take the parameters that are important for each step:
  - raw: data/chebi_version/raw
  - processed 1: data/chebi_version/ChEBIX/processed
  - processed 2: data/chebi_version/ChEBIX/processed/encoding
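
As an illustration of the proposed layout, the three directories could be derived from only the parameters each stage depends on. This is a sketch: `stage_dirs`, the `chebi_v{version}` directory naming and the example encoding name are assumptions, not the actual chebai implementation.

```python
from pathlib import Path


def stage_dirs(base, chebi_version, dataset_name, encoding):
    """Build the directory for each preprocessing stage.

    raw depends only on the ChEBI version, processed 1 additionally on the
    dataset (ChEBIX), processed 2 additionally on the encoding.
    """
    # Naming the version directory "chebi_v<version>" is an assumption.
    version_dir = Path(base) / f"chebi_v{chebi_version}"
    processed = version_dir / dataset_name / "processed"
    return {
        "raw": version_dir / "raw",
        "processed_1": processed,
        "processed_2": processed / encoding,
    }


# Hypothetical example values for the dataset and encoding names.
dirs = stage_dirs("data", 231, "ChEBIOver50", "smiles_token")
```

With this layout, two datasets built from the same ChEBI version share one raw directory instead of each keeping their own copy of chebi.obo.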

Things to keep in mind (for later implementations)

- How can this work with cross-validation? -> it should be possible to get the same test set with different train/val splits
- How to handle different versions of ChEBI and combinations of different training / test sets? -> currently, this is handled via additional files, but it should also be possible dynamically
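
One way the cross-validation requirement could be met is to draw the test set with its own seed and build the folds from the remaining entries with a second seed. The sketch below assumes this two-seed design; `fixed_test_with_folds` is a hypothetical helper, and it does not stratify by label.

```python
import random


def fixed_test_with_folds(ids, test_seed, fold_seed, n_folds=5, test_frac=0.1):
    """Return a test set that depends only on test_seed, plus train/val folds."""
    ordered = sorted(ids)
    random.Random(test_seed).shuffle(ordered)
    n_test = int(len(ordered) * test_frac)
    test = ordered[:n_test]

    # Re-sort the remainder so the folds depend only on fold_seed.
    rest = sorted(ordered[n_test:])
    random.Random(fold_seed).shuffle(rest)

    folds = []
    for k in range(n_folds):
        val = rest[k::n_folds]  # every n_folds-th entry, starting at offset k
        val_set = set(val)
        train = [i for i in rest if i not in val_set]
        folds.append((train, val))
    return test, folds
```

Changing `fold_seed` gives different train/val splits while the test set stays the same as long as `test_seed` is unchanged.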

aditya0by0 commented 3 months ago

Hi @sfluegel05, I have a question regarding this issue. Do we have to implement the above restructuring only for the ChEBI dataset, or for all other datasets too?

sfluegel05 commented 3 months ago

This is only for the ChEBI datasets. The other datasets have their own structure. That should be adjusted as well at some point, but that would be a different issue.

sfluegel05 commented 3 months ago

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

[x] if chebi_version_train is set, create and process two datasets (one for the chebi_version, one for chebi_version_train)
[x] when creating splits, build the training and validation splits based on the chebi_version_train data, but using the test set from chebi_version
[x] build the test set as an adaptation of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
[x] test the implementation: classes ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels). The latter should also pass the test for no overlaps.

Most of the functionality for this is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.
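
A rough sketch of the label adaptation and the overlap removal described in the tasks above. It assumes that each test entry maps an id to a list of boolean labels ordered like the classes.txt of chebi_version; `adapt_test_set` and `remove_test_overlap` are hypothetical helpers, not existing chebai functions.

```python
def adapt_test_set(test_rows, eval_classes, train_classes):
    """Keep all test entries, but only labels that also exist in train_classes.

    test_rows maps id -> list of labels ordered like eval_classes.
    The kept labels stay in eval_classes order (an assumption of this sketch).
    """
    train_set = set(train_classes)
    keep = [i for i, c in enumerate(eval_classes) if c in train_set]
    kept_classes = [eval_classes[i] for i in keep]
    adapted = {
        ident: [labels[i] for i in keep] for ident, labels in test_rows.items()
    }
    return adapted, kept_classes


def remove_test_overlap(train_pool_ids, test_ids):
    """Drop every id from the training pool that is part of the test set."""
    test_set = set(test_ids)
    return [i for i in train_pool_ids if i not in test_set]
```

Both functions work in memory, so no extra split files are needed for a specific chebi_version / chebi_version_train combination.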

ChEB-AI / python-chebai

Data handling needs to be restructured #10
