Closed sfluegel05 closed 1 month ago
Hi @sfluegel05, I have a question about this issue: do we have to implement the above restructuring only for the ChEBI dataset, or for all other datasets too?
This is only for the ChEBI datasets. The other datasets have their own structure. That should be adjusted as well at some point, but that would be a different issue.
A special case for the data splits is the chebi_version_train parameter:
You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).
- If chebi_version_train is set, create and process two datasets (one for chebi_version, one for chebi_version_train).
- Train on the chebi_version_train data, but use the test set from chebi_version.
- For this, derive a chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train.
ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels); the latter should also pass the test for no overlaps.
Most of the functionality for that is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.
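The two requirements above (same test entries with labels restricted to the older version's classes, and no train/test overlap) could look roughly like the sketch below. It is only an illustration: the helper names `restrict_labels` and `remove_overlap` and the list-of-dicts data layout are assumptions made for this example, not the existing implementation.

```python
def restrict_labels(test_entries, train_classes):
    """Keep every test entry, but drop labels that are not in train_classes.

    train_classes stands in for the contents of the classes.txt
    of chebi_version_train (hypothetical representation).
    """
    allowed = set(train_classes)
    return [
        {"id": e["id"], "labels": [lab for lab in e["labels"] if lab in allowed]}
        for e in test_entries
    ]


def remove_overlap(train_entries, test_entries):
    """Drop all training entries whose id also appears in the test set."""
    test_ids = {e["id"] for e in test_entries}
    return [e for e in train_entries if e["id"] not in test_ids]


# Toy data: test set of the newer version, training data of the older version
test_new = [{"id": 1, "labels": ["A", "B"]}, {"id": 2, "labels": ["C"]}]
train_old = [{"id": 1, "labels": ["A"]}, {"id": 3, "labels": ["A"]}]
classes_old = ["A"]

test_for_old = restrict_labels(test_new, classes_old)
train_for_old = remove_overlap(train_old, test_new)
```

Both test sets keep the same ids, only the label lists differ, and the id shared by `train_old` and `test_new` is removed from the training data.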
Status quo
chebi.obo, classes.txt, train/test/val splits (unprocessed SMILES, labels)
data/ChEBIX/chebi_version/raw
data/ChEBIX/chebi_version/processed/encoding
Goal
chebi.obo (raw)
data/chebi_version/raw
data/chebi_version/ChEBIX/processed
data/chebi_version/ChEBIX/processed/encoding
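As a rough sketch of how the goal layout could be derived in code: the function name `goal_dirs`, its parameters, and the use of the plain version number as the folder name are all assumptions for illustration. Only the nesting (version first, then dataset, then processed and encoding) is taken from the paths above.

```python
from pathlib import Path


def goal_dirs(chebi_version, dataset_name, encoding, base="data"):
    """Build the directories of the goal layout (hypothetical helper).

    raw data is shared per ChEBI version; processed data is stored
    per dataset (the ChEBIX part) and per encoding below it.
    """
    version_dir = Path(base) / str(chebi_version)
    processed = version_dir / dataset_name / "processed"
    return {
        "raw": version_dir / "raw",
        "processed": processed,
        "encoding": processed / encoding,
    }


dirs = goal_dirs(231, "ChEBI50", "smiles_token")
```

With this nesting, two datasets built from the same ChEBI version share one raw folder instead of each keeping its own copy of chebi.obo.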
Things to keep in mind (for later implementations)