datamol-io / splito

Machine Learning dataset splitting for life sciences.
https://splito-docs.datamol.io/
Apache License 2.0

Support train/test/validation splitting #11

Closed kkovary closed 6 months ago

kkovary commented 6 months ago

As far as I can tell, the splitters will only do train/test splits. It would be really useful to allow for a third validation split.

cwognum commented 6 months ago

Hi @kkovary,

Thank you for your question!

As with scikit-learn, you can get a validation set by splitting the train set a second time. For example:

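import datamol as dm
from splito import ScaffoldSplit

# Load some data
data = dm.data.chembl_drugs()
all_smiles = data["smiles"].tolist()

# Generate the trainval-test split (roughly 80% / 20%)
splitter = ScaffoldSplit(smiles=all_smiles, test_size=0.2)
trainval_idx, test_idx = next(splitter.split(X=all_smiles))

# Generate the train-val split.
# all_smiles is a plain list, so select the train+val items one index at a time.
trainval_smiles = [all_smiles[i] for i in trainval_idx]
splitter = ScaffoldSplit(smiles=trainval_smiles, test_size=0.2)
train_idx, val_idx = next(splitter.split(X=trainval_smiles))

# Note: train_idx and val_idx index into trainval_smiles, not all_smiles.
# Two successive 0.2 splits give roughly 64% train, 16% val, 20% test.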

I do agree that with our current setup, this is a bit verbose. Having something like #8 would probably help to make this easier!
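To illustrate the index bookkeeping of the two-stage approach, here is a minimal sketch of a wrapper that splits twice and maps the inner indices back to the original data. It does not use splito's API: `random_split` is a hypothetical stand-in for any splitter that returns `(train_idx, test_idx)`, and `train_val_test_split` is an illustrative helper name, not something the library provides.

```python
import random


def random_split(items, test_size, seed=0):
    """Hypothetical stand-in for a splitter: returns (train_idx, test_idx)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    return indices[n_test:], indices[:n_test]


def train_val_test_split(items, split_fn, test_size=0.2, val_size=0.2):
    """Split twice and express all three index sets relative to `items`."""
    # First split: hold out the test set.
    trainval_idx, test_idx = split_fn(items, test_size)
    trainval_items = [items[i] for i in trainval_idx]

    # Second split: these indices are relative to trainval_items.
    rel_train, rel_val = split_fn(trainval_items, val_size)

    # Map the relative indices back to positions in the original list.
    train_idx = [trainval_idx[i] for i in rel_train]
    val_idx = [trainval_idx[i] for i in rel_val]
    return train_idx, val_idx, test_idx


items = [f"mol_{i}" for i in range(100)]
train_idx, val_idx, test_idx = train_val_test_split(items, random_split)
print(len(train_idx), len(val_idx), len(test_idx))  # 64 16 20
```

Note that `val_size` is a fraction of the train+val portion, not of the full dataset, which is why two 0.2 splits on 100 items yield 64/16/20.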

Let me know if that helps!

kkovary commented 6 months ago

Thanks so much!