Hi @kkovary,
Thank you for your question!
As with scikit-learn, you can get a validation set by splitting the train set a second time. For example:
import datamol as dm
from splito import ScaffoldSplit
# Load some data
data = dm.data.chembl_drugs()
all_smiles = data["smiles"].tolist()
# Generate the trainval-test split
splitter = ScaffoldSplit(smiles=all_smiles, test_size=0.2)
trainval_idx, test_idx = next(splitter.split(X=all_smiles))
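# Generate the train-val split
# all_smiles is a plain list, so it cannot be indexed with an index array directly
trainval_smiles = [all_smiles[i] for i in trainval_idx]
splitter = ScaffoldSplit(smiles=trainval_smiles, test_size=0.2)
# train_idx and val_idx index into trainval_smiles, not all_smiles
train_idx, val_idx = next(splitter.split(X=trainval_smiles))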
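Two things to watch with the nested split above. First, the second `test_size` is a fraction of the train+val subset, so 0.2 followed by 0.2 gives roughly 64% train, 16% validation and 20% test. Second, the indices from the second split refer to positions in `trainval_smiles`, not in `all_smiles`. Here is a minimal sketch of two helpers for this. Both `second_split_size` and `to_original_indices` are hypothetical names and are not part of splito, and the indices in the toy example are hand-written rather than real splitter output:

```python
def second_split_size(val_size, test_size):
    """Fraction to pass to the second split so that the validation set
    ends up being `val_size` of the full dataset."""
    if not 0 < val_size < 1 - test_size:
        raise ValueError("val_size must be positive and smaller than 1 - test_size")
    return val_size / (1 - test_size)


def to_original_indices(trainval_idx, train_idx, val_idx):
    """Map indices relative to the train+val subset back to the full dataset."""
    trainval_idx = list(trainval_idx)
    train_orig = [trainval_idx[i] for i in train_idx]
    val_orig = [trainval_idx[i] for i in val_idx]
    return train_orig, val_orig


# Toy example: 12 items, 4 held out for test, 8 left for train+val
trainval_idx = [0, 2, 3, 5, 7, 8, 9, 11]  # positions in the full dataset
train_idx = [0, 1, 3, 4, 6, 7]            # positions in the train+val subset
val_idx = [2, 5]                          # positions in the train+val subset

train_orig, val_orig = to_original_indices(trainval_idx, train_idx, val_idx)

# For a 10% validation set overall with a 20% test set,
# the second split needs a test_size of about 0.125
inner_size = second_split_size(val_size=0.1, test_size=0.2)
```

With the real splitter, you would pass the arrays returned by the two `split` calls to `to_original_indices` and use the results to index the original data.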
I do agree that with our current setup, this is a bit verbose. Having something like #8 would probably help to make this easier!
Let me know if that helps!
Thanks so much!
As far as I can tell, the splitters will only do train/test splits. It would be really useful to allow for a third validation split.