JacksonBurns / astartes

Better Data Splits for Machine Learning
https://jacksonburns.github.io/astartes/
MIT License

[QUESTION]: It appears that scaffold splits always return the same splits #163

Closed PatWalters closed 10 months ago

PatWalters commented 11 months ago

When I use the code below, I get the same splits regardless of random_state:

```python
i = 0
res = train_test_split_molecules(molecules=df.SMILES.values, return_indices=True, sampler='scaffold', random_state=i)
```

It might be useful to adopt the convention used by ChemProp, which enables different scaffold splits based on random_state:

https://github.com/chemprop/chemprop/blob/master/chemprop/data/scaffold.py
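
A minimal sketch of that convention (a hypothetical helper, not Chemprop's actual code): group molecules by Murcko scaffold, shuffle the scaffold groups with the seed, then fill the splits.

```python
import random
from collections import defaultdict

from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles


def seeded_scaffold_split(smiles, test_fraction=0.2, random_state=0):
    # Group molecule indices by their Murcko scaffold.
    scaffold_to_indices = defaultdict(list)
    for i, smi in enumerate(smiles):
        scaffold_to_indices[MurckoScaffoldSmiles(smiles=smi)].append(i)

    # Shuffle the scaffold groups; a different seed gives a different split.
    groups = list(scaffold_to_indices.values())
    random.Random(random_state).shuffle(groups)

    # Fill the test set first, then send the remaining scaffolds to train.
    n_test = int(test_fraction * len(smiles))
    test_idx, train_idx = [], []
    for group in groups:
        (test_idx if len(test_idx) < n_test else train_idx).extend(group)
    return train_idx, test_idx
```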

PatWalters commented 11 months ago

A related follow-up: I've been doing some experiments, and it appears that random, kmeans, and optisim respect the random_state variable, but scaffold, dbscan, and sphere_exclusion do not. As such, the first three would be useful for examining multiple cross-validation folds, but the latter three would not.

Also, it seems like the behavior of train_test_split is different from the scikit-learn implementation. In sklearn, if random_state is not specified, the split is different on each call. In your implementation, if random_state is unspecified, it's set to a default value, correct?
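
For reference, the sklearn behavior described above looks like this (standard scikit-learn API):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
# random_state=None (the default): a fresh random split on every call
a, b = train_test_split(data)
# fixed seed: the same split on every call
c, d = train_test_split(data, random_state=0)
```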

kspieks commented 11 months ago

Hi @PatWalters! Thanks so much for trying out our package! We appreciate your questions and will try to get to the bottom of this.

Regarding your first question, we agree with you and also like Chemprop's implementation, so our code only slightly modifies theirs (Chemprop fills the training set first, while we fill the testing set first). We use the random seed to shuffle the clusters for all extrapolative sampling methods, so the random state should make a difference. This happens during the call to get_sorted_cluster_counter (https://github.com/JacksonBurns/astartes/blob/main/astartes/main.py#L185), which ultimately runs this line: https://github.com/JacksonBurns/astartes/blob/main/astartes/samplers/abstract_sampler.py#L101

However, if the dataset is composed of only a few large clusters/scaffolds rather than many small ones, I can see how this could cause the behavior you're observing. As a simple conceptual example, if a dataset had only 3 clusters that comprised 50%, 30%, and 20% of the molecules, no amount of shuffling would change that 50% would go to train, 30% would go to val, and 20% would go to test (or however the user requested the split). In this case, the random state would appear to do nothing. One hypothesis is that the default hyperparameters for dbscan and sphere_exclusion are not creating enough small clusters for shuffling to have an impact.
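
To make that concrete, here is a toy demonstration (hypothetical numbers and a simplified fill rule, not astartes' exact algorithm) showing that with only a few large clusters, every shuffle yields the same assignment:

```python
import random

clusters = {"A": 50, "B": 30, "C": 20}  # cluster -> % of the molecules

for seed in range(5):
    order = list(clusters)
    random.Random(seed).shuffle(order)
    capacity = {"test": 20, "val": 30, "train": 50}
    assignment = {}
    for name in order:
        # place each cluster in the first split with room for all of it
        for split in ("test", "val", "train"):
            if capacity[split] >= clusters[name]:
                capacity[split] -= clusters[name]
                assignment[name] = split
                break
    # prints the same assignment for every seed: A->train, B->val, C->test
    print(seed, assignment)
```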

Could you please share the dataset you were testing this with? I'd be happy to investigate this further to make sure our package is giving the expected behavior.

Edit: I now see the notebook you shared with me. Thanks for the organized code! I'll investigate this after the Thanksgiving holiday and get back to you.

kspieks commented 11 months ago

For your second question, yes, that's correct. In the current state of our package, if the random seed is not explicitly specified by the user, we assign it to the same default value. There are pros and cons to this approach. The pro is that it prioritizes reproducibility by default: even if someone is just quickly testing an idea and doesn't think to specify a random seed, they will get the same splits (and thus the same ML model results), which should hopefully enable easier comparison and benchmarking. The con is that users have to explicitly specify n different random states to obtain n different splits, such as for cross-validation.
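
Concretely, that means something like the following for cross-validation (using train_test_split_molecules as in the snippet above; smiles_list is an assumed variable holding the SMILES strings):

```python
from astartes.molecules import train_test_split_molecules

folds = []
for seed in range(5):  # 5 explicit seeds -> 5 different splits
    folds.append(
        train_test_split_molecules(
            molecules=smiles_list,  # assumed: an array of SMILES strings
            sampler="scaffold",
            return_indices=True,
            random_state=seed,
        )
    )
```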

Thanks again for your questions! Please feel encouraged to share additional thoughts and feedback if we can make the package more intuitive.

PatWalters commented 11 months ago

Here's an example:
https://gist.github.com/PatWalters/c64d59c262ba1993d20af815cd5e0b5a

There are more than 30 Murcko scaffolds in the set.
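
(For anyone following along, the scaffold count can be checked with a standard RDKit one-liner, assuming smiles_list holds the SMILES from the gist:)

```python
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

n_scaffolds = len({MurckoScaffoldSmiles(smiles=smi) for smi in smiles_list})
print(n_scaffolds)  # more than 30 for this dataset
```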

JacksonBurns commented 10 months ago

Thanks @PatWalters for the thorough report and @kspieks for jumping in on this.

Kevin has thoroughly covered everything in the original question; the one thing I will add is to reiterate that we do set the random_state by default on every run (42, for the curious) to (1) force reproducibility and (2) encourage users to be very deliberate with their random seeding!

I'm going to review the linked PR and set it to automatically close this issue when merged, since I think it should fix this issue by better defining the behavior. @PatWalters, if the original question hasn't been answered or another one arises, please feel free to re-open this issue or start a new one!

kspieks commented 10 months ago

Hi @PatWalters,

Thanks again for trying our package and identifying this bug, which was affecting the train_test_split() function. train_val_test_split() was working fine, but we accidentally didn't write a unit test for train_test_split(). Our apologies for that. After merging PR #164, both the train_test_split() and train_val_test_split() functions should work as intended, and our unit test coverage is 97%. The new version has been automatically pushed to PyPI and Zenodo, so users should no longer encounter the issue raised here.

I re-ran your notebook with the latest main branch and it seems to give the expected behavior. I double checked that the generated train and test indices are unique. The notebook is attached below.
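
A sanity check of that kind can be as simple as the following (assuming train_idx and test_idx are the index arrays returned with return_indices=True):

```python
train_set, test_set = set(train_idx), set(test_idx)
assert not (train_set & test_set), "train and test indices overlap"
assert len(train_set) == len(train_idx), "duplicate indices in train"
assert len(test_set) == len(test_idx), "duplicate indices in test"
```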

Let us know if you have additional feedback! Our goal is to provide an easy interface to several useful data splitting techniques so ML researchers can evaluate their models in both interpolative and extrapolative settings.

astartes_test_v2-Copy1.ipynb.zip

PatWalters commented 10 months ago

Thanks for the quick response and the fix. I really like the package. From a selfish perspective, I'd love to see more molecule-related splitting strategies:

- Hierarchical splits
- SIMD splits
- MOOD splits
- Neighbor splits (moving the compounds with the fewest neighbors to the test set)

JacksonBurns commented 10 months ago

We've found that sometimes the molecule-oriented splitters can have interesting effects on non-molecule data, so it's hardly selfish! Thanks for these suggestions!