aiqm / torchani

Accurate Neural Network Potential on PyTorch
https://aiqm.github.io/torchani/
MIT License

Question about training/validation split & shuffle #578

Closed shubbey closed 3 years ago

shubbey commented 3 years ago

Consider two implementations of the training routine:

training, validation = \
        torchani.data.load('data.h5').subtract_self_energies(...).species_to_indices(...).shuffle().split(0.8, None)

vs

training = torchani.data.load('train.h5').subtract_self_energies(...).species_to_indices(...).shuffle()
validation = torchani.data.load('validate.h5').subtract_self_energies(...).species_to_indices(...)
# where train.h5 is a random sample of 80% of the molecules in data.h5, and validate.h5 is the remaining 20%

The first implementation gives me very good results (it converges to an RMSE of 1.6 kJ/mol over my large training set); the second, much worse (about 10 kJ/mol). The reason I initially went with the second approach is that I thought it would be better to ensure that the training and validation sets were composed of unique molecules. In the first implementation, I don't believe this is the case, because the loader routine (in data/__init__.py) returns conformers of molecules before shuffling and splitting, meaning that two different conformers of the same molecule could be present in the training and validation sets.
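
For illustration, a molecule-level split like the one behind train.h5/validate.h5 could be produced along these lines. This is only a rough h5py sketch; it assumes data.h5 stores one top-level group per molecule, which may not match the actual dataset layout:

    import random
    import h5py

    # Rough sketch: split data.h5 into train.h5/validate.h5 at the molecule
    # level, so no molecule contributes conformers to both files. Assumes one
    # top-level HDF5 group per molecule; adjust to the real layout.
    with h5py.File('data.h5', 'r') as src:
        molecules = list(src.keys())
        random.shuffle(molecules)
        n_train = int(0.8 * len(molecules))
        with h5py.File('train.h5', 'w') as tr, h5py.File('validate.h5', 'w') as va:
            for i, name in enumerate(molecules):
                dst = tr if i < n_train else va
                src.copy(name, dst)  # copies the whole group, i.e. all conformers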

Now it may be the case that training and validating on different conformers of the same molecule is good and intended. When I run this fully trained implementation against a test set of independent molecules, it does very well: the RMSE is worse than on the validation set but much better than with the second implementation.

So I am wondering if I am doing something wrong here or fundamentally misunderstanding how this works. I'm also a little confused as to how the shuffle() works here. Is the data only shuffled one time and then kept in that order each epoch, or is there more to it? Is my second implementation missing something? Thank you for any help you can provide!

farhadrgh commented 3 years ago

What are the arguments in subtract_self_energies(...)? There might be a mix-up in computing the self atomic energies (SAEs) in your second method. It computes the SAEs every time you load the data when given energy_shifter = torchani.utils.EnergyShifter(None). I suggest storing the SAEs in a dict and using that when loading the validation set:

energy_shifter = torchani.utils.EnergyShifter(None)
training = torchani.data.load('train.h5').subtract_self_energies(...).species_to_indices(...).shuffle()
# Transfer training SAEs to other data loaders
sae_dict = {s: e for s, e in zip(species_order, energy_shifter.self_energies.tolist())}
validation = torchani.data.load('validate.h5').subtract_self_energies(sae_dict).species_to_indices(...)

shubbey commented 3 years ago

Thank you for the advice. I initially responded here with some erroneous info. Trying your method now!

Also, would it make sense to shuffle the training data before each epoch? I noticed that it keeps the data in the same order each pass through.
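
For example, something along these lines would give a fresh order each pass (sketch only, assuming the collated batches have been cached in memory; batches and max_epochs are just illustrative names):

    import random

    # Sketch: materialize the collated batches once, then reshuffle the batch
    # order at the start of every epoch. Assumes `training` was built with
    # .collate(batch_size).cache() so it can be iterated repeatedly.
    batches = list(training)
    for epoch in range(max_epochs):
        random.shuffle(batches)
        for properties in batches:
            ...  # forward pass, loss, optimizer step as usual

Note that this only permutes the order of the batches; which conformers share a batch stays fixed after the initial shuffle.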

shubbey commented 3 years ago

Unfortunately, the energy_shifter method isn't the issue here, but it's good to know how it works.

After running several tests, I confirmed that the single biggest factor for network improvement is the mixing of training and validation conformers. That is:

    training, validation = torchani.data.load('full.h5')\
                                        .subtract_self_energies(energy_shifter,species_order)\
                                        .species_to_indices(species_order)\
                                        .cache()\
                                        .split(0.8, None)
    training.shuffle()

== bad result,

    training, validation = torchani.data.load('full.h5')\
                                        .subtract_self_energies(energy_shifter,species_order)\
                                        .species_to_indices(species_order)\
                                        .shuffle()\
                                        .split(0.8, None)

== good result

Note that my dataset, 'full.h5', is pre-shuffled and contains all of the molecule data, with each molecule having ~10-100 conformers. In the first case, the training and validation sets won't share conformers of the same molecule. In the second, there will be a lot of overlap, because the shuffle() will split conformers of the same molecule between the two sets.

The result isn't surprising, since you'd expect the validation set to have a better RMSE if it contains molecules more similar to those in the training set (and "similar" here means different conformers of the same molecule).

However, what puzzles me is that you'd expect this to produce overfitting and not give good results on an independent test set (where all molecules/conformers are unique). Yet it does much better there than the network from the first example, because that network converges quickly to a poor result.

Do you know if this is by design or did I stumble into an involuntary "optimization"? The reason I am curious is that I am also running an NMR network that has similar input (since each NMR datapoint has several AEV vectors from the same molecule), and it may make sense to split these up between training and validation.

Thanks again for all of the help.

zasdfgbnm commented 3 years ago

From the paper https://pubs.rsc.org/en/content/articlelanding/2017/sc/c6sc05720a#!divAbstract

Using this procedure to generate the ANI-1 data set results in molecular energies for a total of ∼17.2 million conformations generated from ∼58k small molecules. For each molecule's individual set of random conformations, 80% is used for training, while 10% is used for each validation and testing of the ANI-1 model.

So I think the procedure actually is that different conformers of the same molecule go into both training and validation? @Jussmith01 to confirm.
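
For concreteness, the per-molecule 80/10/10 split described in that quote could look roughly like this (illustrative only; conformers_by_molecule is a hypothetical dict mapping each molecule to the indices of its conformations):

    import numpy as np

    # Illustrative per-molecule split: each molecule's own conformations are
    # divided 80/10/10 and then pooled into the three sets.
    rng = np.random.default_rng(0)
    train_idx, valid_idx, test_idx = [], [], []
    for molecule, idx in conformers_by_molecule.items():
        idx = rng.permutation(idx)
        n_train, n_valid = int(0.8 * len(idx)), int(0.9 * len(idx))
        train_idx.extend(idx[:n_train])
        valid_idx.extend(idx[n_train:n_valid])
        test_idx.extend(idx[n_valid:])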

shubbey commented 3 years ago

Good find! I would be very interested in the methodology behind this. In my numerous tests it seems that training and validating on conformers of the same molecule definitely leads to much better convergence. Thank you!

IgnacioJPickering commented 3 years ago

@shubbey @zasdfgbnm I can confirm that the current method is to add conformers of the same molecule into both the training and validation sets. I think the potential can only be expected to generalize to a fully disjoint set of molecules if it has been trained on an extremely large training set. By splitting the training and validation conformers, it is possible to end up overfitting to the training conformers, and if the training and validation conformers are very heterogeneous subsets, then the validation error will be large.

Jussmith01 commented 3 years ago

@shubbey Sorry for the late reply. In the original ANI-1 paper there was some probability that a molecule might be excluded from the training set, but in general, there was nothing forcing it. Imagine putting all conformations in the data set in one big bucket, then randomly selecting 80%. There is a chance you could end up not including any conformations from a specific configuration.
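
As a toy illustration of that "one big bucket" selection (molecule_of is a hypothetical NumPy array giving the molecule id of each pooled conformation):

    import numpy as np

    # Pool every conformation, take a random 80%, and count the molecules
    # (configurations) that end up with no conformations in the training set.
    rng = np.random.default_rng(0)
    n_conf = len(molecule_of)
    picked = rng.choice(n_conf, size=int(0.8 * n_conf), replace=False)
    train_mask = np.zeros(n_conf, dtype=bool)
    train_mask[picked] = True
    excluded = set(molecule_of) - set(molecule_of[train_mask])
    print(len(excluded), 'molecules have no conformations in training')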

In any case, @IgnacioJPickering is correct. We have always relied on very large data sets with a lot of configurational diversity to develop models that are transferable in that space. We also validate our models on completely unseen data sets to ensure there is minimal overfitting. This is why we never focus on validation error or held-out test error, but rather on systems that may be more interesting (e.g. large drug molecules) and are not included in the training set. Our work on ANI-1x [https://aip.scitation.org/doi/abs/10.1063/1.5023802] and ANI-2x [https://pubs.acs.org/doi/abs/10.1021/acs.jctc.0c00121] shows the importance of chemical diversity in training and of testing outside of the space of things you train on. See the COMP6 benchmark in those papers. Also, one of our more recent publications [https://www.nature.com/articles/s41467-021-21376-0] takes the idea of configurational diversity to a new level, albeit in the materials world.

As for your specific issue: without knowing more about your data set (size, number of configurations, how configurations were selected, etc.), I do not have much to add. I would suggest some form of active learning for data set construction, since human bias can impact transferability.

shubbey commented 3 years ago

Thanks for the responses, guys. It makes sense.