atomistic-machine-learning / G-SchNet

G-SchNet - a generative model for 3d molecular structures
MIT License

How to choose max_dist? #2

Closed: wxx07 closed this issue 4 years ago

wxx07 commented 4 years ago

I wonder how you chose max_dist in the collate_atoms function: https://github.com/atomistic-machine-learning/G-SchNet/blob/74cea9b00bf27e62f7c324e4e6f3a7b6f1f45e23/utility_functions.py#L589-L596 Is max_dist based on statistics from the QM9 dataset, such as the maximum observed distance, or is it fairly arbitrary and open to tuning? In the following lines, it seems there may be entries in the distance map greater than max_dist: https://github.com/atomistic-machine-learning/G-SchNet/blob/74cea9b00bf27e62f7c324e4e6f3a7b6f1f45e23/utility_functions.py#L503-L505 Could you help me clear this up?

NiklasGebauer commented 4 years ago

Hey,

sorry for the late answer! For the QM9 molecules, there shouldn't be many with larger pairwise distances than 15 Angström, so it's a pretty loose cutoff chosen by roughly looking at the data. However, it could also be set to a smaller or larger value.

In general, this is the maximum distance covered by the (binned) distance distributions that the model predicts. The two parameters max_dist and n_bins define the resolution of the 1D grid used when predicting distance distributions (i.e. in this case 300 bins between 0 and 15 Angström). You could play with these parameters to increase or decrease the resolution, either to make your model more precise or to reduce its number of parameters.
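To make the resolution concrete, here is a minimal sketch of such a grid. It assumes `n_bins` evenly spaced grid points from 0 to `max_dist` with both endpoints included; the exact bin layout in `utility_functions.py` may differ, and `distance_grid` is a hypothetical helper, not a function from the repository.

```python
def distance_grid(max_dist=15.0, n_bins=300):
    """Return evenly spaced grid points on [0, max_dist] and their spacing.

    Hypothetical helper for illustration only; not part of G-SchNet.
    Assumes both endpoints (0 and max_dist) are grid points.
    """
    step = max_dist / (n_bins - 1)
    grid = [i * step for i in range(n_bins)]
    return grid, step


# Default setting from the thread: 300 bins between 0 and 15 Angström.
grid, step = distance_grid()

# Halving n_bins at the same max_dist roughly doubles the spacing,
# i.e. a coarser grid with fewer output values to predict.
coarse_grid, coarse_step = distance_grid(max_dist=15.0, n_bins=150)
```

Under this assumption the default spacing is about 0.05 Angström, so raising `max_dist` without also raising `n_bins` trades resolution for range.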

In the current implementation, the model simply learns to predict max_dist for every distance larger than max_dist, and distances are clipped to that value during generation.
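The clipping behaviour can be sketched as follows. This is an illustration under assumptions, not the repository's code: `distance_to_bin` is a hypothetical name, and it assumes a grid of `n_bins` evenly spaced points on [0, max_dist] with each distance assigned to its nearest grid point.

```python
def distance_to_bin(dist, max_dist=15.0, n_bins=300):
    """Map a distance to the index of the nearest grid point.

    Hypothetical helper for illustration only; not part of G-SchNet.
    Distances above max_dist are clipped, so they all land in the last bin.
    """
    clipped = min(max(dist, 0.0), max_dist)
    step = max_dist / (n_bins - 1)  # assumes endpoints are grid points
    return int(round(clipped / step))


# Every distance beyond max_dist is indistinguishable from max_dist itself.
last_bin = distance_to_bin(15.0)
far_bin = distance_to_bin(20.0)
```

Here `last_bin` and `far_bin` are the same index, which is why atoms farther away than `max_dist` carry no precise distance information in this scheme.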

In a proper (future) implementation, the model should only make predictions for atoms that are closer to the current focus than max_dist. Then the parameter will be even more important and the implementation will scale to larger molecules with many more atoms than in QM9.

Does this clear everything up?

wxx07 commented 4 years ago

Thank you for your reply :)

Given the effect on generation you mentioned, I guess a proper value for max_dist would be one (slightly) larger than the maximum pairwise distance observed in the dataset, like 15 Angström for QM9. That way the trained model would not predict too many atoms outside the range of max_dist, which could add uncertainty when determining the location of predicted atoms. Is that right?
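One way to check this for a given dataset is to compute the largest pairwise distance directly. The sketch below is only an illustration: the function names, the toy coordinates, and the `margin` value are assumptions, and for a real dataset a vectorized implementation would be preferable to this pure-Python loop.

```python
import math
from itertools import combinations


def max_pairwise_distance(positions):
    """Largest Euclidean distance between any two atoms of one molecule."""
    return max(
        (math.dist(a, b) for a, b in combinations(positions, 2)),
        default=0.0,  # molecules with fewer than two atoms
    )


def suggest_max_dist(molecules, margin=0.5):
    """Largest pairwise distance over all molecules plus a margin.

    Hypothetical helper; the margin of 0.5 is an arbitrary assumed value.
    """
    return max(max_pairwise_distance(m) for m in molecules) + margin


# Toy data: two "molecules" given as lists of 3D atom positions.
molecules = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0), (0.0, 1.0, 0.0)],
]
suggested = suggest_max_dist(molecules)
```

For the toy data the largest pairwise distance is 5.0, so `suggested` comes out as 5.5.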

NiklasGebauer commented 4 years ago

Yes, that's how you can handle it in the current implementation! However, I am not sure whether keeping it at 15 Angström would introduce too much uncertainty for data with larger molecules, since a lot of atoms would still be closer than max_dist. I never actually tested whether too much noise is added for larger molecules. Did you run into problems with larger molecules?

In the future, one should just implement a suitable cutoff such that predictions of atoms farther away than max_dist are not considered during generation (or, for memory purposes, the predictions should not even be calculated for those atoms).
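A cutoff of this kind could look roughly like the sketch below. It is not implemented in the repository as far as this thread says; `atoms_within_cutoff` is a hypothetical name, and the idea is simply to select the atoms whose predictions would be kept for a given focus position.

```python
import math


def atoms_within_cutoff(focus, positions, max_dist=15.0):
    """Return indices of atoms no farther than max_dist from the focus.

    Hypothetical helper for illustration only; not part of G-SchNet.
    Predictions for all other atoms would be skipped or ignored.
    """
    return [
        i for i, pos in enumerate(positions)
        if math.dist(focus, pos) <= max_dist
    ]


focus = (0.0, 0.0, 0.0)
positions = [(1.0, 0.0, 0.0), (0.0, 20.0, 0.0), (0.0, 0.0, 14.9)]
kept = atoms_within_cutoff(focus, positions)
```

In this example the second atom is 20 Angström from the focus, so only indices 0 and 2 are kept. Computing predictions only for the kept atoms is what would save memory for larger molecules.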

wxx07 commented 4 years ago

No, I have not tried 15 Angström in the case of larger molecules yet. I will give it a try. That really clears things up. Thanks!