We had a few discussions about the best way to train a neural network potential on QM energies without loss of precision or numeric instabilities.
I am proposing the following approach (I have already implemented this in the train PR, but summarizing the discussion in a separate PR to make it clear for everyone involved seems appropriate):
In the first preprocessing step, we calculate (using regression) or obtain (either from the user, or from the dataset if provided there) the atomic self-energy of each element, E_element_ase. This is then passed to the neural network and used to calculate the atomic energy E_i as E_i = E_i_pred + E_element_ase. That makes E_element_ase a parameter of each trained neural network that will be stored with the model.
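The regression step can be sketched as a least-squares fit of per-element self-energies to the dataset's total energies (a sketch only; the function and variable names are assumptions, not the actual PR code):

```python
import numpy as np

def fit_atomic_self_energies(element_counts, total_energies):
    """Estimate per-element self-energies E_element_ase by linear regression.

    element_counts: (n_molecules, n_elements) array; how many atoms of each
                    element every molecule contains.
    total_energies: (n_molecules,) array of QM total energies (float64).
    Returns a (n_elements,) array with one self-energy per element.
    """
    # Solve element_counts @ e_ase ≈ total_energies in the least-squares sense.
    e_ase, *_ = np.linalg.lstsq(element_counts, total_energies, rcond=None)
    return e_ase

# Toy example: two "elements", three molecules of known composition.
counts = np.array([[2.0, 1.0], [1.0, 1.0], [3.0, 2.0]])
true_ase = np.array([-10.0, -50.0])
energies = counts @ true_ase  # exactly additive, so the fit recovers true_ase
print(fit_atomic_self_energies(counts, energies))
```

For a real dataset the residual of this fit is what remains to be learned by the network, which is why the self-energies dominate the raw labels.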
We have two scenarios:
training scenario: the loss is calculated between the total energy E_total_predict (the sum of the atomic energies E_i) and the E_label_without_ase provided by the dataset. In this scenario, E_element_ase is not added to E_i.
inference scenario: E_element_ase is added to each E_i, so the sum recovers the full QM total energy.
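The two scenarios can be sketched as a single energy-assembly function with a `training` flag (names and values are illustrative, not the PR's API):

```python
import numpy as np

def total_energy(per_atom_pred, atomic_numbers, e_element_ase, training):
    """Sum per-atom energies into a molecular energy.

    per_atom_pred:  E_i_pred values from the network, one per atom.
    atomic_numbers: element of each atom (keys into e_element_ase).
    e_element_ase:  per-element self-energies stored with the model.
    training:       if True, self-energies are NOT added, so the sum is
                    compared against E_label_without_ase; at inference they
                    are added to recover the full QM total energy.
    """
    e_i = np.asarray(per_atom_pred, dtype=np.float64)
    if not training:
        e_i = e_i + np.array([e_element_ase[z] for z in atomic_numbers])
    return e_i.sum()

# Water-like toy example (the self-energy values are made up):
ase = {1: -0.5, 8: -75.0}
pred = [0.1, 0.2, 0.3]
print(total_energy(pred, [1, 1, 8], ase, training=True))   # ≈ 0.6
print(total_energy(pred, [1, 1, 8], ase, training=False))  # ≈ -75.4
```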
Normalization
We can also normalize E to help with training. Currently, we calculate the mean and the standard deviation of E_label (with the self-energies removed for this calculation) and then scale to a unit interval.
In practice, this means that for a given QM dataset we obtain E_scaling_mean and E_scaling_stddev, and the total energy we predict is E = E_total_predict * E_scaling_stddev + E_scaling_mean. This makes sense especially if the value of E_i is restricted by a hyperbolic tangent or sigmoid activation function.
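A minimal sketch of the normalization and its inverse (the dataset values are made up; variable names mirror the text):

```python
import numpy as np

# E_label with self-energies already removed, in float64 (toy values):
e_label_without_ase = np.array([-1.2, -0.8, -1.0, -0.6])

e_scaling_mean = e_label_without_ase.mean()
e_scaling_stddev = e_label_without_ase.std()

# Labels the network is actually trained against:
normalized = (e_label_without_ase - e_scaling_mean) / e_scaling_stddev

def denormalize(e_total_predict):
    # Inverse transform applied to the network output.
    return e_total_predict * e_scaling_stddev + e_scaling_mean

print(denormalize(normalized))  # recovers the original labels
```

Because the normalized targets have zero mean and unit spread, a tanh- or sigmoid-bounded per-atom output has a realistic chance of covering the label range.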
Note: there is an argument that we could train directly on E_label (including the atomic self-energies) using such an energy expression. That is true, but when we remove the atomic self-energies from the QM dataset, we operate in float64, while during training we are in float32. This loss of precision is relevant for the larger molecules in the training set.
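The precision argument can be demonstrated numerically: subtracting a large self-energy offset in float32 corrupts the small residual we actually want to learn (the energies below are made up but of realistic magnitude):

```python
import numpy as np

# The total QM energy of a larger molecule is dominated by self-energies.
e_label = np.float64(-12345.678901234)  # QM total energy
e_ase_sum = np.float64(-12345.0)        # sum of atomic self-energies

# Removing the offset in float64 during preprocessing keeps the residual:
residual_f64 = e_label - e_ase_sum      # ≈ -0.678901234

# Doing the same at training precision (float32) does not:
residual_f32 = np.float32(e_label) - np.float32(e_ase_sum)

print(residual_f64)
print(residual_f32)                             # off by roughly 2e-4
print(abs(float(residual_f32) - residual_f64))  # the precision we lost
```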
Parameter initialization
Difference between training and inference stage
The neural network potential behaves differently in these two stages. During inference, we want to predict the total energy (which corresponds to the QM energy); during training, we want to match the E_label provided by the dataset (which might represent the QM energy after some transformation).
Currently, we will match the QM energy if we provide the values that were used for the transformation.
If, e.g., self-energies are not provided, they won't be added; if 'scaling_mean' and 'scaling_stddev' are not provided, they default to 0 and 1, respectively (i.e., the identity transform). There might be a cleaner way to control this behavior, but I think it is fine for the moment.
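The default behavior could be sketched like this (the function name is hypothetical; the point is that the absent parameters fall back to neutral values):

```python
def resolve_transform_parameters(provided):
    """Fill in neutral defaults for the label-transformation parameters.

    With these defaults nothing is added and nothing is rescaled, so the
    model output is used as-is.
    """
    return {
        # No self-energies provided -> none are added later.
        "self_energies": provided.get("self_energies", None),
        # Neutral values for E = E_total_predict * stddev + mean:
        "scaling_mean": provided.get("scaling_mean", 0.0),
        "scaling_stddev": provided.get("scaling_stddev", 1.0),
    }

print(resolve_transform_parameters({}))
print(resolve_transform_parameters({"scaling_mean": -1.0}))
```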
Todos
Notable points that this PR has either accomplished or will accomplish.