Closed wiederm closed 4 months ago
Could adding the offset to the energy degrade the precision of the energy, and hence the precision of our loss function, where we need to subtract two large energies and are looking for << 1 kcal/mol errors in energy differences between snapshots?
If the energies are always converted to double precision before the offset is added, there is likely no problem. But if this might be done in single precision, we may want a different approach for computing losses where the offsets have been already removed.
(Apologies for accidental close/open! The mobile version is easy to hit the wrong button!)
I agree with that concern. I am unsure about the best strategies, and I think we will have to empirically evaluate this (even if we use double precision).
My plan is to implement the transformation in this PR so that we can control the transformations independently in the computation graph. That will enable us to test the two scenarios: train on energies with and without offset.
Another possible scenario will be to test whether we want to normalize offset energies or not, again, something that we will have to determine empirically.
@yuanqing-wang , do you have a preference or further comments on this issue?
Just as a simple example of the issue: A druglike molecule might have a formation energy of around -1e6 kcal/mol. The smallest difference float32 can represent is around 0.0625 kcal/mol when subtracting numbers of this magnitude. Unless promoted to double, this might also lead to lots of noise in gradients.
Bigger molecules (or molecules with heavier atoms, like halides) will have even more trouble.
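The 0.0625 kcal/mol figure can be checked directly with NumPy; this small sketch uses the -1e6 kcal/mol magnitude from the example above:

```python
import numpy as np

# Spacing (ULP) of float32 near 1e6: 1e6 lies between 2**19 and 2**20,
# so adjacent float32 values differ by 2**19 * 2**-23 = 0.0625.
print(np.spacing(np.float32(1.0e6)))  # 0.0625

# A sub-ULP energy difference vanishes entirely in float32:
e1 = np.float32(-1.0e6)
e2 = np.float32(-1.0e6 + 0.03)  # 0.03 kcal/mol apart in exact arithmetic
print(float(e2 - e1))  # 0.0 -- the difference is lost to rounding
```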
@jchodera , I think removing the self energies will improve this situation considerably.
@chrisiacovella , did you already implement the self-energies for QM9 somewhere, or do you have opinions on where these should go? We can also discuss this offline later.
In the QM9 dataset, an SI table provides thermochemical references for each of the atoms. E.g., here is the data for hydrogen:
```python
"H": {
    "ZPVE": 0.000000 * unit.hartree,
    "U_0K": -0.500273 * unit.hartree,
    "U_298.15K": -0.498857 * unit.hartree,
    "H_298.15K": -0.497912 * unit.hartree,
    "G_298.15K": -0.510927 * unit.hartree,
    "Cv": 2.981 * unit.calorie_per_mole / unit.kelvin,
},
```
For each molecule, I calculate the reference values (i.e., just summing up all the individual atomic contributions) for each of the states/quantities (so reference energy at 0 K, at 298.15 K, enthalpy at 298.15 K, etc.).
Obviously, the most relevant for our needs is the 0K energy. The datafile includes:
- `internal_energy_at_0K`
- `reference_energy_at_0K`
- `formation_energy_at_0K`
where `formation_energy_at_0K = internal_energy_at_0K - reference_energy_at_0K`
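As a small sanity check of that relation, here is a sketch; the hydrogen U_0K value is the one quoted from the SI table above, while the carbon/oxygen values and the molecular internal energy are illustrative placeholders, not actual QM9 data:

```python
# Per-atom U_0K references in hartree. The H value is from the QM9 SI table
# quoted above; C and O are illustrative placeholders.
ATOMIC_U0 = {"H": -0.500273, "C": -37.8468, "O": -75.0646}

def reference_energy_at_0K(elements):
    # reference energy = sum of the individual atomic contributions
    return sum(ATOMIC_U0[e] for e in elements)

# formation_energy_at_0K = internal_energy_at_0K - reference_energy_at_0K
elements = ["C", "H", "H", "H", "H", "O"]  # methanol, CH3OH
internal_energy_at_0K = -115.5             # placeholder value, hartree
formation_energy_at_0K = internal_energy_at_0K - reference_energy_at_0K(elements)
print(formation_energy_at_0K)
```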
In the spice_openff dataset, we also have:
- `dft_total_energy`
- `reference_energy`
- `formation_energy`
where again, `formation_energy = dft_total_energy - reference_energy`
The non-OpenFF levels of theory in SPICE already have the formation_energy in the distributed HDF5 file (but not the per-molecule reference energy); I used the same code as the downloader script that generated the HDF5 file Peter posted on Zenodo, so those values should be identical (and the reference energies could be added trivially if we aren't going to use the formation_energy directly from the file).
I couldn't find the atomic references for ANI1x in the publication (they may be in the code base, but I haven't located them yet). It uses the same level of theory as QM9, so the QM9 values would probably be reasonable to use.
To pick up the conversation from here: https://github.com/choderalab/modelforge/pull/63#pullrequestreview-1898911989
Currently, each model stores the atomic self-energies it used as model instance parameters, so these are saved whenever the model is saved. We also write a TOML file containing them.
We accept the self-energies of the datasets (in fact, these should be preferred) as a dict here: https://github.com/choderalab/modelforge/blob/c17db775c648098adfc4f75cf9d4eaec2bbc9cb2/modelforge/dataset/dataset.py#L557 We should also provide a method to obtain them from the dataset internally, though; I will open an issue about this.
I will spin off some of the comments in new issues so we can merge this PR as soon as possible. New issues: https://github.com/choderalab/modelforge/issues/72 https://github.com/choderalab/modelforge/issues/73
And, as @chrisiacovella mentioned here, I have added a reference_ase dictionary for the QM9 dataset that we can use internally.
I added an `AtomicSelfEnergies` dataclass that takes a dictionary of {str: float} atomic self-energies and returns the atomic self-energies both by element name and by atomic index. I added it to the QM9 dataclass.
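A minimal sketch of what such a dataclass could look like; the method and attribute names here are guesses for illustration, not the actual modelforge implementation:

```python
from dataclasses import dataclass
from typing import Dict

# Symbol -> atomic number map for index-based lookup (subset, for illustration)
_ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}

@dataclass
class AtomicSelfEnergies:
    # atomic self-energies keyed by element symbol, e.g. {"H": -0.500273}
    energies: Dict[str, float]

    def by_element(self, symbol: str) -> float:
        # lookup by element name
        return self.energies[symbol]

    def by_atomic_index(self, atomic_number: int) -> float:
        # lookup by atomic number via the inverted symbol map
        symbol = {z: s for s, z in _ATOMIC_NUMBERS.items()}[atomic_number]
        return self.energies[symbol]

ase = AtomicSelfEnergies({"H": -0.500273, "C": -37.8468})
print(ase.by_element("H"), ase.by_atomic_index(6))
```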
Description
It is common in training neural network potentials to apply pre-processing steps to the energies used for training. Two common strategies, described in more detail here: https://doi.org/10.1063/5.0160326, are removing atomic self-energies from the total energies and normalizing the resulting energies.
Both pre-processing strategies require information passed from the dataset to the model, since the transformations must be reversed to obtain total energies.
This PR adds functionality to the dataset class to offset the total energies using atomic self-energies (either calculated, or provided for a specific level of theory and dataset), and to compute the mean and variance of the total or pre-processed energies.
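For the "calculated" case, a standard approach is a least-squares fit of per-element energies to the molecular total energies. This sketch uses made-up data and is not the modelforge code:

```python
import numpy as np

# counts[i, j] = number of atoms of element j in molecule i; solve
# counts @ e ~= E_total for the per-element self-energies e.
elements = ["H", "C", "O"]
counts = np.array([[4.0, 1.0, 0.0],   # CH4
                   [6.0, 2.0, 1.0],   # ethanol
                   [2.0, 0.0, 1.0]])  # water
true_e = np.array([-0.5, -37.8, -75.1])  # made-up "self-energies"
E_total = counts @ true_e + np.random.default_rng(0).normal(0.0, 1e-3, 3)

# least-squares fit recovers the per-element energies up to the noise level
e_fit, *_ = np.linalg.lstsq(counts, E_total, rcond=None)
print(e_fit)  # close to true_e
```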
Self-energies
If self-energies aren't provided, they are calculated using a linear regression fit. If total energies are required, the self-energies must be passed to the corresponding postprocessing module (`AddSelfEnergies`) to reverse the transformation.
Normalization
Energies can be normalized by subtracting the mean and dividing by the standard deviation. This operation can be requested by calling
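Generically, that transformation and its inverse look like the following sketch; the variable names are hypothetical, not the actual modelforge API:

```python
import numpy as np

# Normalize energies; keep mean/stddev so the transform can be undone later.
energies = np.array([-10.0, -12.0, -8.0, -11.0])
mean, stddev = energies.mean(), energies.std()

normalized = (energies - mean) / stddev  # forward transform (training target)
recovered = normalized * stddev + mean   # inverse transform (prediction -> energy)

print(np.allclose(recovered, energies))  # True
```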
To obtain the total energy, both the subtraction of the self-energies and the normalization need to be reverted. This can be done by passing both Postprocessing classes to the `PostprocessingPipeline`:
All of the Postprocessing classes can also be passed without
`dataset.dataset_statistics`
; in that case, they will check whether the dataset_statistics are already provided by the model (a trained model will provide these) and raise an error if not.
Todos
Status