choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Pre-processing dataset energies #63

Closed wiederm closed 4 months ago

wiederm commented 5 months ago

Description

It is common when training neural network potentials to pre-process the energies used for training. Two common strategies (described in more detail in https://doi.org/10.1063/5.0160326) are removing the atomic self-energies from the total energies, and normalizing the energies by subtracting the mean and dividing by the standard deviation.

Both pre-processing strategies require information passed from the dataset to the model, since the transformations must be reversed to obtain total energies.

This PR adds functionality to the dataset class to offset the total energies using atomic self-energies, calculated or provided for a specific level of theory and dataset, as well as the mean and variance of the total or pre-processed energies.

from modelforge.dataset.qm9 import QM9Dataset
from modelforge.dataset.dataset import TorchDataModule
from modelforge.dataset.utils import FirstComeFirstServeSplittingStrategy

data = QM9Dataset(for_unit_testing=True)
dataset = TorchDataModule(
    data, batch_size=8, split=FirstComeFirstServeSplittingStrategy()
)

# the self-energies are calculated and removed in prepare_data if `remove_self_energies` is True
dataset.prepare_data(remove_self_energies=True, normalize=False)
# alternatively, if the self-energies for a dataset are known, they can be provided
# as a dictionary with atomic numbers as keys and the self-energies as values
self_energies = {1: -10.0, 6: -1000.0}
dataset.prepare_data(remove_self_energies=True, normalize=False, self_energies=self_energies)

Self-energies

The self-energies are calculated using a linear regression fit if self-energies aren't provided. If total energies are required, the self-energies must be passed to the corresponding postprocessing module (AddSelfEnergies) to reverse the transformation.
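The fitting step above can be sketched as an ordinary least-squares problem: each molecule contributes one row of element counts, and the fitted coefficients are the per-element self-energies. This is a minimal illustration with synthetic numbers and plain NumPy, not modelforge's actual implementation.

```python
import numpy as np

# Rows are molecules, columns count atoms of each element (here H and C).
# The energies are synthetic, chosen so the fit has a known answer.
counts = np.array([
    [4, 1],   # CH4
    [6, 2],   # C2H6
    [2, 0],   # H2
], dtype=float)
true_self_energies = np.array([-10.0, -1000.0])  # H, C (made-up values)
total_energies = counts @ true_self_energies

# Solve counts @ x ≈ total_energies in the least-squares sense
fitted, *_ = np.linalg.lstsq(counts, total_energies, rcond=None)
print(fitted)  # recovers approximately [-10., -1000.]
```

Because the synthetic system is consistent and full rank, the fit recovers the per-element values exactly up to floating-point error.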

model = SchNET(
    embedding_module=embedding,
    nr_interaction_blocks=nr_interaction_blocks,
    radial_basis_module=rbf,
    cutoff_module=cutoff,
    nr_filters=nr_filters,
    postprocessing=PostprocessingPipeline(
        [AddSelfEnergies(dataset.dataset_statistics)]
    ),
)

Normalization

Energies can be normalized by removing the average and dividing by the standard deviation. This operation can be requested by calling

# self energy is calculated and removed in prepare_data if `remove_self_energies` is True
dataset.prepare_data(remove_self_energies=True, normalize=True)
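The normalization and its inverse can be sketched as follows; the values are synthetic, and in modelforge the mean and standard deviation would come from `dataset_statistics` rather than being recomputed.

```python
import numpy as np

# Minimal sketch of the normalization described above: subtract the mean
# and divide by the standard deviation, then undo the transformation.
energies = np.array([-12.0, -9.0, -11.0, -8.0])
mean, stddev = energies.mean(), energies.std()

normalized = (energies - mean) / stddev   # used for training
restored = normalized * stddev + mean     # postprocessing reverses it
assert np.allclose(restored, energies)
```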

To obtain the total energy, both the subtraction of the self-energies and the normalization need to be reverted. This can be done by passing both postprocessing classes to the PostprocessingPipeline:

model = SchNET(
    embedding_module=embedding,
    nr_interaction_blocks=nr_interaction_blocks,
    radial_basis_module=rbf,
    cutoff_module=cutoff,
    nr_filters=nr_filters,
    postprocessing=PostprocessingPipeline(
        [AddSelfEnergies(dataset.dataset_statistics), UndoNormalization(dataset.dataset_statistics)]
    ),
)

All of the Postprocessing classes can also be passed without dataset.dataset_statistics; in that case, they will check whether the dataset_statistics are already provided by the model (a trained model will provide these) and, if not, raise an error.
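The fallback logic described above can be sketched as follows. The class and attribute names are illustrative, not modelforge's actual API: a postprocessing step prefers statistics passed at construction time, falls back to statistics attached to the model, and raises an error otherwise.

```python
# Hypothetical sketch only; modelforge's real postprocessing classes differ.
class AddOffset:
    def __init__(self, dataset_statistics=None):
        self.dataset_statistics = dataset_statistics

    def __call__(self, energies, model=None):
        # prefer statistics given at construction, else look on the model
        stats = self.dataset_statistics
        if stats is None:
            stats = getattr(model, "dataset_statistics", None)
        if stats is None:
            raise ValueError(
                "No dataset_statistics provided and none found on the model."
            )
        return [e + stats["offset"] for e in energies]

print(AddOffset({"offset": 5.0})([1.0, 2.0]))  # [6.0, 7.0]
```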

Todos

Status

jchodera commented 5 months ago

Could adding the offset to the energy degrade the precision of the energy, and hence the precision of our loss function, where we need to subtract two large energies and are looking for << 1 kcal/mol errors in energy differences between snapshots?

If the energies are always converted to double precision before the offset is added, there is likely no problem. But if this might be done in single precision, we may want a different approach for computing losses where the offsets have been already removed.

(Apologies for accidental close/open! The mobile version is easy to hit the wrong button!)

wiederm commented 5 months ago

I agree with that concern. I am unsure about the best strategies, and I think we will have to empirically evaluate this (even if we use double precision).

My plan is to implement the transformation in this PR so that we can control the transformations independently in the computation graph. That will enable us to test the two scenarios: train on energies with and without offset.

Another possible scenario will be to test whether we want to normalize offset energies or not, again, something that we will have to determine empirically.

wiederm commented 5 months ago

@yuanqing-wang , do you have a preference or further comments on this issue?

jchodera commented 5 months ago

Just as a simple example of the issue: A druglike molecule might have a formation energy of around -1e6 kcal/mol. The smallest representable difference that float32 can represent is around 0.0625 kcal/mol if subtracting numbers of this magnitude. Unless promoted to double, this might also lead to lots of noise in gradients.

Bigger molecules (or molecules with heavier atoms, like halides) will have even more trouble.
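The float32 granularity quoted above can be checked directly with NumPy's `spacing`, which returns the distance from a value to the next representable one:

```python
import numpy as np

# Near a magnitude of 1e6, adjacent float32 values are 0.0625 apart, so
# energy differences well below ~0.06 kcal/mol cannot be resolved at that
# magnitude; float64 resolves differences down to ~1e-10.
gap32 = np.spacing(np.float32(1.0e6))
gap64 = np.spacing(np.float64(1.0e6))
print(gap32)  # 0.0625
print(gap64)  # ~1.16e-10
```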

wiederm commented 5 months ago

@jchodera , I think removing the self energies will improve this situation considerably.

wiederm commented 5 months ago

@chrisiacovella , did you already implement the self-energies for QM9 somewhere, or do you have opinions on where these should go? We can also discuss this offline later.

codecov-commenter commented 5 months ago

Codecov Report

Merging #63 (092c4cf) into main (4861157) will increase coverage by 2.32%. The diff coverage is 89.28%.

chrisiacovella commented 4 months ago

> @chrisiacovella , did you already implement the self-energies for QM9 somewhere, or do you have opinions on where these should go? We can also discuss this offline later.

In the QM9 dataset, the authors provided thermochemical references for each of the atoms in an SI table. E.g., here is the data for hydrogen:

        "H": {
            "ZPVE": 0.000000 * unit.hartree,
            "U_0K": -0.500273 * unit.hartree,
            "U_298.15K": -0.498857 * unit.hartree,
            "H_298.15K": -0.497912 * unit.hartree,
            "G_298.15K": -0.510927 * unit.hartree,
            "Cv": 2.981 * unit.calorie_per_mole / unit.kelvin,

For each molecule, I calculate the reference values (i.e., just summing up all the individual atomic contributions) for each of the states/quantities (so reference energy at 0 K, at 298.15 K, enthalpy at 298.15 K, etc.).

Obviously, the most relevant for our needs is the 0 K energy. The datafile includes the internal energy, reference energy, and formation energy at 0 K,

where the formation_energy_at_0K = internal_energy_at_0K - reference_energy_at_0K
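The per-molecule summation described above can be sketched as follows. The hydrogen value is the U_0K entry from the QM9 SI table quoted earlier; the carbon value here is a made-up placeholder for illustration only.

```python
# hartree; "C" is a placeholder value, not the actual QM9 reference
U_0K = {"H": -0.500273, "C": -38.0}

def reference_energy_at_0K(composition):
    """Sum per-atom 0 K reference energies for a molecule."""
    return sum(count * U_0K[element] for element, count in composition.items())

# formation_energy_at_0K = internal_energy_at_0K - reference_energy_at_0K
ref = reference_energy_at_0K({"C": 1, "H": 4})  # CH4
print(ref)  # -38.0 + 4 * -0.500273 = -40.001092
```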

In the spice_openff dataset, we also have:

The non-OpenFF level-of-theory SPICE dataset has the formation_energy already in the distributed hdf5 file (but not the reference energy for each molecule); I used the same code as the downloader script used to generate the hdf5 file Peter posted on Zenodo, so those values should be identical (and could be added in trivially if we aren't going to use the formation_energy directly from the file).

ANI1x didn't provide the references, as far as I could find in the publication (they may be in the code base, but I haven't located them yet). The same level of theory is used as for QM9, so those would probably be reasonable values to use.

wiederm commented 4 months ago

To pick up the conversation from here: https://github.com/choderalab/modelforge/pull/63#pullrequestreview-1898911989

Currently, each model stores the atomic self-energies that were used as model instance parameters, so they are saved whenever the model is saved. We also write a toml file containing them.

We accept the self-energies of the datasets (in fact, these should be preferred) as a dict here: https://github.com/choderalab/modelforge/blob/c17db775c648098adfc4f75cf9d4eaec2bbc9cb2/modelforge/dataset/dataset.py#L557 We should provide a method to obtain them from the dataset internally, though. I will open an issue about this.

wiederm commented 4 months ago

I will spin off some of the comments in new issues so we can merge this PR as soon as possible. New issues: https://github.com/choderalab/modelforge/issues/72 https://github.com/choderalab/modelforge/issues/73

wiederm commented 4 months ago

And, as @chrisiacovella mentioned here, I have added a reference_ase dictionary for the QM9 dataset that we can use internally.

wiederm commented 4 months ago

I added an AtomicSelfEnergies dataclass that takes a dictionary of {str: float} atomic self-energies and returns the atomic self-energies both by element name and by atomic number. I added it to the QM9 dataset class.
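A minimal sketch of such a dataclass, assuming a small hand-written symbol-to-atomic-number table (the actual modelforge implementation may differ):

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative lookup table covering only a few elements
_ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}

@dataclass
class AtomicSelfEnergies:
    # maps element symbol -> self-energy
    energies: Dict[str, float] = field(default_factory=dict)

    def __getitem__(self, key):
        if isinstance(key, int):
            # translate an atomic number back to its element symbol
            symbol = {z: s for s, z in _ATOMIC_NUMBERS.items()}[key]
            return self.energies[symbol]
        return self.energies[key]

ases = AtomicSelfEnergies({"H": -0.500273, "C": -37.8})
print(ases["H"], ases[6])  # -0.500273 -37.8
```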