Exscientia / physicsml

A package for all physics-based/related models
MIT License

Question about training script #27

Closed: wiederm closed this issue 2 months ago

wiederm commented 2 months ago

Hi everyone,

thank you for the great work on physicsml! We (tagging @sboresch and @AnnaPicha here) have been using physicsml to train MACE models on different datasets.

We have a few questions about the training:

We have trained the MACE model with the following script. Does this look reasonable to you?

# This script aims to train the MACE (ML-)potential, using the ANI2x training data set
#--------------------------------------------------------------------------------------------------------

from molflux.datasets import load_dataset
from molflux.datasets import list_datasets

dataset = load_dataset("ani2x", "rdkit", level_of_theory='wB97MD3BJ/def2TZVPP')

import logging
import torch
torch._C._jit_set_nvfuser_enabled(False)

logging.disable(logging.CRITICAL)

from molflux.core import featurise_dataset

featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": "mol_bytes",
            "representations": [
                {
                    "name": "physicsml_features",
                    "config": {
                        "atomic_number_mapping": {
                            1: 0,
                            6: 1,
                            7: 2,
                            8: 3,
                            9: 4,
                            16: 5,
                            17: 6,
                        },
                        "atomic_energies": {
                            1: -0.5978583943827134,
                            6: -38.08933878049795,
                            7: -54.711968298621066,
                            8: -75.19106774742086,
                            9: -99.80348506781634,
                            16: -398.1577125334925,
                            17: -460.1681939421027,                        
                        },
                        "backend": "rdkit",
                    },
                    "as": "{feature_name}",
                }
            ],
        }
    ],
}

featurised_dataset = featurise_dataset(
    dataset,
    featurisation_metadata=featurisation_metadata,
    num_proc=8,
    batch_size=1_000,
)

from molflux.datasets import split_dataset
from molflux.splits import load_from_dict as load_split_from_dict

shuffle_strategy = load_split_from_dict(
    {
        "name": "shuffle_split",
        "presets": {
            "train_fraction": 0.8,
            "validation_fraction": 0.1,
            "test_fraction": 0.1,
        },
    }
)

split_featurised_dataset = next(split_dataset(featurised_dataset, shuffle_strategy))

from molflux.modelzoo import load_from_dict as load_model_from_dict

model = load_model_from_dict(
    {
        "name": "mace_model",                   # model name
        "config": {
            "x_features": [                     # x features
                "physicsml_atom_idxs",
                "physicsml_atom_numbers",
                "physicsml_coordinates",
                "physicsml_total_atomic_energy",
            ],
            "y_features": ["energies"],         # y features #change to "energies" for ani2x data set
            "datamodule": {                     # The datamodule config
                "num_workers" : 7,
                "y_graph_scalars": [
                    "energies"                  #change to "energies" for ani2x data set
                ],                              # specify which y features are graph level scalars
                "num_elements": 7,
                "cut_off": 5.0,
                # "pre_batch": "on_disk",        # pre batch the dataset for faster data loading
                "train": {"batch_size": 128},   # specify the training batch size
                "validation": {
                    "batch_size": 128
                },                              # specify the val batch size (which can be different from the train size)
            },
            "num_node_feats": 7,
            "num_edge_feats": 0,
            "num_bessel": 8,
            "num_polynomial_cutoff": 5,
            "max_ell": 3,
            "num_interactions": 2,
            "hidden_irreps": "128x0e + 128x1o",
            "mlp_irreps": "16x0e",
            "avg_num_neighbours": 12.0,
            "correlation": 3,
            "y_graph_scalars_loss_config": {    # the loss config for the y graph scalars
                "name": "WeightedMSELoss",  
            },
            "optimizer": {                      # The optimizer config
                "name": "AdamW",
                "config": {
                    "lr": 1e-3,
                    "amsgrad": True,
                    "weight_decay": 5.0e-7,
                },
            },
            "trainer": {
                "accelerator": "gpu",
                "precision": 32,
                "devices": 1,
                #"strategy": "ddp",
                "callbacks": [
                    {"name": "LearningRateMonitor"},
                    {
                        "config": {
                            "patience": 100,
                            "monitor": "val/total/loss",
                            "mode": "min",
                        },
                        "name": "EarlyStopping",
                    },
                    {
                        "config": {
                            "dirpath": "ANI_model_training/checkpoints/mace/",
                            "every_n_epochs": 5,
                            "save_top_k": 1,
                            "monitor": "val/total/loss",
                            "save_last": True,
                        },
                        "name": "ModelCheckpoint",
                    },
                ],
                "enable_checkpointing": True,
                "gradient_clip_val": 20.0,
                "max_epochs": 350,
                "gradient_clip_algorithm": "norm",
            },
        },
    }
)

# train model
model.train(
    train_data=split_featurised_dataset["train"],
    validation_data=split_featurised_dataset["validation"],
)

# save model
from molflux.core import save_model
save_model(model, "trainedmodel", featurisation_metadata)

Thank you very much for your help!

wardhaddadin1 commented 2 months ago

Hello! So:

> In what unit is the training dataset, and in what unit should the self-energies be provided? Should the atomic self-energies be given in the energy unit of the training set, or is the training set converted to a specific unit in which the self-energies should then be provided? Specifically, how should we do this for the ANI2x and SPICE datasets?

The self energies should be provided in whatever units the training energies are given to the model (so if the model receives kcal/mol, the self energies should be in kcal/mol). We do not do any unit conversions in physicsml (all conversions are up to the user). For the ANI2x and SPICE datasets, the self energies should be in whatever unit you choose to convert the energies of the dataset into and give to the model (with that said, certain choices of units are better for numerical stability than others; it depends on how large the (energy - self_energy) scale is). Does this make sense?
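For concreteness, here is a minimal sketch of keeping the two consistent, assuming the ANI2x "energies" column is in Hartree (as the self energies in the script above appear to be) and that the dataset object supports the Hugging Face `map` API:

```python
# Minimal sketch (assumptions noted above): if you convert the dataset energies
# to a different unit, convert the atomic self energies by the same factor.
HARTREE_TO_KCAL_PER_MOL = 627.509474  # conversion factor

atomic_energies_hartree = {
    1: -0.5978583943827134,
    6: -38.08933878049795,
    7: -54.711968298621066,
    8: -75.19106774742086,
    9: -99.80348506781634,
    16: -398.1577125334925,
    17: -460.1681939421027,
}

# convert the energy labels (row-wise map over the dataset)
dataset = dataset.map(
    lambda row: {"energies": row["energies"] * HARTREE_TO_KCAL_PER_MOL}
)

# and convert the self energies with the same factor before featurisation
atomic_energies_kcal = {
    z: e * HARTREE_TO_KCAL_PER_MOL for z, e in atomic_energies_hartree.items()
}
```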

> How do we obtain "scaling_mean" and "scaling_std" for a given training dataset?

Good question! This actually depends on the model architecture. I will add a PR to clarify this more in the docs. But for the MACE model, the scaling_std is the std of the difference between the energy and the self energy, and the scaling_mean is that difference normalised by the number of atoms.

This is because the MACE model applies the scale and shift to the individual node energies before adding the self energies. So the scaling_std is the std of (energy - self_energy), and the scaling_mean must be normalised by the number of atoms (since the final energy is the sum of the node energies).
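A rough sketch of computing these from the training split, assuming the featurised column "physicsml_total_atomic_energy" holds the summed self energies and "physicsml_atom_idxs" lists the atom indices (the exact normalisation convention is worth double-checking against the docs):

```python
import numpy as np

# Rough sketch: derive scaling_std and scaling_mean from the training split.
deltas = []           # energy minus summed self energies, per molecule
deltas_per_atom = []  # the same quantity divided by the number of atoms

for row in split_featurised_dataset["train"]:
    delta = row["energies"] - row["physicsml_total_atomic_energy"]
    n_atoms = len(row["physicsml_atom_idxs"])
    deltas.append(delta)
    deltas_per_atom.append(delta / n_atoms)

scaling_std = float(np.std(deltas))             # spread of (energy - self_energy)
scaling_mean = float(np.mean(deltas_per_atom))  # per-atom shift
```

These values would then presumably be passed in the model config alongside the other MACE settings.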

For the script: everything looks good, but one thing about checkpointing: the ModelCheckpoint callback will only save the checkpoints to the directory specified, but the best checkpoint will not be applied to the model (the model will be saved with the weights of the last epoch). To apply the best weights to the model, use the ModelCheckpointApply callback (which is identical to the ModelCheckpoint callback, but applies the best model weights to the model at the end of training), as in the snippet below.
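In the script above, that would mean swapping the callback name and keeping the rest of the config the same:

```python
{
    "config": {
        "dirpath": "ANI_model_training/checkpoints/mace/",
        "every_n_epochs": 5,
        "save_top_k": 1,
        "monitor": "val/total/loss",
        "save_last": True,
    },
    "name": "ModelCheckpointApply",  # like ModelCheckpoint, but applies the best weights at the end of training
},
```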

Hope this helps! Let me know if you have more questions!

wardhaddadin1 commented 2 months ago

For the SPICE dataset, I would also be careful with the coordinates (they are in bohr, not in angstroms).
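For example, a minimal conversion sketch (the column name in the commented-out lines is hypothetical; adapt it to however the raw SPICE coordinates are stored):

```python
import numpy as np

BOHR_TO_ANGSTROM = 0.529177210903  # CODATA conversion factor

def bohr_to_angstrom(coords_bohr):
    """Convert an (N, 3) array of coordinates from bohr to angstrom."""
    return np.asarray(coords_bohr) * BOHR_TO_ANGSTROM

# e.g. applied to a (hypothetical) "conformations" column before featurisation:
# spice_dataset = spice_dataset.map(
#     lambda row: {"conformations": bohr_to_angstrom(row["conformations"]).tolist()}
# )
```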

wiederm commented 2 months ago

Thank you! This addressed all open questions!