CederGroupHub / chgnet

Pretrained universal neural network potential for charge-informed atomistic modeling https://chgnet.lbl.gov
https://doi.org/10.1038/s42256-023-00716-3

Loss spikes during training expected? #81

Closed guaguabujianle closed 10 months ago

guaguabujianle commented 10 months ago

Email (Optional)

yangzd@mail2.sysu.edu.cn

Version

v0.2.1

Which OS(es) are you using?

What happened?

Dear Authors,

Thank you for sharing the source code for CHGNet. I am attempting to train CHGNet from scratch on the Materials Project trajectory (MPtrj) dataset you provide. However, I've noticed some instability during training: the loss occasionally spikes by several orders of magnitude (see the Log output section below).


Is this expected behavior? I would greatly appreciate a full demo (perhaps a .py file) showing how to train CHGNet from scratch, as I aim to compare a model I recently developed with CHGNet under identical conditions. I've included my training script below for reference. Thank you very much.

Code snippet

# %%
import os

os.environ['CUDA_VISIBLE_DEVICES'] = "1"  # select which GPU to train on

from chgnet.data.dataset import StructureJsonData
from chgnet.graph.converter import CrystalGraphConverter
from chgnet.model import CHGNet
from chgnet.trainer import Trainer

# %%
data_path = '/scratch/yangzd/materials/data/MPtrj_2022.9_full.json'

# Build crystal graphs with a 5 Å atom-graph cutoff and a 3 Å bond-graph cutoff
graph_converter = CrystalGraphConverter(atom_graph_cutoff=5, bond_graph_cutoff=3)
# Load the MPtrj dataset with energy and force ("ef") targets
dataset = StructureJsonData(data_path, graph_converter, targets="ef")
# 90/5/5 train/val/test split
train_loader, val_loader, test_loader = dataset.get_train_val_test_loader(train_ratio=0.9, val_ratio=0.05)

# %%
chgnet = CHGNet()  # randomly initialized model (training from scratch)
trainer = Trainer(
    model=chgnet,
    targets="ef",
    optimizer="Adam",
    criterion="MSE",
    learning_rate=1e-3,
    epochs=50,
    use_device="cuda",
)
trainer.train(train_loader, val_loader, test_loader)

Log output

Epoch: [0][1/44451] | Time (1.624)(0.338) | Loss 0.3169(0.3169) | MAE e 0.403(0.403)  f 0.052(0.052)
Epoch: [0][100/44451] | Time (0.623)(0.335) | Loss 0.5776(1.4586) | MAE e 0.412(0.477)  f 0.139(0.206)
Epoch: [0][200/44451] | Time (0.626)(0.339) | Loss 0.4351(1.6517) | MAE e 0.399(0.458)  f 0.116(0.208)
Epoch: [0][300/44451] | Time (0.892)(0.607) | Loss 0.9594(1.2858) | MAE e 0.511(0.443)  f 0.238(0.192)
...
Epoch: [0][26300/44451] | Time (0.996)(0.654) | Loss 1.3189(0.6118) | MAE e 0.160(0.188)  f 0.425(0.217)
Epoch: [0][26400/44451] | Time (0.995)(0.653) | Loss 0.3473(0.6110) | MAE e 0.214(0.188)  f 0.263(0.217)
Epoch: [0][26500/44451] | Time (0.998)(0.655) | Loss 124.5792(1.6960) | MAE e 0.684(0.189)  f 2.491(0.220)
Epoch: [0][26600/44451] | Time (0.996)(0.654) | Loss 0.6696(641.2061) | MAE e 0.470(0.191)  f 0.318(0.249)
Epoch: [0][26700/44451] | Time (0.995)(0.653) | Loss 3.6322(638.8499) | MAE e 0.445(0.192)  f 0.847(0.251)
Epoch: [0][26800/44451] | Time (0.998)(0.656) | Loss 6.0389(636.4801) | MAE e 0.522(0.193)  f 0.643(0.252)
....
Epoch: [0][37300/44451] | Time (1.015)(0.665) | Loss 1.2740(1311.9422) | MAE e 0.544(0.279)  f 0.345(0.567)
Epoch: [0][37400/44451] | Time (1.015)(0.664) | Loss 2.2951(1308.5920) | MAE e 0.447(0.280)  f 0.466(0.567)
Epoch: [0][37500/44451] | Time (1.017)(0.666) | Loss 5.7568(1305.1261) | MAE e 0.596(0.280)  f 0.605(0.567)
Epoch: [0][37600/44451] | Time (1.016)(0.666) | Loss 26822.4453(1302.3834) | MAE e 0.540(0.281)  f 15.234(0.567)
Epoch: [0][37700/44451] | Time (1.016)(0.665) | Loss 13.8838(1299.0148) | MAE e 0.538(0.281)  f 0.703(0.567)
Epoch: [0][37800/44451] | Time (1.015)(0.664) | Loss 2.9397(1295.8719) | MAE e 0.441(0.282)  f 0.345(0.567)


BowenD-UCB commented 10 months ago

Instability is a common issue when pretraining on large datasets like MPtrj, since occasional outliers remain even after our data-cleaning efforts. The most obvious issue in your training code is that you used MSE loss, which is sensitive to outliers, whereas the pretrained CHGNet was trained with Huber loss, as described in the paper.

If you still see spikes after switching to Huber loss, consider adjusting the learning rate, batch size, etc. You can also reload the checkpoint from the last epoch where the loss had not yet spiked with trainer = Trainer.load('model_path').
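
For reference, a minimal sketch of reloading a checkpoint and resuming training, assuming a saved trainer state (the file path is a placeholder) and the data loaders from the script above:

# Reload the trainer state saved before the loss spiked (path is a placeholder)
trainer = Trainer.load("path/to/last_good_checkpoint.pth.tar")
# Resume training with the same data loaders
trainer.train(train_loader, val_loader, test_loader)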

BowenD-UCB commented 10 months ago

Here is an example of a more stable trainer configuration:

# Define Trainer
trainer = Trainer(
    model=model,
    targets="efsm",        # train on energy, force, stress, and magmom
    energy_loss_ratio=1,
    force_loss_ratio=1,
    stress_loss_ratio=0.1,
    mag_loss_ratio=0.1,
    optimizer="Adam",
    weight_decay=0,
    scheduler="CosLR",     # cosine learning-rate schedule
    criterion="Huber",     # Huber loss is robust to outliers
    delta=0.1,
    epochs=30,
    starting_epoch=0,
    learning_rate=1e-3,
    use_device="cuda",
    print_freq=1000,
)
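
Training is then launched the same way as in your snippet, assuming the data loaders defined there:

trainer.train(train_loader, val_loader, test_loader)
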
guaguabujianle commented 10 months ago

Hi, thank you very much for your help! The solution fixed my issue. I have a follow-up question. I've developed a model that predicts the relaxed structure directly from the unrelaxed structure, and I would like to train it on the MPtrj dataset you provide. I'm only interested in the initial and final relaxed structures, not the intermediate relaxation steps. However, within the MPtrj dataset, each mp_id is associated with multiple frames. For instance, mp-1054 includes mp-913687-0-0, mp-1054-1-3, mp-1054-1-1, mp-1054-1-0, and mp-1793540-0-0. How can I determine which frame corresponds to the initial structure and which to the final relaxed structure for a given mp_id? Would it be accurate to sort the frames by energy_per_atom and take the one with the highest energy_per_atom as the initial structure and the one with the lowest as the final relaxed structure? Thank you.
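
For concreteness, a minimal sketch of the energy_per_atom sorting idea proposed above, assuming the MPtrj JSON is a nested dict keyed by mp_id and then by frame id, and that each frame record exposes an energy_per_atom field (the field name and layout are assumptions); whether the highest-energy frame really is the initial structure is exactly the open question here:

import json

# Path is a placeholder; point it to your local copy of MPtrj
with open("MPtrj_2022.9_full.json") as f:
    mptrj = json.load(f)

frames = mptrj["mp-1054"]  # frame id -> frame record (assumed layout)

# Order frame ids from lowest to highest energy_per_atom (assumed field name)
ordered = sorted(frames, key=lambda fid: frames[fid]["energy_per_atom"])

final_candidate = ordered[0]     # lowest energy: candidate final relaxed structure
initial_candidate = ordered[-1]  # highest energy: candidate initial structure
print(initial_candidate, "->", final_candidate)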