Closed guaguabujianle closed 10 months ago
Instability is a common issue with large pretraining datasets like MPtrj, because occasional outliers remain even after our data-cleaning efforts. The most obvious problem in your training code is that you used MSE loss, which is sensitive to outliers, whereas the pretrained CHGNet was trained with Huber loss, as mentioned in the paper.
If you encounter this issue again after switching to Huber loss, please consider adjusting the learning rate, batch size, etc.
You can reload the last epoch where the loss did not spike with:
trainer = Trainer.load('model_path')
Here is an example of a more stable trainer configuration:
from chgnet.trainer import Trainer

# Define Trainer (`model` is the CHGNet instance being trained)
trainer = Trainer(
    model=model,
    targets="efsm",
    energy_loss_ratio=1,
    force_loss_ratio=1,
    stress_loss_ratio=0.1,
    mag_loss_ratio=0.1,
    optimizer="Adam",
    weight_decay=0,
    scheduler="CosLR",
    criterion="Huber",
    delta=0.1,
    epochs=30,
    starting_epoch=0,
    learning_rate=1e-3,
    use_device="cuda",
    print_freq=1000,
)
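To illustrate why the choice of criterion matters here, the following is a minimal pure-Python sketch comparing MSE with a Huber loss at delta=0.1 on a small batch that contains one outlier. The error values are hypothetical and chosen only for illustration, and the Huber formula is the standard textbook definition, which I assume matches what criterion="Huber" computes in the Trainer.

```python
def mse_loss(pred, target):
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=0.1):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


# Hypothetical force errors: three typical values, then one outlier.
target = [0.0, 0.0, 0.0, 0.0]
clean = [0.05, -0.08, 0.10, 0.06]
with_outlier = [0.05, -0.08, 0.10, 15.0]

for name, pred in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: MSE={mse_loss(pred, target):.4f}  "
        f"Huber={huber_loss(pred, target):.4f}"
    )
```

Because the Huber loss grows only linearly beyond delta, a single outlier contributes far less to the batch loss (and to its gradient) than it does under MSE, where its contribution grows quadratically.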
Hi, thank you very much for your help! The solution addressed my issue. I have a follow-up query. I've developed a model to predict the relaxed structure directly from the unrelaxed structure. I aim to train my model using the MPtrj dataset you provided. I'm primarily interested in the initial and final relaxed structures, rather than the intermediate stages. However, within the MPtrj dataset, each mp_id is associated with multiple structures. For instance, mp-1054 includes mp-913687-0-0, mp-1054-1-3, mp-1054-1-1, mp-1054-1-0, and mp-1793540-0-0. How can I determine which structure corresponds to the initial structure and which represents the final relaxed structure for a given mp_id? Would it be accurate to sort them based on energy_per_atom and then select the structure with the highest energy_per_atom as the initial structure, and the one with the lowest energy_per_atom as the final relaxed structure? Thank you.
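For concreteness, the energy-based selection proposed above could be sketched as follows. This is only an illustration of the heuristic as asked, not a confirmed way to identify the initial and relaxed structures. The nested layout {mp_id: {frame_id: {"energy_per_atom": ...}}} is an assumption about the dataset, the function name is hypothetical, and the energy values are invented for the example.

```python
# Toy stand-in for the dataset. ASSUMED layout:
# {mp_id: {frame_id: {"energy_per_atom": float, ...}}}
# Frame IDs are taken from the question; the energies are made up.
toy_mptrj = {
    "mp-1054": {
        "mp-913687-0-0": {"energy_per_atom": -3.10},
        "mp-1054-1-3": {"energy_per_atom": -3.02},
        "mp-1054-1-1": {"energy_per_atom": -3.21},
        "mp-1054-1-0": {"energy_per_atom": -3.25},
        "mp-1793540-0-0": {"energy_per_atom": -3.18},
    }
}


def energy_extremes(frames):
    """Return (highest-energy frame ID, lowest-energy frame ID)."""
    by_energy = sorted(frames, key=lambda fid: frames[fid]["energy_per_atom"])
    return by_energy[-1], by_energy[0]


# Map each mp_id to its (candidate initial, candidate relaxed) frame IDs.
pairs = {mp_id: energy_extremes(frames) for mp_id, frames in toy_mptrj.items()}

for mp_id, (highest, lowest) in pairs.items():
    print(f"{mp_id}: highest energy = {highest}, lowest energy = {lowest}")
```

Note that this only ranks frames by energy. It does not establish that the two selected frames belong to the same relaxation trajectory, which is the open part of the question.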
Email (Optional)
yangzd@mail2.sysu.edu.cn
Version
v0.2.1
What happened?
Dear Authors,
Thank you for sharing the source code for CHGNet. I am attempting to train CHGNet from scratch using the Materials Project trajectory (MPtrj) dataset that you've provided. However, I've noticed some instability during the training process. Specifically, there are instances where the loss spikes significantly. The training process looks like this:
Epoch: [0][1/44451] | Time (1.624)(0.338) | Loss 0.3169(0.3169) | MAE e 0.403(0.403) f 0.052(0.052)
Epoch: [0][100/44451] | Time (0.623)(0.335) | Loss 0.5776(1.4586) | MAE e 0.412(0.477) f 0.139(0.206)
Epoch: [0][200/44451] | Time (0.626)(0.339) | Loss 0.4351(1.6517) | MAE e 0.399(0.458) f 0.116(0.208)
Epoch: [0][300/44451] | Time (0.892)(0.607) | Loss 0.9594(1.2858) | MAE e 0.511(0.443) f 0.238(0.192)
...
Epoch: [0][26300/44451] | Time (0.996)(0.654) | Loss 1.3189(0.6118) | MAE e 0.160(0.188) f 0.425(0.217)
Epoch: [0][26400/44451] | Time (0.995)(0.653) | Loss 0.3473(0.6110) | MAE e 0.214(0.188) f 0.263(0.217)
Epoch: [0][26500/44451] | Time (0.998)(0.655) | Loss 124.5792(1.6960) | MAE e 0.684(0.189) f 2.491(0.220)
Epoch: [0][26600/44451] | Time (0.996)(0.654) | Loss 0.6696(641.2061) | MAE e 0.470(0.191) f 0.318(0.249)
Epoch: [0][26700/44451] | Time (0.995)(0.653) | Loss 3.6322(638.8499) | MAE e 0.445(0.192) f 0.847(0.251)
Epoch: [0][26800/44451] | Time (0.998)(0.656) | Loss 6.0389(636.4801) | MAE e 0.522(0.193) f 0.643(0.252)
....
Epoch: [0][37300/44451] | Time (1.015)(0.665) | Loss 1.2740(1311.9422) | MAE e 0.544(0.279) f 0.345(0.567)
Epoch: [0][37400/44451] | Time (1.015)(0.664) | Loss 2.2951(1308.5920) | MAE e 0.447(0.280) f 0.466(0.567)
Epoch: [0][37500/44451] | Time (1.017)(0.666) | Loss 5.7568(1305.1261) | MAE e 0.596(0.280) f 0.605(0.567)
Epoch: [0][37600/44451] | Time (1.016)(0.666) | Loss 26822.4453(1302.3834) | MAE e 0.540(0.281) f 15.234(0.567)
Epoch: [0][37700/44451] | Time (1.016)(0.665) | Loss 13.8838(1299.0148) | MAE e 0.538(0.281) f 0.703(0.567)
Epoch: [0][37800/44451] | Time (1.015)(0.664) | Loss 2.9397(1295.8719) | MAE e 0.441(0.282) f 0.345(0.567)
Is this expected behavior? I would greatly appreciate it if you could provide a full demo (maybe a .py file) illustrating how to train CHGNet from scratch. I aim to compare my recently developed model with CHGNet under identical conditions. I've included a code snippet above for your reference. Thank you very much.