MDIL-SNU / SevenNet

SevenNet - a graph neural network interatomic potential package supporting efficient multi-GPU parallel molecular dynamics simulations.
https://pubs.acs.org/doi/10.1021/acs.jctc.4c00190
GNU General Public License v3.0

Get nan values during train #90

Open thangckt opened 5 days ago

thangckt commented 5 days ago

Hello,

I am trying to train a simple model, but all of the loss values are NaN. The log file is attached.

Could you give me some guidance? Thank you very much.

log.log

YutackPark commented 5 days ago

It seems like your data has no stress label or the label is strange (see 'Stress distribution' of log).

Have you tried setting is_train_stress to False? The key is under train:
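
As a minimal sketch of that setting, assuming the input file uses the usual nested YAML layout (only the `train` and `is_train_stress` names come from this comment; the indentation width is a guess):

```yaml
train:
    # Skip the stress term in the loss when the dataset has no usable stress labels
    is_train_stress: False
```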

thangckt commented 5 days ago

Hi @YutackPark, I set it to False.

Now training stops without any error. The log ends at:

Trainer initialized, ready to training
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
Epoch 1/10  lr: 0.001000
------------------------------------------------------------------------------------------------------------------------

Do you know why? This is my input: input.txt

YutackPark commented 5 days ago

Firstly, you should uncomment "# - ['TotalLoss', 'None']". SevenNet needs total loss to determine the best checkpoint to save. However, SevenNet should raise an error and quit if this is the case.
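
For illustration, the uncommented entry might sit in the input roughly as below. Only the `['TotalLoss', 'None']` line is taken from this thread; the `error_record` key name, its placement under `train`, and the other two entries are assumptions and should be checked against your own input file.

```yaml
train:
    error_record:                    # assumed key name, not confirmed in this thread
        - ['Energy', 'RMSE']         # assumed example entry
        - ['Force', 'RMSE']          # assumed example entry
        - ['TotalLoss', 'None']      # uncommented: used to pick the best checkpoint
```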

I could not reproduce the issue with the same input and a different training set. Maybe training is just very slow. Could you share your dataset, if you don't mind?

thangckt commented 5 days ago

Hi @YutackPark,

The dataset is at this link.

With PR #89, you can set the input as

data_format: 'ase'
data_format_args:
    energy_key: 'TotEnergy'
    force_key: 'force'

Then you can reproduce the problem. I can confirm that the problem above occurs on Windows; when I test on Linux, the problem disappears and the code runs fine.