Unable to restart some GaMD simulations

jeeberhardt commented 2 years ago

Hi,

I am currently testing the GaMD protocol on "big" systems (a GPCR in a lipid membrane), and I am having trouble restarting some MD simulations that reached the time limit. For an unknown reason, the system becomes unstable after only a few steps. FYI, I am running duplicates of the same system, and I am able to restart some of them successfully (after several attempts). Do you have any suggestions for fixing this?

Attached are the input XML file and the error that I am getting.

Thanks!

Best, Jerome.

lvotapka commented 2 years ago

Hi Jerome,

That is actually not an error. It is an expected, harmless warning that arises from an older version of ParmEd. Your simulations are likely continuing to run just fine after the warning is raised. Do they continue running, or do they halt after this warning?

jeeberhardt commented 2 years ago

Hi Lane,

It looks like ParmEd was the culprit here: I updated it to the latest version and it works. Thanks!

To answer your question: yes, the calculations stopped right after the warning.

EDIT: It seems I spoke too soon. I have the latest version of ParmEd (3.4.3 from conda-forge), and I still get the same issue with the few remaining MD simulations that I was unable to restart. Every time, the simulation stops.

lvotapka commented 2 years ago

Hi Jerome,

There is no error being produced, only a warning about the OpenMM import, which should not affect the simulations. Are you sure that no other error message is displayed when your simulations fail? Also, it's more convenient if you just paste the errors directly into the comments, so that we don't have to download the error file and open it manually.

jeeberhardt commented 2 years ago

Hi Lane,

Sorry about that. I don't know where my head was that day; I just realized that I didn't post the correct error message. To make sure the problem was not simply a warning being treated as an error on our local cluster, I restarted the MD simulation in an interactive job.

Here's the output:

(mm) srun --nodes=1 --cpus-per-task=4 --mem-per-cpu=4G --time=0:30:00 --qos=30min --partition=pascal --gres=gpu:1 --pty bash
srun: job 61994332 queued and waiting for resources
srun: job 61994332 has been allocated resources
(mm) ls
system.inpcrd  system.pdb  system.prmtop  date.txt  lower-dual  lower-dual.xml  md_4294967294.err  md_4294967294.out  merge.inp  run_jobs.inp
(mm) gamdRunner -p OpenCL -r xml lower-dual.xml
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
restarting from saved checkpoint: lower-dual/gamd_restart.checkpoint at step: 116340000
Running:      163660000  steps
Failure on step 500
Particle coordinate is nan
stepCount:  53.0
windowCount:  1.5799006179572742e+59
stage:  1.0
stageOneIfValueIsZeroOrNegative:  -26497191.0
stageTwoIfValueIsZeroOrNegative:  2499713502756.0
stageThreeIfValueIsZeroOrNegative:  27499449002756.0
stageFourIfValueIsZeroOrNegative:  164998148502756.0
stageFiveIfValueIsZeroOrNegative:  8399983850002756.0
Vmax_Dihedral:  2.383039914797042e+66
Vmax_Total:  8.832317274826064e+244
Vmin_Dihedral:  3.5687844108716943e+129
Vmin_Total:  -3.392748492226618e+106
Vavg_Dihedral:  -4.545932152020869e-26
Vavg_Total:  1.4060231413362273e+210
oldVavg_Dihedral:  2.883805351158775e-263
oldVavg_Total:  3.9366612489152784e-42
sigmaV_Dihedral:  -4.081211932408778e-304
sigmaV_Total:  6.518644230977538e+295
M2_Dihedral:  -8.557944129478806e+282
M2_Total:  -124503035.16378178
wVavg_Dihedral:  -3.557412228062907e+47
wVavg_Total:  -3.052445310908564e+39
k_Dihedral:  -3.8556344013299853e-240
k_Total:  -5.165762982841169e-208
k0prime_Dihedral:  -2.666305870615468e+74
k0prime_Total:  6.741552816581078e+211
k0doubleprime_Dihedral:  -0.00012973204321248944
k0doubleprime_Total:  -1.1148705641481027e+83
k0doubleprime_window_Dihedral:  -2.104236648750214e-120
k0doubleprime_window_Total:  -8.623580592758635e-267
boosted_energy_Dihedral:  -1.9095338710386873e-11
boosted_energy_Total:  -1.7873488232394516e-200
check_boost_Dihedral:  -3.895616360512721e+158
check_boost_Total:  9.627556564818428e-08
threshold_energy_Dihedral:  3119736.4695280353
threshold_energy_Total:  7.84039081015344e-286
thermal_energy:  1.7604647507373824e+109
collision_rate:  -1.0729571471930571e-90
vscale:  1.0
fscale:  -0.0
noisescale:  0.0
StartingPotentialEnergy_Dihedral:  nan
StartingPotentialEnergy_Total:  nan
ForceScalingFactor_Dihedral:  4.3550441268881773e+136
ForceScalingFactor_Total:  1.980489975163592e+28
BoostPotential_Dihedral:  1.8425597917388823e-49
BoostPotential_Total:  4.845944560916026e-293
k0_Dihedral:  -9.233254446365938e-282
k0_Total:  -1.2558769805332703e+249
sigma0_Total:  -1.6112857510977502e-233
sigma0_Dihedral:  6.004057998728309e-61

After that, the execution just stops. Sometimes it takes several retries/restarts to finally get past this error. It only happens when I try to restart an MD simulation, never during a run.
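To see at a glance which of the dumped integrator variables look corrupted, a small script along these lines may help. This is only a sketch: `flag_suspicious` is a hypothetical helper (not part of gamd-openmm), and the `1e30` cutoff is an arbitrary assumption rather than a physically derived bound. It reads the `name:  value` lines from the output above and reports values that are NaN, infinite, or larger in magnitude than the cutoff. It does not flag implausibly tiny magnitudes.

```python
import math


def flag_suspicious(log_text, max_abs=1e30):
    """Return {name: value} for dumped variables that are non-finite
    or whose magnitude exceeds max_abs (an assumed, arbitrary cutoff)."""
    flagged = {}
    for line in log_text.splitlines():
        # Split at the first colon; lines without one are ignored.
        name, sep, raw = line.partition(":")
        if not sep:
            continue
        try:
            value = float(raw.strip())
        except ValueError:
            continue  # not a "name: number" line (e.g. warnings, srun messages)
        if not math.isfinite(value) or abs(value) > max_abs:
            flagged[name.strip()] = value
    return flagged


# A few lines copied from the output above, used as sample input.
sample = """\
stepCount:  53.0
Vmax_Total:  8.832317274826064e+244
StartingPotentialEnergy_Total:  nan
vscale:  1.0
"""

for name, value in flag_suspicious(sample).items():
    print(f"{name}: {value}")
```

In practice the same function could be applied to the full captured output, for example the contents of the `.out` or `.err` file from a failed restart.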

Best,

MiaoLab20 / gamd-openmm

Unable to restart some GaMD simulations #24