Hello MACE developers, your help would be appreciated:
Describe the bug
LAMMPS with ML-MACE crashes at a different timestep on every run of the very same input script on the very same hosts, always with the same stack trace reporting an illegal memory access.
The LAMMPS script works as expected with the pre-trained MACE-MP-0 L0 model provided by this repository, but fails at a seemingly random timestep with my own trained MACE model. I used the same training command as for MACE-MP-0 L0, except for the distributed/num_workers options.
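For reference, the training command had roughly this shape (a sketch, not the exact command; all values below are placeholders standing in for the MACE-MP-0 L0 settings, with the distributed/num_workers options added):

mace_run_train \
    --name="model_L0" \
    --model="ScaleShiftMACE" \
    --hidden_irreps="128x0e" \
    --train_file="train.xyz" \
    --r_max=6.0 \
    --batch_size=32 \
    --max_num_epochs=200 \
    --device=cuda \
    --distributed \
    --num_workers=8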
To Reproduce
Steps to reproduce the behavior:
LAMMPS Input script:
variable dt equal dt
variable time equal time
variable temp equal temp
variable etotal equal etotal
variable press equal press
variable lx equal lx
variable vol equal vol
variable density equal density
read_data data.init.read
replicate 2 2 2
newton on
pair_style mace no_domain_decomposition
pair_coeff * * model-lammps_L0.pt Si O C H
compute temp all temp
compute com all com
compute keatom all ke/atom
thermo 10
dump d1 all atom 10 dumpmin.atom
minimize 0.0 1.0e-8 5000 100000
undump d1
write_restart restart.min.ac
write_data data.min.read
timestep 0.0001
variable tempini equal 1000
variable tempfin equal 3000
variable rate equal 100 #1E-2K/fs
variable nstep equal "(v_tempfin - v_tempini)/v_rate/v_dt"
variable neverydmp equal "v_nstep/20"
variable neveryprnt equal "v_nstep/200"
variable vscale equal 1.0
#print "${neverydmp}"
#------------------------------------------------------------------------------------------------------------
# Temperature ramp
reset_timestep 0
velocity all create ${tempini} 142857 mom yes rot yes dist gaussian
#fix fi3 all print ${neveryprnt} "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}" screen no append thermovals.dat
fix fi3 all print 100 "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}" screen no append thermovals.dat
fix fi2 all deform 1 x scale ${vscale} y scale ${vscale} z scale ${vscale} remap none
fix fi1 all nvt temp ${tempini} ${tempfin} 0.010
dump d1 all atom ${neverydmp} dumpmeltramp.atom
thermo ${neverydmp}
thermo_style custom step temp lx ly lz etotal pxx pyy pzz
run ${nstep}
unfix fi1
unfix fi2
unfix fi3
undump d1
write_restart restart.meltramp.ac
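(For reference: with dt = 0.0001, these variables give v_nstep = (3000 - 1000)/100/0.0001 = 200000 ramp steps, with dump/thermo output every v_neverydmp = 10000 steps.)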
Input files (initial structure, trained LAMMPS model): input_files.zip

Stacktrace:

System setup (please complete the following information):
PyTorch 1.13.1-rc1, compiled from source (the pre-compiled zip file does not work on SLES 12 because of its old GLIBC version)

Additional context
The simulation is completely stable with the pre-trained MACE-MP-0 L0 model. It is possible that my trained model file was produced with a different PyTorch version than the one LAMMPS was built against.
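If it is useful, this is the minimal check I can run to see whether the exported TorchScript file even loads under this PyTorch build (file name as in the script above; a clean load would of course not rule out subtler version mismatches):

import torch

# Report the PyTorch build the check runs under.
print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())

# torch.jit.load fails loudly if the archive is incompatible with this
# PyTorch build; map_location keeps the check independent of the GPU.
model = torch.jit.load("model-lammps_L0.pt", map_location="cpu")
print(model)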
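As a further isolation test, a sketch along these lines could drive the raw trained model (rather than the LAMMPS export) through ASE, to check whether the model itself is stable outside LAMMPS. Here my_mace.model is a placeholder for the training checkpoint, the Z_of_type mapping follows the pair_coeff order Si O C H, and MACECalculator's exact signature may differ between MACE versions:

from ase.io import read
from mace.calculators import MACECalculator

# Map LAMMPS numeric atom types to elements (1=Si, 2=O, 3=C, 4=H,
# matching the pair_coeff line in the input script above).
atoms = read("data.min.read", format="lammps-data",
             Z_of_type={1: 14, 2: 8, 3: 6, 4: 1})

# my_mace.model is a placeholder for the raw training checkpoint.
atoms.calc = MACECalculator(model_paths="my_mace.model", device="cuda")

print("E =", atoms.get_potential_energy())
print("max |F| =", abs(atoms.get_forces()).max())

If the PyTorch versions do turn out to differ, re-exporting the model with mace/cli/create_lammps_model.py under the same PyTorch that LAMMPS links against would be the obvious next thing for me to try.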