LAMMPS-MACE with Kokkos: Illegal memory access encountered with trained MACE but works with MACE-MP-0

mhsiron commented 4 months ago

Hello MACE developers, your help would be appreciated:

Describe the bug: LAMMPS with ML-MACE crashes at a different timestep each time the very same input script is run on the very same hosts, always with the same stack trace reporting that an illegal memory access was encountered.

The LAMMPS script works as expected with the MACE-MP-0 L0 model provided by this repository, but fails at a seemingly random point in the run with my own trained MACE model. I used the same training command as for MACE-MP-0 L0, except for the distributed/num_workers options.

To Reproduce Steps to reproduce the behavior:

LAMMPS Input script:

variable dt     equal dt
variable time   equal time
variable temp   equal temp 
variable etotal equal etotal
variable press  equal press 
variable lx     equal lx 
variable vol    equal vol 
variable density equal density

read_data               data.init.read
replicate 2 2 2

newton on
pair_style mace no_domain_decomposition
pair_coeff * * model-lammps_L0.pt  Si O C H 

compute         temp    all temp
compute         com         all     com
compute         keatom      all     ke/atom

thermo 10
dump d1                 all atom 10 dumpmin.atom  
minimize                0.0     1.0e-8  5000    100000
undump d1

write_restart   restart.min.ac
write_data      data.min.read

timestep                0.0001
variable tempini        equal 1000
variable tempfin        equal 3000
variable rate           equal 100 #1E-2K/fs
variable nstep          equal "(v_tempfin - v_tempini)/v_rate/v_dt"
variable neverydmp      equal "v_nstep/20"
variable neveryprnt equal "v_nstep/200"
variable vscale         equal 1.0
#print "${neverydmp}"

#------------------------------------------------------------------------------------------------------------
# Temperature ramp
reset_timestep  0
velocity                all     create ${tempini} 142857        mom yes rot yes dist gaussian
#fix fi3                all print ${neveryprnt} "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}"  screen no append thermovals.dat
fix fi3                 all print 100 "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}"  screen no append thermovals.dat
fix fi2                 all deform 1 x scale ${vscale} y scale ${vscale} z scale ${vscale} remap none
fix fi1                 all     nvt temp ${tempini} ${tempfin} 0.010
dump d1                 all atom ${neverydmp} dumpmeltramp.atom  
thermo ${neverydmp}
thermo_style    custom  step temp lx ly lz etotal pxx pyy pzz
run ${nstep}
unfix fi1
unfix fi2
unfix fi3
undump d1

write_restart   restart.meltramp.ac

Input files below (initial structure, LAMMPS trained model): input_files.zip
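
As a rough check of how long the temperature ramp in the input script is, the run length can be worked out from the script's own variables. This is only a sketch of the arithmetic in plain Python; the Python variable names are illustrative, and `round` is used here to sidestep floating-point representation of the timestep.

```python
# Values copied from the input script above.
temp_ini = 1000.0   # variable tempini
temp_fin = 3000.0   # variable tempfin
rate = 100.0        # variable rate (temperature change per unit time)
dt = 0.0001         # timestep

# variable nstep equal "(v_tempfin - v_tempini)/v_rate/v_dt"
nstep = round((temp_fin - temp_ini) / rate / dt)
n_every_dump = nstep // 20     # variable neverydmp
n_every_print = nstep // 200   # variable neveryprnt

print(nstep, n_every_dump, n_every_print)
```

With these values the ramp spans 200000 steps, with a dump every 10000 steps.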

Stacktrace:

cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /tools/mace/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161
Backtrace:
Kokkos::Impl::save_stacktrace() [0x2aaaabdf51a5]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x2aaaabdec65a]
Kokkos::Impl::host_abort(char const*) [0x2aaaabdec6bb]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x2aaaabdfa74a]
Kokkos::Impl::cuda_device_synchronize(std::string const&) [0x2aaaabdfa811]
Kokkos::Impl::ExecSpaceManager::static_fence(std::string const&) [0x2aaaabdd4395]
void Kokkos::deep_copy<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks, int, Kokkos::LayoutLeft, Kokkos::Cuda, void>(Kokkos::View<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks> const&, Kokkos::View<int, Kokkos::LayoutLeft, Kokkos::Cuda, void> const&, std::enable_if<((std::is_void<Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::specialize>::value&&std::is_void<Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Cuda, void>::specialize>::value)&&(((unsigned int)Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::rank)==((unsigned int)(0))))&&(((unsigned int)Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Cuda, void>::rank)==((unsigned int)(0))), void>::type*) [0x2aaaab5ccef6]
LAMMPS_NS::NBinKokkos<Kokkos::Cuda>::bin_atoms() [0x2aaaab6b947b]
void LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda>(int) [0x2aaaab65eb61]
LAMMPS_NS::VerletKokkos::run(int) [0x2aaaab98bd5c]
LAMMPS_NS::Run::command(int, char**) [0x2aaaab4cf6bb]
LAMMPS_NS::Input::execute_command() [0x2aaaab364765]
LAMMPS_NS::Input::file() [0x2aaaab364a4d]
LAMMPS_NS::Input::include() [0x2aaaab364ffd]
LAMMPS_NS::Input::execute_command() [0x2aaaab363f37]
LAMMPS_NS::Input::file() [0x2aaaab364a4d]
[0x40473a]
__libc_start_main [0x2aaaaefe7ac5]
[0x4048d9]
[scff292100:22082] *** Process received signal ***
[scff292100:22082] Signal: Aborted (6)
[scff292100:22082] Signal code:  (-6)
[scff292100:22082] [ 0] /lib64/libpthread.so.0(+0x11ce0)[0x2aaaaec52ce0]
[scff292100:22082] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaaeffc0c7]
[scff292100:22082] [ 2] /lib64/libc.so.6(abort+0x13a)[0x2aaaaeffd49a]
[scff292100:22082] [ 3] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl17human_memory_sizeEm+0x0)[0x2aaaabdec6c0]
[scff292100:22082] [ 4] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl25cuda_internal_error_abortE9cudaErrorPKcS3_i+0xea)[0x2aaaabdfa74a]
[scff292100:22082] [ 5] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl23cuda_device_synchronizeERKSs+0xb1)[0x2aaaabdfa811]
[scff292100:22082] [ 6] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl16ExecSpaceManager12static_fenceERKSs+0x25)[0x2aaaabdd4395]
[scff292100:22082] [ 7] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos9deep_copyIiJNS_10LayoutLeftENS_6DeviceINS_6OpenMPENS_9HostSpaceEEENS_12Experimental14EmptyViewHooksEEiJS1_NS_4CudaEvEEEvRKNS_4ViewIT_JDpT0_EEERKNS9_IT1_JDpT2_EEEPNSt9enable_ifIXaaaaaasrSt7is_voidINS_10ViewTraitsISA_JSC_EE10specializeEE5valuesrSN_INSO_ISG_JSI_EE10specializeEE5valueeqcvjsrSP_4rankcvjLi0EeqcvjsrSS_4rankcvjLi0EEvE4typeE+0x196)[0x2aaaab5ccef6]
[scff292100:22082] [ 8] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS10NBinKokkosIN6Kokkos4CudaEE9bin_atomsEv+0x143b)[0x2aaaab6b947b]
[scff292100:22082] [ 9] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS14NeighborKokkos12build_kokkosIN6Kokkos4CudaEEEvi+0x1d1)[0x2aaaab65eb61]
[scff292100:22082] [10] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS12VerletKokkos3runEi+0x124c)[0x2aaaab98bd5c]
[scff292100:22082] [11] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xd1b)[0x2aaaab4cf6bb]
[scff292100:22082] [12] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xd65)[0x2aaaab364765]
[scff292100:22082] [13] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x15d)[0x2aaaab364a4d]
[scff292100:22082] [14] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input7includeEv+0xed)[0x2aaaab364ffd]
[scff292100:22082] [15] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0x537)[0x2aaaab363f37]
[scff292100:22082] [16] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x15d)[0x2aaaab364a4d]
[scff292100:22082] [17] lmp[0x40473a]
[scff292100:22082] [18] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaefe7ac5]
[scff292100:22082] [19] lmp[0x4048d9]
[scff292100:22082] *** End of error message ***
/data/relax_rdf_reaxff_si16o25c15_ens-0_qu-10Kpps_3000K1.4DEN/run.sh: line 18: 22082 Aborted                 (core dumped) lmp -k on g 1 -sf kk -in in.rdf

System setup (please complete the following information):

- OS: SLES12
- LAMMPS compile command: cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE80=yes -DBUILD_OMP=yes -D BUILD_MPI=yes -D BUILD_SHARED_LIBS=yes -D LAMMPS_EXCEPTIONS=yes -D PKG_OPENMP=yes -D PKG_OPENMP=yes -D Kokkos_ENABLE_OPENMP=yes -D Kokkos_ENABLE_CUDA=yes -DCUDATOOLKIT_ROOT_DIR=/usr/local/cuda-11.6 -DKokkos_ARCH_PASCAL60=no -DCMAKE_PREFIX_PATH=/tools/pytorch/torch/share/cmake/ -DCMAKE_CXX_COMPILER=/tools/mace/lammps/lib/kokkos/bin/nvcc_wrapper -D Kokkos_ARCH_AMDAVX=yes -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no -D Kokkos_ENABLE_CUDA_UVM=no -D PKG_ML-MACE=yes
- GCC 8.2
- CUDA 11.6
- A100 GPUs
- PyTorch 1.13.1-rc1 compiled (pre-compiled zip file doesn't work with SLES12 due to old GLIBC version)

Additional context: The simulation is completely stable with the pre-trained model. It is possible that my trained model file was produced with a different PyTorch version.

wcwitt commented 4 months ago

Hi - can you try with

atom_modify map yes

as mentioned here https://mace-docs.readthedocs.io/en/latest/guide/lammps.html#using-the-model-in-lammps?

I'm not sure that's it, but worth checking.

owen-rett commented 4 months ago

Encountering a similar issue; I've attached the error file and relevant sections from the LAMMPS input script.

It seems to be failing consistently during an NVT run, but has no issue with an NPT-style run. I've tried restarting, using different random seeds and starting structures, but the simulation consistently fails about 2000 steps into the NVT section. Attachments: Array.25022699_1.err.txt, md_LPK.lmp.txt

Apologies for the somewhat badly commented LAMMPS script; I was just modifying an old script to quickly get some data.

wcwitt commented 4 months ago

@owen-rett I'm trying to understand this error message

/var/spool/slurmd/job25022700/slurm_script: line 54: 3599961 Aborted                 /home/gridsan/orettenmaier/Lammps_MACE/lammps/build/lmp -k on g 1 -sf kk -in md_LPK.lmp
Traceback (most recent call last):
  File "MSD_K_Det.py", line 29, in <module>
    K_Zr, K_Ce, K_O = msd_K_trans(Temp, Mean_msdZr, Mean_msdO)
  File "MSD_K_Det.py", line 10, in msd_K_trans
    if msdCe < 0.00001:
UnboundLocalError: local variable 'msdCe' referenced before assignment

Is this the root cause or something that's happening after LAMMPS fails?

owen-rett commented 4 months ago

Ah, sorry, the slurm script I'm using runs an initial simulation to determine lattice constants and mean squared displacements, calls a Python script to prepare subsequent simulations, and then performs those simulations. The error there comes from a typo in the Python script (introduced when shifting from 3 species to 2 species). That's my fault. The quoted error is entirely unrelated to LAMMPS.

However, the initial simulation still exhibits the same error as mhsiron's during an NVT section (with Langevin integration).

Edit: For Clarity

wcwitt commented 4 months ago

I don't see anything wrong with your input on first pass. If it's true that the problem is happening long into an NVT simulation, I'm worried it will be challenging to debug. Can you try to reduce it to a minimal example (as few LAMMPS commands as possible) that fails reliably? I can try to reproduce from that - you can email your model if you don't want to post it. Sorry, not sure what else to suggest.

mhsiron commented 4 months ago

Hi - can you try with
atom_modify map yes
as mentioned here https://mace-docs.readthedocs.io/en/latest/guide/lammps.html#using-the-model-in-lammps?

I'm not sure that's it, but worth checking.

Hi @wcwitt, same error with atom_modify map yes. However, I may have found out a bit more about what might cause the error:

It appears that models trained with L>0 crash. It turns out my training command had max_L=1. Only models trained with max_L=0 seem to work for my simulation. Could this be a memory issue? My A100 GPU has 80GB of memory, and this seems like a relatively small simulation (~1300 atoms).

For the MACE-MP-0 medium and large models I get an explicit CUDA memory stack trace, which I do not get for the L0 model; for my trained models I get the more ambiguous stack trace above.

wcwitt commented 4 months ago

Hi @mhsiron thanks for this. That does seem a bit small for a memory issue, but could you try with ~500 atoms just to see?

mhsiron commented 4 months ago

Hi @wcwitt,

I did a ~150 atom simulation with an L1 model, and it indeed works. GPU utilization is in the range of ~60-90%, and memory usage seems to be around 22G. From this it makes sense that ~10x more atoms would make the GPU run out of memory.

My question: are the L1 models that memory intensive? Or does this point to a potential memory leak somewhere?
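
The extrapolation above can be made explicit. The sketch below assumes memory grows linearly with the number of atoms at fixed density, which is only an approximation: it ignores fixed overhead (CUDA context, model weights, allocator caching), so the true requirement could be noticeably lower. The function name is illustrative.

```python
def extrapolate_memory_gb(mem_gb, n_atoms_measured, n_atoms_target):
    """Linearly scale a measured memory footprint to a larger system.

    Assumption: memory is proportional to atom count at fixed density.
    """
    return mem_gb * n_atoms_target / n_atoms_measured

# ~22 GB observed for ~150 atoms; the failing system has ~1300 atoms.
estimate = extrapolate_memory_gb(22.0, 150, 1300)
print(f"{estimate:.0f} GB estimated vs 80 GB available")
```

Under that assumption the ~1300 atom run would need roughly 190 GB, well beyond a single 80 GB A100.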

gabor1 commented 4 months ago

certainly memory requirements go up when you go from L=0 to L=1 and even more if you go to L=2

ilyes319 commented 4 months ago

What's the density of your system (in terms of the average number of neighbors during the simulation)? The L=1 model should run 1K atoms on 80GB quite easily if the density is below 50.
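
One quick way to estimate the average number of neighbors is to treat the system as uniform and count the atoms expected inside a sphere of the model's cutoff radius. This is a minimal sketch: the cell volume and cutoff below are placeholder values, not numbers taken from this thread, and a real system with voids or clustering can deviate from the uniform estimate.

```python
import math

def mean_neighbors(n_atoms, volume, cutoff):
    """Average number of neighbors within `cutoff` for a uniform system."""
    number_density = n_atoms / volume                 # atoms per cubic Angstrom
    sphere_volume = 4.0 / 3.0 * math.pi * cutoff**3   # cubic Angstrom
    return number_density * sphere_volume

# Hypothetical inputs: replace the volume (A^3) and cutoff (A) with the
# actual cell volume and the cutoff the model was trained with.
print(round(mean_neighbors(n_atoms=1300, volume=25000.0, cutoff=5.0), 1))
```

A more faithful count would come from an actual neighbor list built on a snapshot of the trajectory.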

wcwitt commented 4 months ago

Hi @mhsiron thanks for sticking with this - we definitely appreciate the detailed reports.

My question, are the L1 models that memory intensive? Or does this point to a potential memory leak somewhere?

I'm not sure. Like @ilyes319, I wouldn't normally expect problems on that machine with L=1, <2000 atoms. But use of the LAMMPS interface has been fairly low until recently, so I'm open to all options.

If you have time, you could try launching analogous calculations from Python/ASE, just to see if the memory limitations are similar.

mhsiron commented 4 months ago

Hi all,

Per LAMMPS output for L1 on the <150 atom simulation:

Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 14
  ghost atom cutoff = 14
  binsize = 14, bins = 2 1 2
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair mace/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device

These appear to be the defaults, as I have not set any command that would change them. I have tried adding: neighbor 10.0 bin

But it appears to be overridden? At least I get the same LAMMPS output.

I added: neigh_modify one 50 page 2500

Will report back!

mhsiron commented 4 months ago

Does not seem to help. Actually, the L0 simulation also proves unstable. The only stable model I can run is the pre-trained MACE-MP-0 small L0 model.

Another peculiar thing I noticed is that the temperature in the log suddenly drops to a very low value prior to the crash. Example below; the columns are: time (ps), temperature (K), total energy (eV), pressure (bar), length in x (A), volume (A^3), density:

11.5 1458.80157385003 -8173.93823005146 -3148.14775245486 29.49934686 24560.4591488324 1.4538743316293
11.75 1375.7611038842 -8186.71536429021 2634.97363948781 29.49934686 24560.4591488324 1.4538743316293
12 1425.31323684682 -8181.87041265065 -666.217691988384 29.49934686 24560.4591488324 1.4538743316293
12.25 1431.21964335045 -8191.86786235401 -2823.12112557144 29.49934686 24560.4591488324 1.4538743316293
12.5 0.00157387697505881 -8344.35923152783 -8366.67115057049 29.49934686 24560.4591488324 1.4538743316293
12.75 1.22775118971289e-17 -8326.09875594358 -6949.76472390582 29.49934686 24560.4591488324 1.4538743316293
13 2.28493309030338e-08 -8371.61500883076 -11399.9341038866 29.49934686 24560.4591488324 1.4538743316293
13.25 8.77094472954126e-07 -8391.24708921418 -13039.2885264807 29.49934686 24560.4591488324 1.4538743316293
13.5 5.24128354736295e-06 -8404.93236755041 -13464.1034805214 29.49934686 24560.4591488324 1.4538743316293
13.75 1.67908974908941e-05 -8416.07696635189 -13365.7167324098 29.49934686 24560.4591488324 1.4538743316293
14 3.92758414519387e-05 -8426.00378684878 -13130.2069142645 29.49934686 24560.4591488324 1.4538743316293
14.25 7.64073353514439e-05 -8435.0017582836 -12967.8196410895 29.49934686 24560.4591488324 1.4538743316293
14.5 0.00013213828234377 -8443.90439332092 -12188.7756332344 29.49934686 24560.4591488324 1.4538743316293
14.75 0.000265051994964286 -8454.90567015758 -11823.8413844585 29.49934686 24560.4591488324 1.4538743316293
15 0.000267630919262236 -8462.56120861329 -11693.4464379121 29.49934686 24560.4591488324 1.4538743316293
15.25 0.000345222275526478 -8468.46583212703 -11477.1279654604 29.49934686 24560.4591488324 1.4538743316293
15.5 0.00043740866282505 -8473.63262891532 -11208.4617558759 29.49934686 24560.4591488324 1.4538743316293
15.75 0.000547741723244913 -8478.19656335514 -10935.5283521724 29.49934686 24560.4591488324 1.4538743316293
16 0.0006980055037539 -8482.3480800735 -10696.3698486076 29.49934686 24560.4591488324 1.4538743316293
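A drop like this can be flagged automatically instead of spotted by eye. The following is a minimal sketch that parses whitespace-separated rows in the column order given above and reports consecutive rows where the temperature falls by more than a chosen factor. The function names and the 1e-3 threshold are illustrative choices, not part of LAMMPS or MACE.

```python
def parse_thermo(text):
    """Parse rows of 'time temperature total_energy ...' into float tuples.
    Only the first three columns are kept."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) >= 3:
            rows.append(tuple(float(x) for x in fields[:3]))
    return rows

def find_temperature_drops(rows, ratio=1e-3):
    """Return (time_before, time_after, T_before, T_after) for each pair of
    consecutive rows where T falls below ratio * previous T."""
    drops = []
    for prev, curr in zip(rows, rows[1:]):
        if prev[1] > 0 and curr[1] < ratio * prev[1]:
            drops.append((prev[0], curr[0], prev[1], curr[1]))
    return drops

# First three columns of a few rows copied from the log above.
sample = """\
12 1425.31323684682 -8181.87041265065
12.25 1431.21964335045 -8191.86786235401
12.5 0.00157387697505881 -8344.35923152783
12.75 1.22775118971289e-17 -8326.09875594358
13 2.28493309030338e-08 -8371.61500883076
"""
rows = parse_thermo(sample)
for t0, t1, temp0, temp1 in find_temperature_drops(rows):
    print(f"T fell from {temp0:.3g} K to {temp1:.3g} K between t={t0} and t={t1}")
```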

bernstei commented 4 months ago

That just sounds like an unstable model, and when the atoms explode the cell or some neighbor list or something becomes huge and it crashes. I've had similar problems fine-tuning MP0, although a colleague here is having better luck. In our case it seems to depend on how close our DFT parameters are to the ones that MPtrj used.

mhsiron commented 4 months ago

I see, thanks @bernstei. I get similar results for a DFT-trained L0 model, without starting from MP-0. Any recommendations for what kind of data to include to help make the model more stable? Should I force some dimer vs. distance data into my dataset? Or is it a problem of not training enough?

As for an L>0 model with ~1000 atoms, is that just infeasible with an 80GB graphics card from a memory standpoint?

mhsiron commented 4 months ago

I should also add, regarding the previous example, that the T drop does not necessarily lead to the simulation crashing at that point, and for the same network, the crash does not necessarily come with a sudden T drop either. The structure doesn't indicate anything peculiar either: there are no super close atoms, and there is no volume/force implosion. The same input script can lead to crashing at different time steps.

12 1425.31323684682 -8181.87041265065 -666.217691988384 29.49934686 24560.4591488324 1.4538743316293
12.25 1431.21964335045 -8191.86786235401 -2823.12112557144 29.49934686 24560.4591488324 1.4538743316293
12.5 0.00157387697505881 -8344.35923152783 -8366.67115057049 29.49934686 24560.4591488324 1.4538743316293
12.75 1.22775118971289e-17 -8326.09875594358 -6949.76472390582 29.49934686 24560.4591488324 1.4538743316293
13 2.28493309030338e-08 -8371.61500883076 -11399.9341038866 29.49934686 24560.4591488324 1.4538743316293
13.25 8.77094472954126e-07 -8391.24708921418 -13039.2885264807 29.49934686 24560.4591488324 1.4538743316293
13.5 5.24128354736295e-06 -8404.93236755041 -13464.1034805214 29.49934686 24560.4591488324 1.4538743316293
13.75 1.67908974908941e-05 -8416.07696635189 -13365.7167324098 29.49934686 24560.4591488324 1.4538743316293
14 3.92758414519387e-05 -8426.00378684878 -13130.2069142645 29.49934686 24560.4591488324 1.4538743316293
14.25 7.64073353514439e-05 -8435.0017582836 -12967.8196410895 29.49934686 24560.4591488324 1.4538743316293
14.5 0.00013213828234377 -8443.90439332092 -12188.7756332344 29.49934686 24560.4591488324 1.4538743316293
14.75 0.000265051994964286 -8454.90567015758 -11823.8413844585 29.49934686 24560.4591488324 1.4538743316293
15 0.000267630919262236 -8462.56120861329 -11693.4464379121 29.49934686 24560.4591488324 1.4538743316293
15.25 0.000345222275526478 -8468.46583212703 -11477.1279654604 29.49934686 24560.4591488324 1.4538743316293
15.5 0.00043740866282505 -8473.63262891532 -11208.4617558759 29.49934686 24560.4591488324 1.4538743316293
15.75 0.000547741723244913 -8478.19656335514 -10935.5283521724 29.49934686 24560.4591488324 1.4538743316293
16 0.0006980055037539 -8482.3480800735 -10696.3698486076 29.49934686 24560.4591488324 1.4538743316293
16.25 0.000901826930070773 -8486.26240725677 -10498.6184681285 29.49934686 24560.4591488324 1.4538743316293
16.5 0.00117301642049218 -8490.03795700827 -10331.5334883053 29.49934686 24560.4591488324 1.4538743316293
16.75 0.00154166627110766 -8493.73999867024 -10180.7036119053 29.49934686 24560.4591488324 1.4538743316293
17 0.00206806583224074 -8497.43861375988 -10033.2779155224 29.49934686 24560.4591488324 1.4538743316293
17.25 0.00287840323482185 -8501.22983009579 -9877.7939081323 29.49934686 24560.4591488324 1.4538743316293
17.5 0.00427271873252964 -8505.27391749629 -9704.80646541264 29.49934686 24560.4591488324 1.4538743316293
17.75 0.00663896018741823 -8509.7471013976 -9471.37126467432 29.49934686 24560.4591488324 1.4538743316293
18 0.0112570350367217 -8514.67546145644 -9144.47861455316 29.49934686 24560.4591488324 1.4538743316293
18.25 0.0275419175723079 -8520.64779098596 -8747.62193215947 29.49934686 24560.4591488324 1.4538743316293
18.5 0.298230169317817 -8531.21976964688 -8155.84901776477 29.49934686 24560.4591488324 1.4538743316293
18.75 1377.42958346478 -8149.90292963912 3299.29676770027 29.49934686 24560.4591488324 1.4538743316293
19 1349.90744584849 -8148.02735171268 -3364.58806787548 29.49934686 24560.4591488324 1.4538743316293
19.25 1431.3736665173 -8147.73228027619 1283.91641343253 29.49934686 24560.4591488324 1.4538743316293
19.5 1388.11414468739 -8160.03829785877 -1504.08681731051 29.49934686 24560.4591488324 1.4538743316293
19.75 1404.50542270096 -8173.47840745554 -3058.49880175371 29.49934686 24560.4591488324 1.4538743316293
20 1457.27936083213 -8167.73656168868 2996.79414644363 29.49934686 24560.4591488324 1.4538743316293

bernstei commented 4 months ago

"Same" ... "different time steps": same seed for things like random initial velocities, or are you not being quite that precise when you say "same"?

bernstei commented 4 months ago

The initial drop in T happens very fast, and then reverses very fast, yet the total energy goes down gradually, then jumps up. You should plot T (or kinetic energy) and potential energy for every time step (and preferably also save the trajectory), to see in detail what's happening during that T drop. Dropping from T= 1400 K to << 1 K seems essentially impossible to me, just for thermodynamic/stat mech reasons. You can get a T drop if you get a phase transition to a higher E phase (higher potential, so lower kinetic, energy), but that's not the usual behavior anyway (basically an endothermic reaction, so has to be entropy driven), and even if that's what was happening I don't see how it can absorb 99.9% of the KE.
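To make that split concrete, here is a minimal sketch that recovers kinetic and potential energy from the logged temperature and total energy. It assumes LAMMPS metal units (K, eV) and the default temperature definition with 3N - 3 degrees of freedom; the atom count is a hypothetical placeholder because it is not stated in this thread.

```python
K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def kinetic_energy_eV(temperature_K, n_atoms, extra_dof=3):
    """KE = (dof / 2) * k_B * T with dof = 3N - extra_dof.
    extra_dof=3 is assumed to match the default 3D temperature compute."""
    dof = 3 * n_atoms - extra_dof
    return 0.5 * dof * K_B_EV_PER_K * temperature_K

def potential_energy_eV(total_energy_eV, temperature_K, n_atoms, extra_dof=3):
    """PE = E_total - KE for a single logged row."""
    return total_energy_eV - kinetic_energy_eV(temperature_K, n_atoms, extra_dof)

n_atoms = 1200  # hypothetical; substitute the real atom count of the run
for time_ps, temp, etotal in [
    (12.25, 1431.21964335045, -8191.86786235401),
    (12.5, 0.00157387697505881, -8344.35923152783),
]:
    ke = kinetic_energy_eV(temp, n_atoms)
    pe = potential_energy_eV(etotal, temp, n_atoms)
    print(f"t={time_ps} ps  KE={ke:.3f} eV  PE={pe:.3f} eV")
```

Comparing the PE change across the drop with the KE change shows directly whether the lost kinetic energy went into potential energy or left the system.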

mhsiron commented 4 months ago

Understood -- I will generate additional training data and compare performance. Will report back if it fixes the problem.

bernstei commented 4 months ago

I didn't mean additional training data (although presumably that'll help stability). I meant looking in more detail at this LAMMPS run test, to see how it changes during the weird trajectory. Maybe independently calculate the potential energies in the configuration before/after the T drop (which is presumably associated with a PE increase, assuming total energy is conserved, at least roughly).

owen-rett commented 4 months ago

Spent a while tinkering with lammps settings; it seems the issue appears when I perform Langevin dynamics without zeroing the random force (the default). That is, combining "fix nve" and "fix langevin zero no". This seems to happen regardless of system temperature; I've tried running the system at 70 K, at 800 K, and at 1800 K, and in all cases running Langevin dynamics without zeroing of the random force results in a crash within ~2000-10000 timesteps (with a 1 fs timestep).

That said, turning on zeroing of the random force (fix langevin zero yes) seems to get rid of the crashing issue completely. This is probably a smarter choice to do in general (running with zero no was a mistake I made when setting up the lammps script above), regardless of the crashing issue, but just wanted to note this down here in case anyone in the future has the same issue. I've run a few different MACE-trained potentials, albeit trained on the same dataset, with different choices of L, cutoff, and number of irreps, and all seem to crash at around the same timestep. It does seem possible that the potentials I am using reach instability, and the crash results from that, however I've not been able to throw the configurations into DFT yet. I'll try to perform some additional testing, specifically targeting systems where DFT is tractable once I have free GPU resources next week.
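For context on what the zeroing does, here is a minimal sketch of the idea: the random forces are drawn independently per atom, so their sum is generally nonzero and pushes the center of mass around, and zeroing subtracts an equal share of that sum from every atom. This is an illustration only, not the LAMMPS implementation; the Gaussian distribution, the unit force scale, and the function names are assumptions.

```python
import random

def random_forces(n_atoms, scale, rng):
    """Draw one independent random 3-vector per atom (Gaussian, for illustration)."""
    return [[rng.gauss(0.0, scale) for _ in range(3)] for _ in range(n_atoms)]

def total_force(forces):
    """Sum of the force vectors over all atoms."""
    return [sum(f[k] for f in forces) for k in range(3)]

def zero_total_force(forces):
    """Subtract the mean force from every atom so the total is zero."""
    n = len(forces)
    mean = [component / n for component in total_force(forces)]
    return [[f[k] - mean[k] for k in range(3)] for f in forces]

rng = random.Random(0)
raw = random_forces(384, 1.0, rng)
zeroed = zero_total_force(raw)
print("net force, raw:   ", total_force(raw))
print("net force, zeroed:", total_force(zeroed))
```

With the raw forces the net force fluctuates from step to step, which is what allows the whole system to drift.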

I've been using the mace potential with "no domain decomposition" on a single 32 GB VRAM GPU, and have checked system sizes between 384 and 2592 atoms. GPU RAM usage during "standard" molecular dynamics varies between 8 GB and 30 GB depending on model parameter choice and system size.

Running dynamics with a Nose-Hoover thermostat also seems to completely remove the crashing issue, although thermostat choice is obviously dependent on the variables that are being measured, so may not be an acceptable solution in all cases.

bernstei commented 4 months ago

If you don't zero the forces the system will presumably drift. I'm not sure how the positions are processed by LAMMPS before passing to torch, and where the neighbor list happens, but is it possible that the "raw" positions end up very large, and the neighbor list code is doing something silly, e.g. trying to create bins for a very large apparent box (even though if wrapped by the pbcs they'd all be reasonable)?

mhsiron commented 4 months ago

To follow up all, and thanks for your help: adding additional training data (I added dimers vs. distance) did make the exact same input script work with no sudden temperature drop. I have not had time to check the PE vs. TE yet for the run that did fail, but I did have time to notice that the crash occurred whenever two atoms got closer than 0.5 A. In terms of L1, changing the page size + max neighbor size does also help with memory usage.

To recap -

L1 model crashed due to running out of memory on A100 80GB GPUs with system size > 1200 atoms on default neighbor/atom, page size settings. Lowering atom size or page/neighbor/atom size made L1 models run on 80GB.
L0 model crashed due to unstable model when atoms got too close together. Adding additional dimer data to L0 model made the script successfully run.
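For anyone wanting to run the same close-contact check on their own frames, here is a minimal sketch that lists atom pairs closer than a threshold under the minimum-image convention. It assumes an orthorhombic periodic box and a threshold smaller than half the shortest box length; the box and coordinates below are made-up test values, and the O(N^2) loop is only meant for systems of about a thousand atoms.

```python
import math
from itertools import combinations

def close_pairs(positions, box_lengths, threshold):
    """Return (i, j, distance) for every pair closer than threshold,
    using the minimum-image convention in an orthorhombic periodic box."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(positions), 2):
        d2 = 0.0
        for k in range(3):
            delta = a[k] - b[k]
            # Wrap the separation into [-L/2, L/2] along each axis.
            delta -= box_lengths[k] * round(delta / box_lengths[k])
            d2 += delta * delta
        dist = math.sqrt(d2)
        if dist < threshold:
            pairs.append((i, j, dist))
    return pairs

# Made-up example: atoms 0 and 1 are close only across the periodic boundary.
box = (30.0, 30.0, 30.0)
positions = [(0.1, 5.0, 5.0), (29.8, 5.0, 5.0), (10.0, 10.0, 10.0)]
for i, j, dist in close_pairs(positions, box, threshold=0.5):
    print(f"atoms {i} and {j} are {dist:.3f} A apart")
```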

gabor1 commented 4 months ago

The drift is probably responsible for the weird temperature, no? the temperature calculation is based on the atomic velocities.

bernstei commented 4 months ago

The drift is probably responsible for the weird temperature, no? the temperature calculation is based on the atomic velocities.

That can be an issue, but I don't see how it could lead to it dropping from 1400 K to 0.005 K ever (and especially not over a single print interval). And a couple of intervals later to 1e-17 K.

gabor1 commented 4 months ago

true

owen-rett commented 4 months ago

Sorry, I think the drift is only happening in my case, cannot speak for mhsiron's case where a major temperature drop occurs. I've seen major temperature drops in simulations where the system is blowing up due to model instability, and seemingly the lammps thermostat is desperately trying to get the atom velocities under control.

gabor1 commented 4 months ago

sorry - confused the two issues. so your crashes get resolved by zeroing the Langevin force sum? I would still like to understand what happens when you have drift, and why that leads to crashes.

bernstei commented 4 months ago

I would still like to understand what happens when you have drift, and why that leads to crashes.

I agree. Does anyone know where the code called by LAMMPS gets its neighbor list? Is it from LAMMPS, or does it do its own (when domain decomposition is off, at least)? If the latter, that'd be my first suspect.

owen-rett commented 4 months ago

As far as I can tell zeroing the Langevin force sum has completely gotten rid of the crashing issue. I don't know the internals of MACE well enough to really speculate on why this is happening, but I've not seen any crashes yet.

bernstei commented 4 months ago

Do you have a trajectory from a run that crashed, so we can check if the atoms are drifting?

owen-rett commented 4 months ago

I don't have one on hand; but can generate one by tomorrow.

bernstei commented 4 months ago

Thanks. I think that'd be useful. Do we think it'd be simpler if we moved @owen-rett's problem to a new issue? [edited]

owen-rett commented 4 months ago

I'll make a new issue real quick; this seems to be getting a bit congested

gabor1 commented 4 months ago

I would still like to understand what happens when you have drift, and why that leads to crashes.

I agree. Does anyone know where the code called by LAMMPS gets its neighbor list? Is it from LAMMPS, or does it do its own (when domain decomposition is off, at least)? If the latter, that'd be my first suspect.

MACE gets its neighbourlist from lammps.

owen-rett commented 4 months ago

I was trying to reproduce the original error using a simplified script, and seem to be getting a separate one, "lost atoms", which makes a lot more sense. Regardless, ensuring that the random forces sum to zero seems to be best practice. I've attached a trajectory where this happens to this reply, but given that I can't rule out incompleteness in the training set being the root cause, I don't think I can call it an issue with system drift necessarily.

dump.lammpstrj.txt

I think I'll just chalk this one up to either a few mistakes in my input script, model instability, or a combination of both, and work on fixing both problems. If the same issue appears again I'll make a proper issue about it and try to document it more fully. I don't have a good reason why setting the sum of random forces to zero seems to cause trajectories to run without issue, but given that I can't rule out an issue with my model, I'm inclined to put blame on that.

EDIT:

Examining the trajectory, it seems that a Zirconium ion got quite close to another Zirconium ion, which likely is what is causing the blow up; I think I'm more inclined to blame model instability in this case. Again I cannot say why zeroing the sum of random forces seems to prevent this issue. I ran an exact copy of this simulation with the forces zero'd and didn't find this issue. My apologies for not examining the trajectory more closely. The original problem happened in a thermodynamic integration simulation where I was not saving lammps trajectories in order to save disc space.

Edit 2: My current suspicion is that something to do with boundary crossing is going wrong, and placing a Zirconium near a Zirconium, which then causes my model to get annoyed, and this then caused the memory issues above.

gabor1 commented 4 months ago

Our experience so far is that if you start from a reasonable configuration (ambient pressure and temperature) then MD will not make things blow up. I'm very interested in cases where MD blows up. We are working on a fix that ensures correct atom-atom repulsion for close distances regardless of conditions. If the atoms got close because of some silliness to do with initial conditions, or you are doing random structure search or similar, you might cope with it better by doing a few steps of relaxation with a purely repulsive (or LJ) model, before you turn on MACE-MP-0 - it really depends on your application.

bernstei commented 4 months ago

If it's a generic atoms getting too close issue, I don't see how zeroing the total force could make a difference. Would you be able to put together a complete reproducing example (LAMMPS input files + model file) ? Even if we have to run it, I think it's important to figure out whether (and if so why) it's happening when the forces are raw but not when they are zeroed.

owen-rett commented 4 months ago

I should have some free GPU resources early next week. Going to try to put together an example using a potential I trained myself, and then see if the error reappears using a MACE-0 model.

wcwitt commented 4 months ago

Thanks both @owen-rett and @mhsiron for sticking with this.

mhsiron commented 4 months ago

Hi all,

If it is of interest I am happy to provide an input script + trained MACE potential which starts from a stable atomic configuration and in the end has all atoms converge like so: [image: MicrosoftTeams-image]

This was ultimately the structure which caused my network/simulation to exhibit the memory error above. It is started by two atoms getting too close during the simulation and was fixed by adding a couple of additional structures of very close atoms in training my network.

wcwitt commented 4 months ago

Hi @mhsiron I just read through everything again. This summary from you is very helpful

To follow up all, and thanks for your help: adding additional training data (I added dimers vs. distance) did make the exact same input script work with no sudden temperature drop. I have not had time to check the PE vs. TE yet for the run that did fail, but I did have time to notice that the crash occurred whenever two atoms got closer than 0.5 A. In terms of L1, changing the page size + max neighbor size does also help with memory usage. To recap -

L1 model crashed due to running out of memory on A100 80GB GPUs with system size > 1200 atoms on default neighbor/atom, page size settings. Lowering atom size or page/neighbor/atom size made L1 models run on 80GB.

L0 model crashed due to unstable model when atoms got too close together. Adding additional dimer data to L0 model made the script successfully run.

and I don't think we need your trajectory. I'm still a bit surprised about the L1 failure with 1200 atoms, but helpful to know about your neighbor list experiments.

In contrast, I don't think we have a good explanation yet for @owen-rett's problem. We can move to a new issue or continue here - either way.

owen-rett commented 4 months ago

Ok, I've performed a number of runs, 4 each for the Master branch and Repulsion branch of MACE. I have been using the Repulsion branch for primary use, as I find it is a bit more stable in general when performing NEB calculations, however I am experiencing crashing regardless of which branch I use. I performed 2 runs using Langevin dynamics without zeroing of the random force, 1 run using Langevin dynamics with zeroing of the random force, and one using Nose-Hoover dynamics. The Langevin dynamics without zeroing of the random force all encounter a major error and crash between zero and 35k timesteps (at 1 fs timestep), and with zeroing of the random force I still get crashing, albeit typically at 180k timesteps. The Nose-Hoover dynamics does not seem to crash, although I only ran the simulations for 200k timesteps.

I've attached three tar files representing Langevin dynamics with zeroing turned off, and turned on for the master branch, as well as a simulation run using Nose-Hoover dynamics. I can do so with all of the directories, however github is getting annoyed at me due to filesizes (saved every 10 steps to try to catch where the failure happens). I've not included the mace model in the uploads but can email it if necessary.

Langevin_Zero_No_A1.tar.gz Langevin_Zero_Yes_A1.tar.gz Nose_Hoover_A1.tar.gz

Edit: I still cannot rule out that the error springs from model quality, however I've seen few issues running high temperature (up to 2500 K) dynamics, even when going out to 0.5 ns, which I would expect to show model errors more clearly than relatively low temperature fixed cell dynamics. I've also seen similar crashing issues even when running at temperatures as low as 70 K, although only when using fixed unit cells (NVT style dynamics).

gabor1 commented 4 months ago

did you get the same failures using the repulsion branch as well ?

owen-rett commented 4 months ago

Sorry, forgot to say, but seeing the same errors on both branches, at similar timesteps. I've attached an example of a failing run from the Repulsion branch below. Langevin_Zero_No_A1_Rep.tar.gz

gabor1 commented 4 months ago

I downloaded the langevin_zero file, there are only 504 frames, and no crash visible (the atoms look perfectly normal in their positions)

owen-rett commented 4 months ago

That's what's confusing me. The atoms don't seem to be getting close enough to trigger a memory-related crash, and each run is only using ~8 GB out of 32 GB for the GPU during normal use, but e.g. on the langevin_zero simulation I am still getting the following. "an illegal memory access was encountered /home/gridsan/orettenmaier/Lammps_MACE/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161...."

gabor1 commented 4 months ago

yes I see the crash in the stderr, but I don't think it can be related to the physics of the simulation (like colliding atoms and such like)

owen-rett commented 4 months ago

A part of me wonders if this is related to the compilation of lammps I'm using. I have a few extra packages turned on, so I'll quickly recompile it with those turned off and report back.

owen-rett commented 4 months ago

Ok, after recompiling in a fresh build directory with more basic settings the errors have disappeared, or at least haven't appeared 60k steps into a run that was crashing at 5k steps before recompilation. As such I think the issue was down to the compilation. I'll begin adding in packages and checking for instability.

My current suspicion is down to the fact that I had tried to turn on Kokkos-UVM in the past, under the idea that I could squeeze a few more atoms into a simulation. The lammps binary used for the above does not have the Kokkos-UVM option turned on, however was still compiled in the same directory as when I tried to do so (recompiling using cmake . -D Kokkos_ENABLE_CUDA_UVM=no ../cmake). I can't think of a reason for any of the other packages I used to cause memory issues, being MISC and Extra-Fix.

If crashing issues reappear for either MISC or Extra-fix I'll report here, but I suspect this is down to the lammps compilation flags I had used.

Edit: You have my apologies, this should have been one of the first things I checked.

It's still baffling that the error only appeared when performing Langevin dynamics using specifically NVT style dynamics, and never seems to appear using Nose-Hoover dynamics using NVT or NPT ensembles, nor when performing Langevin style dynamics using an NPT ensemble.

ACEsuit / mace

LAMMPS-MACE with Kokkos: Illegal memory access encountered with trained MACE but works with MP-MACE-0 #321