LAMMPS-MACE with Kokkos: Illegal memory access encountered with trained MACE but works with MACE-MP-0

Hello MACE developers, your help would be appreciated:

Describe the bug:

LAMMPS with ML-MACE crashes at a different timestep each time the very same input script is submitted to the very same hosts, always with the same stack trace reporting that an illegal memory access was encountered.

The LAMMPS script works as expected with the MACE-MP-0 L0 trained model provided by this repository, but fails at a seemingly random point in the run with a custom-trained MACE model. That model was trained with the same command as MACE-MP-0 L0, except for the distributed/num_workers options.

To Reproduce (steps to reproduce the behavior):

LAMMPS Input script:

variable dt     equal dt
variable time   equal time
variable temp   equal temp 
variable etotal equal etotal
variable press  equal press 
variable lx     equal lx 
variable vol    equal vol 
variable density equal density

read_data               data.init.read
replicate 2 2 2

newton on
pair_style mace no_domain_decomposition
pair_coeff * * model-lammps_L0.pt  Si O C H 

compute         temp    all temp
compute         com         all     com
compute         keatom      all     ke/atom

thermo 10
dump d1                 all atom 10 dumpmin.atom  
minimize                0.0     1.0e-8  5000    100000
undump d1

write_restart   restart.min.ac
write_data      data.min.read

timestep                0.0001
variable tempini        equal 1000
variable tempfin        equal 3000
variable rate           equal 100 #1E-2K/fs
variable nstep          equal "(v_tempfin - v_tempini)/v_rate/v_dt"
variable neverydmp      equal "v_nstep/20"
variable neveryprnt equal "v_nstep/200"
variable vscale         equal 1.0
#print "${neverydmp}"

#------------------------------------------------------------------------------------------------------------
# Temperature ramp
reset_timestep  0
velocity                all     create ${tempini} 142857        mom yes rot yes dist gaussian
#fix fi3                all print ${neveryprnt} "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}"  screen no append thermovals.dat
fix fi3                 all print 100 "${time} ${temp} ${etotal} ${press} ${lx} ${vol} ${density}"  screen no append thermovals.dat
fix fi2                 all deform 1 x scale ${vscale} y scale ${vscale} z scale ${vscale} remap none
fix fi1                 all     nvt temp ${tempini} ${tempfin} 0.010
dump d1                 all atom ${neverydmp} dumpmeltramp.atom  
thermo ${neverydmp}
thermo_style    custom  step temp lx ly lz etotal pxx pyy pzz
run ${nstep}
unfix fi1
unfix fi2
unfix fi3
undump d1

write_restart   restart.meltramp.ac
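
For reference, the length of the temperature ramp follows directly from the variables in the script above. The snippet below is only a sketch that mirrors the `v_nstep`, `v_neverydmp` and `v_neveryprnt` formulas in Python; the function name `ramp_steps` is made up for illustration, and the time units are whatever the (not shown) `units` setting implies.

```python
# Values copied from the input script above.
timestep = 0.0001   # `timestep 0.0001`, in the script's time units
temp_ini = 1000.0   # v_tempini
temp_fin = 3000.0   # v_tempfin
rate = 100.0        # v_rate, temperature change per unit time


def ramp_steps(t_ini, t_fin, rate, dt):
    """Mirror of v_nstep: (v_tempfin - v_tempini)/v_rate/v_dt."""
    return (t_fin - t_ini) / rate / dt


# round() guards against floating-point noise in the division by 0.0001.
nstep = round(ramp_steps(temp_ini, temp_fin, rate, timestep))
dump_every = nstep // 20     # v_neverydmp, used by `dump d1` and `thermo`
print_every = nstep // 200   # v_neveryprnt, defined but unused (fix print uses 100)

print(nstep, dump_every, print_every)
```

With these values the ramp is 200000 steps long, and the dump and thermo output are written every 10000 steps, so the `fix print` output in thermovals.dat (every 100 steps) is the finer-grained record of how far each run got before crashing.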

Input files below (initial structure, LAMMPS trained model): input_files.zip

Stacktrace:

cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /tools/mace/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161
Backtrace:
Kokkos::Impl::save_stacktrace() [0x2aaaabdf51a5]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x2aaaabdec65a]
Kokkos::Impl::host_abort(char const*) [0x2aaaabdec6bb]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x2aaaabdfa74a]
Kokkos::Impl::cuda_device_synchronize(std::string const&) [0x2aaaabdfa811]
Kokkos::Impl::ExecSpaceManager::static_fence(std::string const&) [0x2aaaabdd4395]
void Kokkos::deep_copy<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks, int, Kokkos::LayoutLeft, Kokkos::Cuda, void>(Kokkos::View<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks> const&, Kokkos::View<int, Kokkos::LayoutLeft, Kokkos::Cuda, void> const&, std::enable_if<((std::is_void<Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::specialize>::value&&std::is_void<Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Cuda, void>::specialize>::value)&&(((unsigned int)Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::rank)==((unsigned int)(0))))&&(((unsigned int)Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Cuda, void>::rank)==((unsigned int)(0))), void>::type*) [0x2aaaab5ccef6]
LAMMPS_NS::NBinKokkos<Kokkos::Cuda>::bin_atoms() [0x2aaaab6b947b]
void LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda>(int) [0x2aaaab65eb61]
LAMMPS_NS::VerletKokkos::run(int) [0x2aaaab98bd5c]
LAMMPS_NS::Run::command(int, char**) [0x2aaaab4cf6bb]
LAMMPS_NS::Input::execute_command() [0x2aaaab364765]
LAMMPS_NS::Input::file() [0x2aaaab364a4d]
LAMMPS_NS::Input::include() [0x2aaaab364ffd]
LAMMPS_NS::Input::execute_command() [0x2aaaab363f37]
LAMMPS_NS::Input::file() [0x2aaaab364a4d]
[0x40473a]
__libc_start_main [0x2aaaaefe7ac5]
[0x4048d9]
[scff292100:22082] *** Process received signal ***
[scff292100:22082] Signal: Aborted (6)
[scff292100:22082] Signal code:  (-6)
[scff292100:22082] [ 0] /lib64/libpthread.so.0(+0x11ce0)[0x2aaaaec52ce0]
[scff292100:22082] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaaeffc0c7]
[scff292100:22082] [ 2] /lib64/libc.so.6(abort+0x13a)[0x2aaaaeffd49a]
[scff292100:22082] [ 3] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl17human_memory_sizeEm+0x0)[0x2aaaabdec6c0]
[scff292100:22082] [ 4] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl25cuda_internal_error_abortE9cudaErrorPKcS3_i+0xea)[0x2aaaabdfa74a]
[scff292100:22082] [ 5] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl23cuda_device_synchronizeERKSs+0xb1)[0x2aaaabdfa811]
[scff292100:22082] [ 6] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos4Impl16ExecSpaceManager12static_fenceERKSs+0x25)[0x2aaaabdd4395]
[scff292100:22082] [ 7] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN6Kokkos9deep_copyIiJNS_10LayoutLeftENS_6DeviceINS_6OpenMPENS_9HostSpaceEEENS_12Experimental14EmptyViewHooksEEiJS1_NS_4CudaEvEEEvRKNS_4ViewIT_JDpT0_EEERKNS9_IT1_JDpT2_EEEPNSt9enable_ifIXaaaaaasrSt7is_voidINS_10ViewTraitsISA_JSC_EE10specializeEE5valuesrSN_INSO_ISG_JSI_EE10specializeEE5valueeqcvjsrSP_4rankcvjLi0EeqcvjsrSS_4rankcvjLi0EEvE4typeE+0x196)[0x2aaaab5ccef6]
[scff292100:22082] [ 8] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS10NBinKokkosIN6Kokkos4CudaEE9bin_atomsEv+0x143b)[0x2aaaab6b947b]
[scff292100:22082] [ 9] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS14NeighborKokkos12build_kokkosIN6Kokkos4CudaEEEvi+0x1d1)[0x2aaaab65eb61]
[scff292100:22082] [10] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS12VerletKokkos3runEi+0x124c)[0x2aaaab98bd5c]
[scff292100:22082] [11] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS3Run7commandEiPPc+0xd1b)[0x2aaaab4cf6bb]
[scff292100:22082] [12] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0xd65)[0x2aaaab364765]
[scff292100:22082] [13] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x15d)[0x2aaaab364a4d]
[scff292100:22082] [14] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input7includeEv+0xed)[0x2aaaab364ffd]
[scff292100:22082] [15] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input15execute_commandEv+0x537)[0x2aaaab363f37]
[scff292100:22082] [16] /tools/mace/lammps/build-kokkosA100V100/liblammps.so.0(_ZN9LAMMPS_NS5Input4fileEv+0x15d)[0x2aaaab364a4d]
[scff292100:22082] [17] lmp[0x40473a]
[scff292100:22082] [18] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaefe7ac5]
[scff292100:22082] [19] lmp[0x4048d9]
[scff292100:22082] *** End of error message ***
/data/relax_rdf_reaxff_si16o25c15_ens-0_qu-10Kpps_3000K1.4DEN/run.sh: line 18: 22082 Aborted                 (core dumped) lmp -k on g 1 -sf kk -in in.rdf
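
The abort above is raised from `cudaDeviceSynchronize()` during a `deep_copy` inside `NBinKokkos<Kokkos::Cuda>::bin_atoms()`. Since CUDA kernels are launched asynchronously, the kernel that actually performed the illegal access may have been launched earlier than the frame where the error surfaces. One way to narrow this down might be to rerun the same command with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. The snippet below is a minimal sketch of such a rerun; `build_debug_run` is a hypothetical helper name, and the command line is copied from the last line of the log.

```python
import os
import shlex

# Command taken from the last line of the log above.
LAMMPS_CMD = "lmp -k on g 1 -sf kk -in in.rdf"


def build_debug_run(cmd=LAMMPS_CMD, base_env=None):
    """Return (argv, env) for a rerun with synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until it completes,
    # so an error is reported at the launch that caused it rather than at a
    # later synchronization point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return shlex.split(cmd), env


argv, env = build_debug_run()
print(argv)

# To actually launch (requires the same lmp build on PATH):
#   import subprocess
#   subprocess.run(argv, env=env, check=False)
```

This is expected to slow the run down noticeably, so it is a diagnostic setting only. The build listed under System setup also sets `Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no`; a rebuild with that option enabled may give a more specific error, although this has not been tried here.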

System setup (please complete the following information):

- OS: SLES12
- LAMMPS compile command: cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE80=yes -DBUILD_OMP=yes -D BUILD_MPI=yes -D BUILD_SHARED_LIBS=yes -D LAMMPS_EXCEPTIONS=yes -D PKG_OPENMP=yes -D PKG_OPENMP=yes -D Kokkos_ENABLE_OPENMP=yes -D Kokkos_ENABLE_CUDA=yes -DCUDATOOLKIT_ROOT_DIR=/usr/local/cuda-11.6 -DKokkos_ARCH_PASCAL60=no -DCMAKE_PREFIX_PATH=/tools/pytorch/torch/share/cmake/ -DCMAKE_CXX_COMPILER=/tools/mace/lammps/lib/kokkos/bin/nvcc_wrapper -D Kokkos_ARCH_AMDAVX=yes -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no -D Kokkos_ENABLE_CUDA_UVM=no -D PKG_ML-MACE=yes
- GCC 8.2
- CUDA 11.6
- A100 GPUs
- PyTorch 1.13.1-rc1, compiled from source (the pre-compiled zip file doesn't work on SLES12 because of its old GLIBC version)

Additional context:

The simulation is completely stable with the pre-trained MACE-MP-0 model. It is possible that the custom model file was trained with a different PyTorch version.

ACEsuit/mace issue #321