Colvars / colvars

Collective variables library for molecular simulation and analysis programs
http://colvars.github.io/
GNU Lesser General Public License v3.0

Program segfaulting after simulation runs #588

Closed wisecashew closed 1 year ago

wisecashew commented 1 year ago

I am running a free energy calculation on the radius of gyration (Rg) of a polymer in water in LAMMPS using the COLVARS package. It is an NPT simulation with Intel acceleration and an ABF bias acting on Rg.

I am seeing an error that seems to occur AFTER the simulation has finished running, and I don’t understand why. I have attached my simulation output. This is the final output message:

Ave neighs/atom = 373.35589
Ave special neighs/atom = 2.1235554
Neighbor list builds = 509 
Dangerous builds = 0 
colvars: Resetting the Collective Variables module.
Total wall time: 0:00:49
srun: error: stellar-i10n4: tasks 1-10,12,15,17-20,25,27,29,31-32,34-35,37-40,42-43,46-58,60-65,67-71,73-85,88-89,92-95: Segmentation fault (core dumped)
srun: Terminating StepId=968088.0
slurmstepd: error: *** STEP 968088.0 ON stellar-i10n4 CANCELLED AT 2023-09-19T17:37:58 *** 
srun: error: stellar-i10n4: tasks 0,11,13-14,16,21-24,26,28,30,33,36,41,44-45,59,66,72,86-87,90-91: Terminated
srun: Force Terminated StepId=968088.0

You can see this in the file npt.out.

As you can see, LAMMPS has also reported the total wall time, so I assume the simulation ran to completion and then crashed right afterwards. What could be causing this? I am running the following command on my cluster:

srun --ntasks=96 --nodes=1 --cpus-per-task=1 --exclusive lmp_colvar -sf intel -in npt.in > npt.out 2>&1

Here, sys.npt.data is my data file, sys.pnipam.water.settings is my settings file, colvars.inp is my Colvars input file, and npt.in is my LAMMPS input file. I have attached all my input files to this message. I would appreciate any advice you have for me.

colvars_inputs.zip

giacomofiorin commented 1 year ago

Hi, can you please provide the code versions including how the LAMMPS executable was built?

wisecashew commented 1 year ago

Thank you for your response, @giacomofiorin! Yes, here it is:

#!/bin/bash

VERSION=29Sep2021

echo "deleting old tarball..." 
rm stable_${VERSION}.tar.gz || true 
echo "deleting old lammps build..." 
rm -rf lammps-stable_${VERSION} || true 
echo "now start grabbing tar file from the repo..." 

wget https://github.com/lammps/lammps/archive/stable_${VERSION}.tar.gz
tar zxf stable_${VERSION}.tar.gz
cd lammps-stable_${VERSION}
mkdir build && cd build

module purge
module load intel/19.1.1.217
module load intel-mpi/intel/2019.7

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local.lammps.latest.w.accelrn \
-D CMAKE_BUILD_TYPE=Release \
-D LAMMPS_MACHINE=user_intel \
-D ENABLE_TESTING=yes \
-D BUILD_OMP=yes \
-D BUILD_MPI=yes \
-D CMAKE_C_COMPILER=icc \
-D CMAKE_CXX_COMPILER=icpc \
-D CMAKE_CXX_FLAGS_RELEASE="-Ofast -xHost -DNDEBUG" \
-D PKG_MOLECULE=yes -D PKG_RIGID=yes -D PKG_MISC=yes \
-D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
-D PKG_EXTRA-MOLECULE=yes -D PKG_USER-INTEL=yes -D PKG_ASPHERE=yes -D PKG_CLASS2=yes -D PKG_OPENMP=yes -D PKG_OPT=yes -D PKG_EXTRA-DUMP=yes \
-D PKG_COLVARS=yes \
-D PKG_INTEL=yes -D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 16
make install

I have attached my LAMMPS build script (which uses CMake) to this message: stellar_intel_lammps_user_intel.sh.txt

I added the .txt extension only so that it could be attached here.

akohlmey commented 1 year ago

FWIW, when I run with valgrind using the 2Aug2023 version of LAMMPS I get:

==196831== Conditional jump or move depends on uninitialised value(s)
==196831==    at 0x7F9A390: colvar::periodic_boundaries(colvarvalue const&, colvarvalue const&) const (colvar.cpp:2158)
==196831==    by 0x8087BED: colvar_grid<unsigned long>::init_from_colvars(std::vector<colvar*, std::allocator<colvar*> > const&, unsigned long, bool) [clone .isra.0] (colvargrid.h:299)
==196831==    by 0x8088ECB: colvar_grid (colvargrid.h:258)
==196831==    by 0x8088ECB: colvar_grid_count::colvar_grid_count(std::vector<colvar*, std::allocator<colvar*> >&, unsigned long const&, bool) (colvargrid.cpp:37)
==196831==    by 0x7FDBEEF: colvarbias_abf::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (colvarbias_abf.cpp:193)
==196831==    by 0x8097040: parse_biases_type<colvarbias_abf> (colvarmodule.cpp:497)
==196831==    by 0x8097040: colvarmodule::parse_biases(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (colvarmodule.cpp:523)
==196831==    by 0x8099CCB: colvarmodule::parse_config(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (colvarmodule.cpp:278)
==196831==    by 0x809A188: colvarmodule::read_config_file(char const*) (colvarmodule.cpp:210)
==196831==    by 0x80BA370: colvarproxy::parse_module_config() (colvarproxy.cpp:531)
==196831==    by 0x60AC9D8: LAMMPS_NS::FixColvars::one_time_init() (fix_colvars.cpp:448)
==196831==    by 0x60AD030: LAMMPS_NS::FixColvars::setup(int) (fix_colvars.cpp:519)
==196831==    by 0x5DD0487: LAMMPS_NS::Modify::setup(int) (modify.cpp:310)
==196831==    by 0x5F55D89: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:159)
==196831==  Uninitialised value was created by a heap allocation
==196831==    at 0x4841FB5: operator new(unsigned long) (vg_replace_malloc.c:472)
==196831==    by 0x809585A: colvarmodule::parse_colvars(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (colvarmodule.cpp:422)
==196831==    by 0x8099CB2: colvarmodule::parse_config(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (colvarmodule.cpp:273)
==196831==    by 0x809A188: colvarmodule::read_config_file(char const*) (colvarmodule.cpp:210)
==196831==    by 0x80BA370: colvarproxy::parse_module_config() (colvarproxy.cpp:531)
==196831==    by 0x60AC9D8: LAMMPS_NS::FixColvars::one_time_init() (fix_colvars.cpp:448)
==196831==    by 0x60AD030: LAMMPS_NS::FixColvars::setup(int) (fix_colvars.cpp:519)
==196831==    by 0x5DD0487: LAMMPS_NS::Modify::setup(int) (modify.cpp:310)
==196831==    by 0x5F55D89: LAMMPS_NS::Verlet::setup(int) (verlet.cpp:159)
==196831==    by 0x5EE753E: LAMMPS_NS::Run::command(int, char**) (run.cpp:171)
==196831==    by 0x5D3DA4C: LAMMPS_NS::Input::execute_command() (input.cpp:868)
==196831==    by 0x5D3E67D: LAMMPS_NS::Input::file() (input.cpp:313)

This can be easily silenced by this change:

  diff --git a/lib/colvars/colvar.cpp b/lib/colvars/colvar.cpp
  index 700d3752ac..0cb5c1ebdb 100644
  --- a/lib/colvars/colvar.cpp
  +++ b/lib/colvars/colvar.cpp
  @@ -30,6 +30,7 @@ colvar::colvar()
     after_restart = false;
     kinetic_energy = 0.0;
     potential_energy = 0.0;
  +  period = 0.0;

   #ifdef LEPTON
     dev_null = 0.0;

However, I do not get a segmentation fault either before or after this change.
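
For readers unfamiliar with this kind of warning, here is a minimal sketch of the pattern behind the valgrind report above and of the one-line fix. The class `toy_colvar` and its methods are hypothetical stand-ins and not the actual Colvars code; only the member name `period` is taken from the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for a class with a plain `double` member.
// This is NOT the real colvar class, only an illustration.
class toy_colvar {
public:
  toy_colvar() {
    // Without this assignment, `period` would hold an indeterminate value.
    // Branching on such a value is what valgrind reports as
    // "Conditional jump or move depends on uninitialised value(s)".
    period = 0.0;
  }

  // Hypothetical check: treat the variable as periodic only when a
  // nonzero period has been set.
  bool is_periodic() const { return period != 0.0; }

  void set_period(double p) { period = p; }

private:
  double period;
};
```

With the assignment in the constructor, a freshly constructed `toy_colvar` reports itself as non-periodic until `set_period()` is called with a nonzero value. An equivalent alternative in C++11 and later would be a default member initializer (`double period = 0.0;`).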

giacomofiorin commented 1 year ago

Thanks for the quick diagnosis @akohlmey!

Interestingly, this is one of the very oldest classes and the missing initialization went undetected all this time.