SSAGESproject / SSAGES

Software Suite for Advanced General Ensemble Simulations
GNU General Public License v3.0

Forward flux sampling crashing when N_walker < N_procs #10

Open ohenrich opened 5 years ago

ohenrich commented 5 years ago

Dear SSAGES Developers,

I started with the FFS tutorial in /Examples/User/ForwardFlux/LAMMPS/Langevin/ and get a fatal error about missing atoms when the number of processors is larger than, but divisible by, the number of walkers, e.g. 1 walker on 2 processes.

The error occurs because of line #354 in /src/Methods/ForwardFlux.cpp. The default atom index, set to -1 in line #323, is never overwritten because the atom cannot be found, so it remains negative.

There is a comment at line #342 and below reading //FIXME: try using snapshot->GetLocalIndex() //copied from Ben's previous implementation. Does this suggest the code was taken from another implementation and perhaps does not work as expected?
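To illustrate what I think is happening, here is a minimal standalone sketch; it is not the actual SSAGES code and all names in it are only illustrative. With more than one MPI rank per walker, each rank holds only part of the system, so a purely local lookup of an atom ID keeps the default index -1 on every rank that does not own the atom, and a reduction over the walker's communicator is needed to tell "atom owned by another rank" apart from "atom genuinely missing":

// Minimal sketch, not SSAGES code: local atom-ID lookup with the -1 default,
// followed by a reduction that checks whether any rank owns the atom.
#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // With domain decomposition each rank owns only some atoms,
    // e.g. rank 0 owns IDs {1, 2}, rank 1 owns IDs {3, 4}.
    std::vector<int> local_ids = {2 * rank + 1, 2 * rank + 2};

    const int wanted_id = 1;  // atom ID read from the dump file
    int local_index = -1;     // default, as at ForwardFlux.cpp line #323

    for (std::size_t i = 0; i < local_ids.size(); ++i)
        if (local_ids[i] == wanted_id)
            local_index = static_cast<int>(i);

    // Erroring out whenever local_index is still -1 crashes every rank
    // that simply does not own the atom. Reducing over the communicator
    // shows whether some rank found it.
    int found_somewhere = -1;
    MPI_Allreduce(&local_index, &found_somewhere, 1, MPI_INT, MPI_MAX,
                  MPI_COMM_WORLD);

    if (found_somewhere < 0)
        std::printf("Rank %d: atom ID %d is missing on every rank.\n",
                    rank, wanted_id);
    else if (local_index < 0)
        std::printf("Rank %d: atom ID %d lives on another rank.\n",
                    rank, wanted_id);
    else
        std::printf("Rank %d: atom ID %d found at local index %d.\n",
                    rank, wanted_id, local_index);

    MPI_Finalize();
    return 0;
}

Run with mpirun -np 2, only rank 0 finds atom ID 1 locally; rank 1 keeps the -1 default, which appears to be exactly the situation the current check treats as fatal.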

I attach a zip archive with all input files etc. for a run with 1 walker on 2 MPI tasks, which should allow you to reproduce the issue. I modified the tutorial example in a straightforward way to decrease the number of walkers from two to one. I don't think I made a mistake here, but please check this first; it is not obvious to me what could have gone wrong.

I also attach the stdout from the configuration and installation steps, which should show which MPI library etc. I have used. I could reproduce the issue on a completely different system. In all instances LAMMPS works fine in parallel, and SSAGES was built with a copy of that distribution.

I'm happy to help with the fix, but would require more input as I'm obviously not very familiar with the code. I understand how difficult it is to find the time to help others as I'm doing the same on a number of projects. So your help and time is very much appreciated.

Best wishes, Oliver

1walker2proc.zip cmake_config.txt build.txt

jhelfferich commented 5 years ago

Hi Oliver,

I haven't tried to reproduce your error with your input files yet, but one thing that caught my eye is a linker error in your build.txt. The build fails to link against the MPI library, which could be the cause of your trouble. To debug this further, you can do a verbose build with

make VERBOSE=1

Also, the output of the following commands would be helpful to understand if it is a compiling/linking problem or a bug in the source code:

ld ssages
which mpirun
mpirun --version
ohenrich commented 5 years ago

Hi Julien,

Many thanks for your response. I posted earlier today, but deleted the comment as I've got an update.

There is definitely a problem with MacPorts and its OpenMPI library. When using the standard library (e.g. openmpi-devel-default) I even get a fatal error during the linking stage and no executable is produced. This is why I decided to move to our local HPC system, where OpenMPI and GCC were compiled and installed from source.

Although no warning or error occurs during the build process any more (see the attached files cmake_config_archie.txt and build_archie.txt), the error persists.

I also attach the stdout from the runs with 2 walkers on 2 processes (slurm-313283.out) and 1 walker on 2 processes (slurm-313284.out).

The output of the above commands is

[xwb17127@archie-e Langevin]$ ld ssages
ld: error in ssages(.eh_frame); no .eh_frame_hdr table will be created.
[xwb17127@archie-e Langevin]$ which mpirun
/opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/bin/mpirun
[xwb17127@archie-e Langevin]$ mpirun --version
mpirun (Open MPI) 2.1.2

Let me know if there is anything else I can do.

Best wishes, Oliver

cmake_config_archie.txt build_archie.txt slurm-313283.txt slurm-313284.txt

jhelfferich commented 5 years ago

Hi Oliver,

The first thing I noticed is that there is a problem in the CMakeLists.txt compile instructions. You can remove the following line:

link_directories(${MPI_CXX_LIBRARIES})

This command leads to inconsistent linker flags, because link_directories() expects directory paths while MPI_CXX_LIBRARIES contains the full paths of the MPI library files. Depending on how strict your linker is, this can cause an error. Maybe this will fix the compilation on your Mac.
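For reference, here is a minimal sketch of the usual FindMPI-based pattern; the actual SSAGES CMakeLists.txt is more involved, and the target and source names below are only illustrative:

# Minimal sketch, assuming the classic FindMPI variables; not the actual
# SSAGES build files.
find_package(MPI REQUIRED)

add_executable(ssages main.cpp)

# MPI_CXX_LIBRARIES holds full paths to library files, not directories,
# so it belongs in target_link_libraries() rather than link_directories().
target_include_directories(ssages PRIVATE ${MPI_CXX_INCLUDE_PATH})
target_link_libraries(ssages ${MPI_CXX_LIBRARIES})

With this pattern no link_directories() call is needed at all.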

Also, I made a typo in my previous comment. Please give the output of

ldd ssages

(with two d's).

Concerning the error "could not locate atomID 1 from dumpfile", I could reproduce the problem and will look into it further.

ohenrich commented 5 years ago

Hi Julian,

Removing the above line in CMakeLists.txt on my Mac does indeed get rid of the last linking error. However, this does not affect the result of the test run on my Mac (still crashes for 2 processes with 1 walker).

Regarding the output of ldd ssages on the HPC facility, I get the following in the build directory:

linux-vdso.so.1 =>  (0x00007ffd85786000)
liblammps_mpi.so => /users/xwb17127/work/lammps.ssages/src/liblammps_mpi.so (0x00002af2b279c000)
libmpi.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libmpi.so.20 (0x00002af2b2c21000)
libstdc++.so.6 => /opt/software/eb/software/GCCcore/6.4.0/lib64/libstdc++.so.6 (0x00002af2b2e4f000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00002af2b2ff5000)
libgcc_s.so.1 => /opt/software/eb/software/GCCcore/6.4.0/lib64/libgcc_s.so.1 (0x00002af2b32f7000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002af2b330f000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00002af2b352c000)
libgpfs.so => /usr/lib64/libgpfs.so (0x00002af2b38ef000)
libpsm2.so.2 => /usr/lib64/libpsm2.so.2 (0x00002af2b3b04000)
libopen-rte.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libopen-rte.so.20 (0x00002af2b3d90000)
libopen-pal.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libopen-pal.so.20 (0x00002af2b3eb6000)
libfabric.so.1 => /usr/lib64/libfabric.so.1 (0x00002af2b4049000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002af2b434c000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002af2b4563000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00002af2b4778000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002af2b4981000)
libhwloc.so.5 => /opt/software/eb/software/hwloc/1.11.8-GCCcore-6.4.0/lib/libhwloc.so.5 (0x00002af2b4b84000)
libnuma.so.1 => /opt/software/eb/software/numactl/2.0.11-GCCcore-6.4.0/lib/libnuma.so.1 (0x00002af2b4bc1000)
/lib64/ld-linux-x86-64.so.2 (0x00005654fcde2000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002af2b4bce000)
libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00002af2b4dd2000)
libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00002af2b5040000)
libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00002af2b5296000)
libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00002af2b54b8000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002af2b56c7000)

In the example directory /ssages/Examples/User/ForwardFlux/LAMMPS/Langevin some addresses are different:

linux-vdso.so.1 =>  (0x00007ffe240a8000)
liblammps_mpi.so => /users/xwb17127/work/lammps.ssages/src/liblammps_mpi.so (0x00002b451a5ad000)
libmpi.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libmpi.so.20 (0x00002b451aa32000)
libstdc++.so.6 => /opt/software/eb/software/GCCcore/6.4.0/lib64/libstdc++.so.6 (0x00002b451ac60000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00002b451ae06000)
libgcc_s.so.1 => /opt/software/eb/software/GCCcore/6.4.0/lib64/libgcc_s.so.1 (0x00002b451b108000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002b451b120000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00002b451b33d000)
libgpfs.so => /usr/lib64/libgpfs.so (0x00002b451b700000)
libpsm2.so.2 => /usr/lib64/libpsm2.so.2 (0x00002b451b915000)
libopen-rte.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libopen-rte.so.20 (0x00002b451bba1000)
libopen-pal.so.20 => /opt/software/eb/software/OpenMPI/2.1.2-GCC-6.4.0-2.28/lib/libopen-pal.so.20 (0x00002b451bcc7000)
libfabric.so.1 => /usr/lib64/libfabric.so.1 (0x00002b451be5a000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002b451c15d000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002b451c374000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00002b451c589000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002b451c792000)
libhwloc.so.5 => /opt/software/eb/software/hwloc/1.11.8-GCCcore-6.4.0/lib/libhwloc.so.5 (0x00002b451c995000)
libnuma.so.1 => /opt/software/eb/software/numactl/2.0.11-GCCcore-6.4.0/lib/libnuma.so.1 (0x00002b451c9d2000)
/lib64/ld-linux-x86-64.so.2 (0x000055d8a5653000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002b451c9df000)
libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00002b451cbe3000)
libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00002b451ce51000)
libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00002b451d0a7000)
libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00002b451d2c9000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002b451d4d8000)
mquevill commented 5 years ago

Unfortunately, the current FFS implementation is set up for one MPI process per walker. This is a definite shortcoming of the method at this time, and we will most likely overhaul FFS at some point to make it more rigorous and able to handle more use cases. For some "large" cases, we have seen the method fail to calculate the rate and committor probability automatically. In addition, the filenames for the "failed" trajectories are numbered cumulatively rather than by interface.
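As a hedged illustration of that constraint (this is not the actual SSAGES code, and all names are only illustrative), a method restricted to one process per walker could verify the counts at setup and abort with a clear message instead of failing later in the atom lookup:

// Minimal sketch, not SSAGES code: fail early and clearly when the number
// of MPI processes does not match the number of walkers.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

static void check_one_proc_per_walker(MPI_Comm world, int nwalkers)
{
    int nprocs = 1, rank = 0;
    MPI_Comm_size(world, &nprocs);
    MPI_Comm_rank(world, &rank);

    if (nprocs != nwalkers)
    {
        if (rank == 0)
            std::fprintf(stderr,
                "Forward flux sampling currently requires exactly one MPI "
                "process per walker (got %d processes for %d walkers).\n",
                nprocs, nwalkers);
        MPI_Abort(world, EXIT_FAILURE);
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    const int nwalkers = 1;  // illustrative: one walker requested in the input
    check_one_proc_per_walker(MPI_COMM_WORLD, nwalkers);
    MPI_Finalize();
    return 0;
}

Run with mpirun -np 2, this aborts immediately with a readable message rather than the missing-atom error reported above.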