SouthernMethodistUniversity / m2_examples

Application-Specific Job Scripts for ManeFrame II

NAMD Swarm of Trajectories #1

rkalescky opened this issue 4 years ago

rkalescky commented 4 years ago

This doesn't appear to work in parallel on M2.

robertgortega commented 4 years ago

Downloaded the NAMD 2.14 source to my home directory on M2 and compiled NAMD according to the NAMD 2.14 release notes for the single-node multicore, Ethernet, InfiniBand verbs, InfiniBand UCX OpenMPI, and MPI versions for testing. There was an issue with the InfiniBand UCX OpenMPI version, which has since been resolved: it turns out you build it with the OpenMPI module instead of hpcx, since OpenMPI has UCX support built in. Currently researching the technologies attached to this project (Charm++, InfiniBand, UCX, MPI vs. OpenMP) and Slurm, srun, and sbatch for tests of the multi-node versions of NAMD. A sketch of the build commands is below.
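For reference, a rough sketch of the Charm++ build invocations for those targets, following the NAMD 2.14 release notes; the module names are assumptions about the M2 environment, and the slurmpmi2 option is the one used later in this thread.

module load gcc-9.2 openmpi                                            # assumed module names on M2

# Charm++ 6.10.2 targets used by NAMD 2.14 (run from the charm source directory)
./build charm++ multicore-linux-x86_64 --with-production               # single-node multicore
./build charm++ netlrts-linux-x86_64 --with-production                 # Ethernet
./build charm++ verbs-linux-x86_64 --with-production                   # InfiniBand verbs
./build charm++ ucx-linux-x86_64 slurmpmi2 --with-production           # InfiniBand UCX (OpenMPI module provides UCX)
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production   # MPI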

robertgortega commented 4 years ago

Two interesting papers for background info: one on NAMD and one a tutorial on InfiniBand, Verbs, and MPI. The first is from the NAMD website and was just released in August 2020: [Scalable molecular dynamics on CPU and GPU architectures.pdf](https://github.com/SouthernMethodistUniversity/m2_examples/files/5283934/Scalable.molecular.dynamics.on.CPU.and.GPU.architectures.pdf). The second was found while doing research on InfiniBand.

robertgortega commented 4 years ago

Here is the error message from running the command mpiexec -n 4 ./pgm, where pgm is the Charm++ test binary from the InfiniBand UCX OpenMPI build. The full output is rather lengthy; the head of the message below is where the error occurs:

[bobo@login04 megatest]$ mpiexec -n 4 ./pgm
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 14 cores x 1 PUs = 28-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.
Error in `./pgm': munmap_chunk(): invalid pointer: 0x00007fffffffb520
(the same munmap_chunk() error is printed, interleaved, by the other ranks)
======= Backtrace: =========
/lib64/libc.so.6(+0x7f3e4)[0x2aaaacc623e4]
./pgm(_ZNSt10_HashtableIiSt4pairIKiiESaIS2_ENSt8__detail10_Select1stESt8equal_toIiESt4hashIiENS4_18_Mod_range_hashingENS4_20_Default_ranged_hashENS4_20_Prime_rehash_policyENS4_17_Hashtable_traitsILb0ELb0ELb1EEEE9_M_rehashEmRKm+0x91)[0x61a761]

robertgortega commented 4 years ago

This build, the InfiniBand UCX OpenMPI version, was not correct. None of the attempts at a rebuild (GCC, Intel, all HPC-X modules) were successful. I opened a ticket with Mellanox and forwarded the error messages from the build attempts. Mellanox Tech Support reports that the issue appears to be an incompatible HPC-X/UCX version. The HPC-X version on M2 is 2.1 (for all modules), though I can't confirm that. Apparently, support for UCP PUT and GET was added in HPC-X v2.2, and Mellanox suggests using HPC-X v2.7 if possible. I don't believe UCX over InfiniBand is a priority, but I investigated because this version failed to build.
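A quick way to check which UCX release the M2 modules actually provide would be something like the following; the module names are assumptions, but ucx_info and ompi_info are standard tools shipped with UCX and OpenMPI.

module avail hpcx ucx 2>&1 | less   # list available HPC-X/UCX modules (assumed module names)
module load hpcx                    # assumed module name
ucx_info -v                         # prints the UCX library version and build configuration
ompi_info | grep -i ucx             # shows whether the loaded OpenMPI was built with UCX support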

rkalescky commented 4 years ago

What build commands are you using for the MPI variant?

robertgortega commented 4 years ago

For the build (add the icc option for the Intel version):

env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production

For testing (pgm is the Charm++ megatest program, a set of 52 tests to make sure NAMD/Charm++ built OK):

mpiexec -n 4 ./pgm
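Put together, the full sequence looks roughly like the sketch below. The module names are assumptions for M2; the charm tarball is the one bundled with the NAMD 2.14 source, and the megatest step follows the standard Charm++ test procedure.

module purge
module load gcc-9.2 openmpi                  # assumed module names; swap in intel-2020.0 for the icc build

tar xf charm-6.10.2.tar && cd charm-6.10.2   # charm tarball bundled in the NAMD 2.14 source
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production

# sanity check: build and run the 52-test megatest program
cd mpi-linux-x86_64/tests/charm++/megatest
make pgm
mpiexec -n 4 ./pgm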

rkalescky commented 4 years ago

Which compiler suite?

robertgortega commented 4 years ago

Both gcc-9.2 and intel-2020.0

robertgortega commented 4 years ago

I'm going to forward you, via email, my discussion with Aleksey Senin of Mellanox about Case No. 00905512.

./build charm++ ucx-linux-x86_64 slurmpmi2 --with-production

The InfiniBand UCX OpenMPI version didn't build with this command. It errors out in the same place, trying to get to ucp_put and ucp_get. Detailed error messages are in the email.

robertgortega commented 4 years ago

I believe we have something to work with now. I've been able to run the MPI ICC-based version of NAMD successfully using srun with the following commands:

srun -n x -N X -p standard-mem-s --mem=2G ./namd2 src/alanin (66-atom config file)
srun -n x -N X -p standard-mem-s --mem=2G ./namd2 apoa1/apoa1.namd (125 molecules, 92224 atoms)

with n = 4, 8 and N = 2, 4, 8
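For repeatable runs, these srun tests could be wrapped in a batch script along the lines of the sketch below; the job name, module name, and namd2/input paths are assumptions, while the partition, memory, and task counts mirror the commands above.

#!/bin/bash
#SBATCH -J namd_mpi_test            # hypothetical job name
#SBATCH -o namd_mpi_test.out
#SBATCH -N 4
#SBATCH -n 8
#SBATCH -p standard-mem-s
#SBATCH --mem=2G

module purge
module load intel-2020.0            # assumed module name; gcc-9.2 for the GCC comparison runs

# MPI build of namd2: srun starts one rank per task (-n above)
srun ./namd2 apoa1/apoa1.namd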

I also ran the GCC versions to compare outputs. I've captured the output with date timestamps, and all runs show "Running on X hosts. Running on X processors, X nodes, X physical nodes" in the output files. There is a significant performance difference between the ICC and GCC versions when running the 125-molecule test.

For example, with the 125-molecule input file:

| Run | Compiler | Wall (s) | CPU (s) | Memory (MB) |
|---|---|---|---|---|
| n=4, N=4 | ICC | 81.29 | 81.29 | 908.14 |
| n=4, N=4 | GCC | 138.43 | 138.43 | 898.27 |
| n=8, N=8 | ICC | 56.85 | 56.85 | 877.41 |
| n=8, N=8 | GCC | 82.54 | 82.54 | 871.63 |

Note that these runs use the standard MPI builds. I'm still not able to build the InfiniBand UCX version (either the Charm++ library or NAMD) due to errors. I've emailed the developers at charm@cs.illinois.edu for support.

rkalescky commented 4 years ago

Here's an sbatch script for one of my attempts at getting this to work:

#!/bin/bash
#SBATCH -J example
#SBATCH -o example.out
#SBATCH -N 2
#SBATCH -p htc

#module purge
#module load namd/2.12/cpu
#module load gcc-4.8.5
#module load openmpi

NAMD_BIN=/users/rkalescky/NAMD_2.13_Linux-x86_64-verbs

# set Lustre striping
#lfs setstripe -c 4 $SLURM_SUBMIT_DIR

# get number of cores/node (count processor entries rather than taking the last index)
N=$(awk '/^processor/{n++} END{print n}' /proc/cpuinfo)

# generate NAMD nodelist
for n in $(scontrol show hostnames $SLURM_NODELIST); do
    echo "host $n ++cpus $N" >> nodelist.$SLURM_JOBID
done

# calculate total processes (P) and procs per node (PPN)
#PPN=$(($N - 1))
PPN=10
P=$(($PPN * $SLURM_NNODES))

#$NAMD_BIN/charmrun $NAMD_BIN/namd2 ++ppn $PPN ++p $P +replicas $P ++nodelist nodelist.$SLURM_JOBID +setcpuaffinity +isomalloc_sync initial.conf +stdout output/%02d/job00.%02d.log
$NAMD_BIN/charmrun $NAMD_BIN/namd2 ++p $P +replicas $P +setcpuaffinity +isomalloc_sync initial.conf
#charmrun ++verbose ++p $P ++ppn $PPN $(which namd2) ++nodelist nodelist.$SLURM_JOBID +setcpuaffinity +isomalloc_sync initial.conf
#mpirun -np $P $(which namd2) +replicas $P initial.conf +stdout output/%02d/job00.%02d.log

rm nodelist.$SLURM_JOBID
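Since the plain MPI builds do run under srun (per the earlier comments), a possible alternative to charmrun is launching the MPI namd2 directly, as in the commented-out mpirun line above. A minimal sketch follows, assuming an MPI (mpi-linux-x86_64) build of NAMD at the path shown and one replica per rank; the paths and module names are assumptions, not verified on M2.

#!/bin/bash
#SBATCH -J example_mpi
#SBATCH -o example_mpi.out
#SBATCH -N 2
#SBATCH --ntasks-per-node=10
#SBATCH -p htc

module purge
module load intel-2020.0                   # assumed module name; gcc-9.2 also worked in testing

# assumed location of an MPI build of NAMD 2.14
NAMD_BIN=$HOME/NAMD_2.14_Linux-x86_64-MPI

# one replica per MPI rank; +replicas must evenly divide the number of ranks
P=$SLURM_NTASKS

# the +stdout pattern below expects output/00, output/01, ... to exist
for i in $(seq -f "%02g" 0 $(($P - 1))); do mkdir -p output/$i; done

srun -n $P $NAMD_BIN/namd2 +replicas $P initial.conf +stdout output/%02d/job00.%02d.log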