charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

NAMD hangs with mpi-smp build on DGX A100 #3519

Open dmclark17 opened 2 years ago

dmclark17 commented 2 years ago

I am observing hangs with NAMD 2.15alpha2 and Charm++ 7.0.0 on a DGX A100 node (dual AMD Rome 7742). The issue also occurs with Charm++ v7.1.0-devel-89-ge24d2d3ad. I am building Charm++ with the following command:

./build charm++ ucx-linux-x86_64 gcc slurmpmi2 smp --with-production --enable-error-checking

I am running NAMD with the following command:

srun --mpi=pmi2 --container-image="${CONT}" --container-mount-home --container-mounts=$PWD:$PWD --container-workdir=$PWD bash -c 'source setenv.sh && cd DATA && CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 /namd/namd/Linux-x86_64-g++.smp-cuda-memopt/namd2 +ppn15 +setcpuaffinity +pemap 0-14,16-30,32-46,48-62,64-78,80-94,96-110,112-126 +commap 15,31,47,63,79,95,111,127 20stmv2fs.namd'
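For reference, with +ppn15 and the eight +commap entries this launches 8 processes (logical nodes), each with 15 worker PEs pinned via +pemap plus one communication thread pinned via +commap, i.e. 8 × (15 + 1) = 128 hardware threads in total.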

The stall seems to happen in _initCharm; here is the backtrace when I attach to one of the processes:

#0  0x000055dca18eda02 in CmiGetNonLocalNodeQ() ()
#1  0x000055dca18a68c4 in CsdNextMessage ()
#2  0x000055dca18a6db8 in CsdSchedulePoll ()
#3  0x000055dca18af635 in CmiCheckAffinity() ()
#4  0x000055dca188d358 in _initCharm(int, char**) ()
#5  0x000055dca103b772 in master_init (argc=9, argv=0x7ffc9ac1de08) at src/BackEnd.C:176
#6  0x000055dca0f35563 in main (argc=9, argv=0x7ffc9ac1de08) at src/mainfunc.C:49

This seems like the same issue as #2850, but it is not resolved by the development branch or by using +no_isomalloc_sync. The issue does not reproduce with Charm++ 6.10.2. Please let me know if I can provide additional information. Thanks!

rbuch commented 2 years ago

@evan-charmworks Any idea what could be going on here?

@dmclark17 I don't think we've seen this on any other machines. I'm not sure whether we have access to a DGX A100, but I can try to reproduce this on a DGX-2. In the meantime, if you have time, isolating the offending commit with git bisect would be helpful. Also, do you know if this happens when not using a container?
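For the bisect, something along these lines should work, assuming the release tags v6.10.2 (known good) and v7.0.0 (known bad); this is just a sketch, so adjust the build line and the reproducer to your setup:

git clone https://github.com/charmplusplus/charm && cd charm
git bisect start
git bisect bad v7.0.0      # hangs
git bisect good v6.10.2    # works
# at each revision git checks out: rebuild, rerun the reproducer, then mark the result
./build charm++ ucx-linux-x86_64 gcc slurmpmi2 smp --with-production --enable-error-checking
git bisect good            # or: git bisect bad if it hangs again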

Separately, for single-node runs (as I assume you are doing here) we generally recommend a multicore build, so it might be worth trying that if this is blocking production runs.
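For reference, a single-node multicore build would look roughly like this (it runs everything in a single process with no network layer and no separate communication threads, so +commap would not be needed):

./build charm++ multicore-linux-x86_64 gcc --with-production --enable-error-checking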

evan-charmworks commented 2 years ago

@dmclark17 How many logical nodes does your job invocation run on?

dmclark17 commented 2 years ago

I do not know whether the issue reproduces outside of a container, but I can experiment with that as well as with git bisect. As another data point, the simplearrayhello test also seems to hang in _initCharm when running in a container. I am using a single node right now but plan to scale to multiple nodes.

I am using 8 logical nodes for these tests.
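For reference, this is roughly how I am running the hello test inside the container (a sketch; the build directory name and launch flags are approximate, mirroring the NAMD invocation):

cd <charm-build-dir>/tests/charm++/simplearrayhello   # e.g. ucx-linux-x86_64-slurmpmi2-smp-gcc
make
srun --mpi=pmi2 -n 8 ./hello +ppn15 +setcpuaffinity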