charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

NAMD hangs with UCX machine layer (in SMP mode) on Frontera #2716

Closed nitbhat closed 4 years ago

nitbhat commented 4 years ago

Steps to reproduce:

  1. Download Charm++ 6.10.0 and build the ucx-linux-x86_64-smp target.

  2. Download NAMD and build the Linux-x86_64-g++ target.

  3. Run NAMD on 16 nodes with the following commands (a job-script sketch combining them follows below).

Run command: (from the script) ./Linux-x86_64-g++/charmrun +p 832 ./Linux-x86_64-g++/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 ./runZIKV-50M-atoms.namd ++mpiexec ++remote-shell ibrun

Job submit command: sbatch --job-name=NAMD --nodes=16 --ntasks=64 --time=00:20:00 --partition=normal
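For reference, the run and submit commands fit together roughly like this in a job script (a sketch only; the #SBATCH directives mirror the submit flags above and the run line is copied verbatim, but the partition and paths may need adjusting for your allocation):

    #!/bin/bash
    #SBATCH --job-name=NAMD
    #SBATCH --nodes=16
    #SBATCH --ntasks=64
    #SBATCH --time=00:20:00
    #SBATCH --partition=normal

    # Run command from the steps above (unchanged).
    ./Linux-x86_64-g++/charmrun +p 832 ./Linux-x86_64-g++/namd2 ++ppn 13 \
        +pemap 4-55:2,5-55:2 +commap 0,2,1,3 ./runZIKV-50M-atoms.namd \
        ++mpiexec ++remote-shell ibrun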

This issue seems to be related to the other UCX issues on Frontera: https://github.com/UIUC-PPL/charm/issues/2635 https://github.com/UIUC-PPL/charm/issues/2636

PhilMiller commented 4 years ago

From discussion with @nitbhat

We had attributed the problem entirely to the UCX layer. However, on some of my recent runs on Frontera the issue showed up with the MPI layer as well (not with Intel MPI, but with MPICH and OpenMPI). So although it occurs more frequently with the UCX layer, it also shows up with the MPI layer. Since the pre-installed NAMD binaries were built with Intel MPI, that may be why they are not affected (that's my guess).

Since the bug shows up with different applications (NAMD, Enzo-P, ChaNGa with a very large dataset), I'm guessing it is not an application bug. Since it shows up with different machine layers (UCX, MPI), I'm guessing it's not a machine layer bug either. That leaves two major candidates:

• The Charm++ RTS: maybe there is a rare underlying Charm++ bug that is triggered by Frontera's configuration.
• A common library specific to Frontera: to test this, I ran the Frontera binaries on another machine (Thor, with similar hardware) and the bug didn't show up there.

With that hypothesis, what does ldd binary_name print out on the different systems?

On Frontera, the UCX binary for Enzo-P prints this:

login2.frontera(1005)$ ldd enzo-p-nonsmp-ucx
linux-vdso.so.1 =>  (0x00007ffdb80f8000)
/opt/apps/xalt/xalt/lib64/libxalt_init.so (0x00002ae9cfd96000)
libucp.so.0 => /opt/apps/ucx/1.6.1/lib/libucp.so.0 (0x00002ae9cffc3000)
libuct.so.0 => /opt/apps/ucx/1.6.1/lib/libuct.so.0 (0x00002ae9d020d000)
libucs.so.0 => /opt/apps/ucx/1.6.1/lib/libucs.so.0 (0x00002ae9d042b000)
libucm.so.0 => /opt/apps/ucx/1.6.1/lib/libucm.so.0 (0x00002ae9d0786000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ae9d099b000)
libhdf5.so.8 => /scratch1/03808/nbhat4/hdf5-1.8.13/build/lib/libhdf5.so.8 (0x00002ae9d0b9f000)
libz.so.1 => /lib64/libz.so.1 (0x00002ae9d0e81000)
libpng15.so.15 => /lib64/libpng15.so.15 (0x00002ae9d1097000)
libgfortran.so.5 => /opt/apps/gcc/9.1.0/lib64/libgfortran.so.5 (0x00002ae9d12c2000)
libboost_filesystem.so.1.69.0 => /opt/apps/gcc9_1/boost/1.69/lib/libboost_filesystem.so.1.69.0 (0x00002ae9cfbb8000)
libboost_system.so.1.69.0 => /opt/apps/gcc9_1/boost/1.69/lib/libboost_system.so.1.69.0 (0x00002ae9cfbd6000)
libm.so.6 => /lib64/libm.so.6 (0x00002ae9d1551000)
libstdc++.so.6 => /opt/apps/gcc/9.1.0/lib64/libstdc++.so.6 (0x00002ae9d1853000)
libgcc_s.so.1 => /opt/apps/gcc/9.1.0/lib64/libgcc_s.so.1 (0x00002ae9cfbdc000)
libc.so.6 => /lib64/libc.so.6 (0x00002ae9d1a33000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae9d1e00000)
librt.so.1 => /lib64/librt.so.1 (0x00002ae9d201c000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00002ae9d2224000)
/lib64/ld-linux-x86-64.so.2 (0x00002ae9cfb72000)
libquadmath.so.0 => /opt/apps/gcc/9.1.0/lib/../lib64/libquadmath.so.0 (0x00002ae9cfbf8000)

My next question is how reproducible is the failure (hang or crash), and how quickly does it occur? Is it always during application startup, or does NAMD start making simulation progress before failure?

nitbhat commented 4 years ago

For the NAMD failure, specifically, here are the details:

• The failure is a hang that occurs with the runZIKV-50M-atoms.namd dataset.
• With the UCX layer, it is consistently reproducible (happens on every run) on 16 nodes.
• IIRC, I was not able to reproduce it at smaller node counts. That's typical of the bugs we've seen on Frontera: the probability of a hang increases with the number of nodes.
• With the UCX layer, it seems to happen soon after initialization and during the simulation, specifically during minimization. (I'm attaching a sample NAMD output that hangs.) namd-output-16nodes-ucx-680032.txt

• With the MPI layer (for all backends: IMPI built on UCX, MPICH built on UCX, OMPI built on UCX, MPICH not built on UCX, OMPI not built on UCX), I've not seen the hang so far. The application is slower but keeps making progress through minimization. That said, I've never seen it complete either, because it always exceeds the maximum execution time. I could try reducing the dataset to confirm that the MPI-layer NAMD builds complete without any trouble.

PhilMiller commented 4 years ago

Does the failure occur with smaller NAMD data sets? https://twitter.com/RommieAmaro/status/1241880936641220609 says the sims they're running are 1.6M atoms, including fully explicit solvent

PhilMiller commented 4 years ago

@jcphill Can we get a copy of a data set that's involved in the current work? Need to see if we can get reliable runs on the actual target, or focus debugging efforts on the issues it encounters

nitbhat commented 4 years ago

I only ran with the smaller datasets a few times and didn't see the hang.

However, @jcphill just mentioned on the #namd channel (on PPL slack) that he did see occasional hangs for the small short jobs as well.

PhilMiller commented 4 years ago

Does NAMD have issues with non-SMP UCX?

trquinn commented 4 years ago

For ChaNGa with the UCX layer on Frontera, hangs appear to always occur either at "Domain decomposition" (essentially an all-to-all between elements of a chare array to get data into the right geometrical domain) or at load balancing, where nearly all elements of a chare array are moved to a different PE.

With MPI, the main issue on Frontera is that it is very slow. Let me know if you want me to produce a "minimum data set" that reproduces the problem with ChaNGa.

PhilMiller commented 4 years ago

UCX-Verbs-Frontera Expected speedups if we can fix this

nitbhat commented 4 years ago

Does NAMD have issues with non-SMP UCX?

I haven't tested this so far. Let me do this right now.

However, Enzo-P which was seeing a similar hang (for 4 nodes and more consistently for 64 nodes) was using non-SMP UCX.

PhilMiller commented 4 years ago

If the same apparent hang arises with non-SMP, that's probably easier to isolate. OTOH, it may be easier to get stacks once it's hung with CharmDebug on an SMP build, since the comm thread will likely be answering CCD messages regardless of what the workers are doing.

Do we have a stack snapshot for a hung run? Collected with STAT, CharmDebug, manual gdb, whatever?

nitbhat commented 4 years ago

I'll post here when I determine the results from the non-SMP runs.

I had gotten stack traces for Enzo-P's hang on 4 nodes (224 processes), but not for NAMD. I can try getting them today.

nitbhat commented 4 years ago

All of Frontera's queues are down right now (it happens very often) and there seems to be an issue with their work filesystem as well.

I am going to rebuild all my binaries again and wait for the queues to come back up to test them.

PhilMiller commented 4 years ago

I can't really make sense of the stack trace files posted for Enzo-P in #2635. In the process 0 file, I see multiple threads interleaved, and several repetitions of each thread. Is that where you were 'polling' backtraces after the hang?

nitbhat commented 4 years ago

Yeah, IIRC I manually logged in to each of the nodes and iterated over all the running processes to attach gdb and dump the stack traces.

nitbhat commented 4 years ago

@PhilMiller I tried NAMD non-SMP with both the UCX and MPI machine layers. Both crash soon after startup because they fail to allocate memory.


Info: Entering startup at 1.11557e+06 s, 16272.4 MB of memory in use
Info: Startup phase 0 took 0.042994 s, 16272.4 MB of memory in use
FATAL ERROR: Memory allocation failed on processor 404.
------------- Processor 404 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Memory allocation failed on processor 404.

FATAL ERROR: Memory allocation failed on processor 406.
------------- Processor 406 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR:
nitbhat commented 4 years ago

@PhilMiller The queues on Frontera are back, and it seems like the work directory is active again. I am able to get the 16-node UCX-smp job to hang. What would be a good method for getting the stack traces? I think CharmDebug currently doesn't support non-netlrts-based layers, so we can either use GDB or STAT (it looks like I'll have to install STAT from scratch; I don't see it pre-installed).

PhilMiller commented 4 years ago

With gdb, you probably want something like

ssh node gdb -p '$(pgrep namd_binary_name)' -ex "thread apply all bt" -ex "quit" > output.log

You'll have to be a bit fancier with the scripting when there are multiple processes per node, making it loop over them or whatever.
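Something like the following, run on each node, would cover all of them (just a sketch; the binary name namd2 and the output path are placeholders):

    # Dump all threads of every namd2 process on this node into one log file.
    for pid in $(pgrep namd2); do
        echo "=== $(hostname) PID $pid ==="
        gdb --batch -p "$pid" -ex "thread apply all bt"
    done > stacks.$(hostname).log 2>&1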

The big question is whether any processes are doing anything other than spinning in idle.

nitbhat commented 4 years ago

I spent some time getting the list of nodes for a running job, and I now have a script in place that takes a job id and dumps the stack traces from all the processes across the remote hosts. I was able to get a NAMD job to hang on 16 physical nodes (4 processes per host, ppn=13, i.e. 13 worker threads and 1 comm thread per process) and then collected the stack traces. I've printed info like Host:0, Process:0, ProcessID: 156951 before each gdb -p output.

stack_trace_hang_696750.txt

This is a fairly large file with stack traces from 64 processes. I've yet to go over it.
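Roughly, the collection script does something like the following (a simplified sketch rather than the exact script; it assumes Slurm's squeue/scontrol to expand the job's node list and passwordless ssh to the compute nodes):

    #!/bin/bash
    # Usage: ./collect_stacks.sh <slurm_job_id> [binary_name]
    # For every node of the job, dump "thread apply all bt" for each matching
    # process, labeled with Host/Process/ProcessID as in the attached file.
    jobid=$1
    binary=${2:-namd2}
    out=stack_trace_hang_${jobid}.txt
    : > "$out"

    hostnum=0
    for host in $(scontrol show hostnames "$(squeue -j "$jobid" -h -o %N)"); do
        procnum=0
        for pid in $(ssh "$host" pgrep "$binary"); do
            echo "Host:${hostnum}, Process:${procnum}, ProcessID: ${pid}" >> "$out"
            ssh "$host" "gdb --batch -p $pid -ex 'thread apply all bt'" >> "$out" 2>&1
            procnum=$((procnum + 1))
        done
        hostnum=$((hostnum + 1))
    done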

nitbhat commented 4 years ago

From the output, it looks like Thread 3 is always the comm thread, and there is an extra thread, Thread 16, which looks like a UCX-reserved thread that keeps polling for messages (@brminich, correct me if I'm wrong).

Stack trace from Thread 16 of one of the processes:

Thread 16 (Thread 0x2b0594406700 (LWP 166871)):
#0  0x00002b0592074603 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b059114be2c in ucs_async_thread_func (arg=0x1bf4c90) at async/thread.c:93
#2  0x00002b0590a8ddd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b059207402d in clone () from /lib64/libc.so.6

In parallel, I'm also trying to see if I can get a smaller job to hang, in order to make it easier to debug.

nitbhat commented 4 years ago

Sorry, I accidentally closed this.

jcphill commented 4 years ago

Any progress or updates on this?

nitbhat commented 4 years ago

No real progress; we've yet to determine the cause of the bug (i.e. the hang).

I tried running 200 iterations of smaller jobs (stmv and apoa1) on 2-4 nodes and couldn't get any of them to hang on Frontera (tried with ldBalancer none & commOnly on as well).

The Enzo-P hang seems to be more consistently reproducible on a smaller set of nodes (4 nodes). For that reason, I collected the stack traces for the Enzo-P hang in https://github.com/UIUC-PPL/charm/issues/2635; going through the traces from a 224-process run, it looks like all PEs are spinning in the Converse scheduler loop. That suggests there might be a message that never makes it to the right message handler or entry method, and that's causing the hang.

Currently, I'm writing some small tests (a bombard test, kNeighbors, and randomTTL) to try to reproduce the hang on Frontera with something easier to debug.

Additionally, @lvkale, @ericjbohm and I discussed that it might be worthwhile to have some infrastructure in Charm++ that can confirm message delivery (an ack for every sent message, to catch messages that are never received).

trquinn commented 4 years ago

A data point from ChaNGa: I can get a hang on 2 nodes (SMP) with the dwf1.6144 dataset in the initial domain decomposition. The hang goes away if I reduce the size of the chare arrays. But I also notice that for smaller chare arrays the domain decomposition time grows out of proportion to the array size:

array size | DD time
1024 | 108s
2048 | 288s
4096 | hangs (10 minute timeout)

I tried a gprof (not sure what this reports in SMP mode) to attempt to track down where the time is going, and I got (for the 2048 case):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
30.33    539.18   539.18                             CcdCallBacks
27.07   1020.49   481.31                             gethrctime
 7.82   1159.52   139.03                             CcdRaiseCondition
 7.14   1286.45   126.93                             CsdNextMessage
 4.40   1364.65    78.20                             CsdScheduleForever
 2.19   1403.67    39.02                             CmiNotifyStillIdle(CmiIdleState*)
 1.91   1437.63    33.96                             CmiGetNonLocal
 1.62   1466.42    28.79                             CmiCpuTimer
 1.62   1495.21    28.79                             inithrc
 1.60   1523.64    28.43                             CsdSchedulerState_new
 1.52   1550.64    27.00                             CmiGetNonLocalNodeQ()
 1.49   1577.16    26.52                             CqsDequeue
 1.13   1597.30    20.14                             CmiWallTimer
 1.09   1616.73    19.43                             CqsLength
 1.06   1635.51    18.78                             CqsEnqueueFifo

nitbhat commented 4 years ago

Thanks @trquinn, that's great. I haven't gotten anything to hang on fewer than 3 nodes. I can try reproducing this, get stack traces for that run, and see if anything suspicious shows up.

From the gprof output, it looks like most of the time is spent in the idle scheduler loop. (A lot of those Ccd, Csd and timer calls are called when the scheduler is just checking different queues to see if there is an incoming message).

nitbhat commented 4 years ago

How would you characterize the messaging pattern during the domain decomposition? Is it small and bursty?

trquinn commented 4 years ago

This is a lot of very small messages all at once; essentially an "all-to-all" among the chares. To reproduce:

export UCX_ZCOPY_THRESH=-1
ppn=55
ibrun ./ChaNGa.smp +useckloop +bench -wall 10 -v 1 -p 4096 -maxbal 10.0 ++ppn $ppn +setcpuaffinity +commap 0 +pemap 2-54:2,1-55:2 dwf1.6144.param >& DIAG.$SLURM_JOB_NUM_NODES.$qcount

Lowering the "-p 4096" will get it to run (but on other machines, I run this benchmark with -p 32768").

trquinn commented 4 years ago

I did some Projections analysis. Here is a timeline screenshot: TimelineScreenshot.pdf. Message sizes run from 880 bytes to a few tens of kilobytes. Note that there are latencies on the order of 300 seconds, even when PEs are on the same node!

trquinn commented 4 years ago

Updated: TimelineScreenshot.pdf

Here is a screenshot that includes comm tracing. Note that the comm threads are swamped. Something else that is peculiar: the pink entries are updateLocation messages. No load balancing has been done, so why is Charm++ unaware of the array element placement?

epmikida commented 4 years ago

That second PDF is just showing up blank for me, but I'll look into the location messages. Does ChaNGa do anything with demand creation, or dynamic insertion? Those can also potentially trigger updateLocation messages.

trquinn commented 4 years ago

Projections "Save all timelines to image" seems not to work. I've updated the PDF to be just visible PEs. (Maybe getting off topic here): The code for the chare array creation is: CkArrayOptions opts(numTreePieces); CProxy_BlockMap myMap = CProxy_BlockMap::ckNew(); opts.setMap(myMap); CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(opts); Updated: I forgot a key line: the declaration of "opts" which I just added. 'numTreePieces' is an integer read from the command line.

epmikida commented 4 years ago

So based on this, your array is using dynamic insertion, since the array is initially created with 0 elements. The elements are created at some later point, either via demand creation or via insert calls. This can result in updateLocation calls when a PE first tries to message an element it doesn't know about. It needs to know the element's ID to pack into the message, but all it has is the index. So it sends a location request to the homePe, and the homePe replies with an updateLocation message containing the ID of the element.

For most applications, the fix for this is to set bounds on the array's index space. If bounds are set, then Charm++ can automatically derive an ID from an index with a simple bit-packing scheme. However, if I recall correctly, ChaNGa uses a user-defined array index? We don't currently have a solution for that other than the default ID creation scheme, which requires these location request/reply messages.

trquinn commented 4 years ago

TimelineScreenshot-MPI.pdf

As a control, I ran the same simulation with the MPI SMP charm build (Intel MPI). The resulting timelines are attached. The comm thread is still being hammered; the difference is that the total time is a factor of 10 smaller than with UCX! One hypothesis, therefore, is that message processing by the UCX comm thread is very computationally expensive, possibly exacerbated when there are a lot of messages to process.

epmikida commented 4 years ago

Projections "Save all timelines to image" seems not to work. I've updated the PDF to be just visible PEs. (Maybe getting off topic here): The code for the chare array creation is: CkArrayOptions opts(numTreePieces); CProxy_BlockMap myMap = CProxy_BlockMap::ckNew(); opts.setMap(myMap); CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(opts); Updated: I forgot a key line: the declaration of "opts" which I just added. 'numTreePieces' is an integer read from the command line.

Hmm, that's interesting. Without dynamic creation or migration, the only other thing I can think of that would generate updateLocation messages is array messages that take multiple bounces to get to their location, which also shouldn't be happening here.

There's also a parameter, messageBufferingThreshold, which could cause large messages to trigger updateLocation messages, but AFAIK that parameter is off by default, unless UCX handles it differently?

stwhite91 commented 4 years ago

UCX doesn't know anything about the "_messageBufferingThreshold" inside CkLocMgr.

If you can insert a breakpoint on CkLocMgr::multiHop() defined in charm/src/ck-core/cklocation.C that would tell us if there's something weird going on there

epmikida commented 4 years ago

Yeah I just talked to @nitbhat about this, he's going to try the example Tom sent above to see if he can reproduce.

@trquinn how many array elements (numTreePieces) are created in the example above?

nitbhat commented 4 years ago

@trquinn: I followed these steps:

  1. Building charm:

    module load impi/19.0.5 && which mpicc && ./build ChaNGa ucx-linux-x86_64 --enable-error-checking --suffix=changa-prod --with-production --basedir=/work/03808/nbhat4/frontera/ucx/build_1.6.1 -j16

  2. Building ChaNGa:

    export CHARM_DIR=/work/03808/nbhat4/frontera/charm/ucx-linux-x86_64-smp-changa-prod
    ./configure
    make -j56

  3. Job script:

    #!/bin/bash
    #SBATCH -J changa_2nodes
    #SBATCH -p normal
    #SBATCH -t 00:30:00
    #SBATCH -N 2
    #SBATCH -n 2
    #SBATCH --ntasks-per-node=1
    #SBATCH -A ASC20007
    #SBATCH -o ./results/changa-output-2nodes-ucx-prod-dwf-%A.out

    cd /scratch1/03808/nbhat4/changa/

    module load impi
    which mpirun

    export UCX_ZCOPY_THRESH=-1
    ppn=54

    echo "2 nodes, 2 ranks, 55 ppn, ucx"
    ibrun ./ChaNGa.prod.smp +useckloop +bench -wall 10 -v 1 -p 4096 -maxbal 10.0 ++ppn $ppn +setcpuaffinity +commap 0 +pemap 2-54:2,1-55:2 dwf1.6144.param
    echo "2 nodes, 2 ranks, 55 ppn, ucx"


With this, I saw that in 2 mins or so, the output reaches:

Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.


And then, after about 15 mins, I see a crash due to running out of memory:

Loading particles ... trying Tipsy ... took 1.572218 seconds.
N: 51794908
Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
 total 1126.44 seconds.
Initial load balancing ... Orb3dLB_notopo: Step 0
numActiveObjects: 4096, numInactiveObjects: 0
active PROC range: 0 to 107
Migrating all: numActiveObjects: 4096, numInactiveObjects: 0
[Orb3dLB_notopo] sorting
***
Orb3dLB_notopo stats: maxObjLoad 0.000795
Orb3dLB_notopo stats: minWall 0.521159 maxWall 0.540816 avgWall 0.522287 maxWall/avgWall 1.035475
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.506066 avgIdle 0.495555 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.007944 maxPred 0.011722 avgPred 0.010661 maxPred/avgPred 1.099506
Orb3dLB_notopo stats: minPiece 37.000000 maxPiece 39.000000 avgPiece 37.925926 maxPiece/avgPiece 1.028320
Orb3dLB_notopo stats: minBg 0.010114 maxBg 0.540137 avgBg 0.016071 maxBg/avgBg 33.608630
Orb3dLB_notopo stats: orb migrated 4080 refine migrated 0 objects
Building trees ... took 0.498366 seconds.
Maximum memory: -11 MB
Calculating gravity (tree bucket, theta = 0.700000) ... [1586382996.141156] [c191-011:425253:0] rcache.c:123 UCX ERROR mmmap(size=36864) failed: Cannot allocate memory
[1586382996.141649] [c191-011:425253:0] mpool.c:187 UCX ERROR Failed to allocate memory pool (name=rcache_inv_mp) chunk: Out of memory
[1586382996.141656] [c191-011:425253:0] rcache.c:336 UCX ERROR Failed to allocate invalidation entry for 0x2bcf40000000..0x2bcf44000000, data corruption may occur
[1586382996.141706] [c191-011:425253:0] rcache.c:123 UCX ERROR mmmap(size=36864) failed: Cannot allocate memory



So, I'm not able to reproduce the hang, but I do get a crash that looks like a hang because nothing is printed for 15 minutes or so. Let me know if I'm doing something wrong.

I'll try determining the cause of the `updateLocation` messages.
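Since the run goes silent for about 15 minutes before the UCX out-of-memory errors appear, a crude per-node memory poll running alongside the job might help distinguish a genuine hang from a slow slide into memory exhaustion (a sketch; the interval and log path are arbitrary):

    # Start on each compute node (e.g., via ssh) while the job is running;
    # records used memory every 30 seconds.
    while true; do
        echo "$(date +%s) $(free -m | awk '/^Mem:/ {print $3 " MB used"}')"
        sleep 30
    done >> /tmp/mem.$(hostname).log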
nitbhat commented 4 years ago

Also, I used ChaNGa's master branch. Let me know if you were using a specific branch.

nitbhat commented 4 years ago

Yeah I just talked to @nitbhat about this, he's going to try the example Tom sent above to see if he can reproduce.

@trquinn how many array elements (numTreePieces) are created in the example above?

It looks like 4096 tree pieces are created.

trquinn commented 4 years ago

Note that @nitbhat's run took over 1000 seconds to do domain decomposition with 4096 pieces. My Projections runs were done with 2048 pieces and took "only" 300 seconds. The MPI version took 30 seconds. Note that I put a CkExit() in after domain decomposition so I could focus on the DD problem. I have a hunch that the later memory problem is also due to the message bottleneck we are seeing with DD. If you also want to stop after DD, make this change:

diff --git a/ParallelGravity.cpp b/ParallelGravity.cpp
index 4d13d2a..3d045e0 100644
--- a/ParallelGravity.cpp
+++ b/ParallelGravity.cpp
@@ -1662,6 +1662,7 @@ void Main::domainDecomp(int iPhase)
     timings[iPhase].tDD += tDD;
     CkPrintf("total %g seconds.\n", tDD);

+    CkExit();
     if(verbosity && !bDoDD)
         CkPrintf("Skipped DD\n");
 }
nitbhat commented 4 years ago

In my Charm++ checkout, I added the following prints:


login4.frontera(1024)$ git diff
diff --git a/src/ck-core/cklocation.C b/src/ck-core/cklocation.C
index 1e38fbd45..20f92ff68 100644
--- a/src/ck-core/cklocation.C
+++ b/src/ck-core/cklocation.C
@@ -2403,6 +2403,7 @@ void CkLocMgr::requestLocation(CmiUInt8 id, const int peToTell,

 void CkLocMgr::updateLocation(const CkArrayIndex &idx, CmiUInt8 id, int nowOnPe) {
   DEBN(("%d updateLocation for %s on %d\n", CkMyPe(), idx2str(idx), nowOnPe));
+  CmiPrintf("[%d][%d][%d] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for %s on %d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), nowOnPe);
   inform(idx, id, nowOnPe);
   deliverAnyBufferedMsgs(id, bufferedRemoteMsgs);
 }
@@ -2769,6 +2770,7 @@ void CkLocMgr::multiHop(CkArrayMessage *msg)
 {
        CK_MAGICNUMBER_CHECK
        int srcPe=msg->array_getSrcPe();
+  CmiPrintf("[%d][%d][%d] ############################### multiHop and srcPe is %d\n", CkMyPe(), CkMyNode(), CkMyRank(), srcPe);
        if (srcPe==CkMyPe())
           DEB((AA "Odd routing: local element %u is %d hops away!\n" AB, msg->array_element_id(),msg->array_hops()));
        else
@@ -3028,6 +3030,7 @@ void CkLocMgr::metaLBCallLB(CkLocRec *rec) {
 */
 void CkLocMgr::immigrate(CkArrayElementMigrateMessage *msg)
 {
+  CmiPrintf("[%d][%d][%d] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ immigrate\n", CkMyPe(), CkMyNode(), CkMyRank());
        const CkArrayIndex &idx=msg->idx;

        PUP::fromMem p(msg->packData);

In my output, I initially see a lot of updateLocation prints (not all of them are shown here):

  66 Charm++> Running on 2 hosts (2 sockets x 28 cores x 1 PUs = 56-way SMP)
  67 Charm++> cpu topology info is gathered in 0.018 seconds.
  68 WARNING: +useckloop is a command line argument beginning with a '+' but was not parsed by the RTS.
  69 WARNING: +bench is a command line argument beginning with a '+' but was not parsed by the RTS.
  70 If any of the above arguments were intended for the RTS you may need to recompile Charm++ with different options.
  71 [0] MultistepLB_notopo created
  72 WARNING: bStandard parameter ignored; Output is always standard.
  73 Using CkLoop 1
  74 ChaNGa version 3.4, commit v3.4-57-g3f4947c
  75 Running on 108 processors/ 2 nodes with 4096 TreePieces
  76 yieldPeriod set to 5
  77 Prefetching...ON
  78 Number of chunks for remote tree walk set to 1
  79 Chunk Randomization...ON
  80 cache 1
  81 cacheLineDepth 4
  82 Verbosity level 1
  83 Domain decomposition...SFC Peano-Hilbert
  84 CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)
  85 Created 4096 pieces of tree
  86 [0][0][0] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for  on 17
  87 [6][0][6] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 6 on 0
  88 [0][0][0] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 19 on 51
  89 [17][0][17] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 17 on 0

But I don't see any multiHop prints until much later in the output:

4142 [21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 3909 on 102
4143 [21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 2505 on 65
4144 [21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 3153 on 82
4145  took 1.108604 seconds.
4146 N: 51794908
4147 Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
4148 Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
4149 WARNING: Could not open redshift input file: dwf1.6144.bench.red
4150 Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
4151 [82][1][28] ############################### multiHop and srcPe is 88
4152 [82][1][28] ############################### multiHop and srcPe is 86
4153 [82][1][28] ############################### multiHop and srcPe is 67

So it looks like multiHop is not involved in the code path that produces the initial updateLocation messages.

brminich commented 4 years ago

I suspect one flow in the UCX ML; let me prepare a patch to try.

nitbhat commented 4 years ago

On @jcphill's suggestion, I made the following changes to NAMD to trigger quiescence detection (QD) before the minimization phase (where the hang is seen).

diff --git a/src/Controller.C b/src/Controller.C
index f5d8d172..bac79b4e 100644
--- a/src/Controller.C
+++ b/src/Controller.C
@@ -284,6 +284,7 @@ void Controller::algorithm(void)
   int scriptSeq = 0;
   BackEnd::awaken();
   while ( (scriptTask = broadcast->scriptBarrier.get(scriptSeq++)) != SCRIPT_END) {
+    BackEnd::awaken();
     switch ( scriptTask ) {
       case SCRIPT_OUTPUT:
         enqueueCollections(FILE_OUTPUT);
@@ -403,7 +404,6 @@ void Controller::algorithm(void)
       default:
         NAMD_bug("Unknown task in Controller::algorithm");
     }
-    BackEnd::awaken();
   }
   enqueueCollections(END_OF_RUN);
   outputExtendedSystem(END_OF_RUN);
diff --git a/src/Node.C b/src/Node.C
index d4f70ac8..42e762ed 100644
--- a/src/Node.C
+++ b/src/Node.C
@@ -1335,11 +1335,13 @@ void Node::enableExitScheduler(void) {
   if ( CkMyPe() ) {
     sendEnableExitScheduler();
   } else {
+    CmiPrintf("[%d][%d][%d] CkStartQd\n", CmiMyPe(), CmiMyNode(), CmiMyRank());
     CkStartQD(CkIndex_Node::exitScheduler(), &thishandle);
   }
 }

 void Node::exitScheduler(void) {
+  CmiPrintf("[%d][%d][%d] Node::exitScheduler\n", CmiMyPe(), CmiMyNode(), CmiMyRank());
   //CmiPrintf("exitScheduler %d\n",CkMyPe());
   CsdExitScheduler();
 }

Running this without load balancing, I see that the hang is reached but quiescence is never detected. NAMD output: namd-output-16nodes-smp-ucx-prod-jim4-723992.out.txt (look for the CkStartQd print before minimization).

This strengthens the hypothesis of there being missing/undelivered message(s).

nitbhat commented 4 years ago

@epmikida I saw that CkLocMgr::informHome calls updateLocation in the beginning (before Initial Domain decomposition is printed) and CkLocMgr::multiHop calls updateLocation after Initial Domain decomposition is printed. Does that indicate anything weird going on?

epmikida commented 4 years ago

Hmm, that's weird. informHome is called during migration and creation of elements, to let the homePe know where the element is. So in this case, it's called via element creation.

But the only time it sends updateLocation messages is when the element is not created on its homePe. I don't know why that would be the case here, since it is just doing static creation? Unless there's something weird in the map?

brminich commented 4 years ago

@trquinn, @nitbhat, can you please try running with the #2799 patch? I'd expect it to fix the UCX ML slowness in certain cases (when lots of small messages are arriving).

ericjbohm commented 4 years ago

Has anyone checked if there is a bug in the PE mapping on Frontera? If all processes do not agree on where some PE is, you might get this sort of behavior.


nitbhat commented 4 years ago

@epmikida: I added the following prints.

Inside createLocal:

    CmiPrintf("[%d][%d][%d] ***************** CkLocMgr::createLocal, calling informHome id:%s and now on pe:%d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), CkMyPe());

Inside updateLocation:

    CmiPrintf("[%d][%d][%d] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation 1 for id:%s on %d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), nowOnPe);

Inside informHome:

    CmiPrintf("[%d][%d][%d] ***************** CkLocMgr::informHome, calling updateLocation 1 to move id:%s to %d (nowOnPe:%d)\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), home, nowOnPe);

And with that (and some other non-relevant prints), I see the following output: changa-output-2nodes-ucx-debug-dwf-725634.out.txt

It's the createLocal calls that invoke informHome and finally updateLocation. From the output I see that 4059 of the 4096 chares are calling updateLocation from informHome (so 4059 chares meet the condition home!=CkMyPe() && home!=nowOnPe in informHome).

epmikida commented 4 years ago

Yeah this is really weird. Based on the output, 4059 chares think they are being created on a PE other than their home, so they are sending an updateLocation to their home so that it knows where the element currently is.

Possibilities for why this might happen:

After some manual inspection of the output, the homePe seems correct for each element, i.e. element 1 thinks its homePe is 1, element 2 thinks its homePe is 2, etc. So the homePe calculation of the BlockMap is probably not the culprit.

One other oddity in the output is that there were duplicates of some elements (e.g. element 4049 appears to be created on two different PEs). There are also some elements being created that shouldn't exist at all. For example, the output has this line:

[38][0][38] ***************** CkLocMgr::informHome, calling updateLocation 1 to move id:9199 to 45 (nowOnPe:38)

These could possibly just be mangled output? But still weird.

nitbhat commented 4 years ago

I've attached the output from ChaNGa when running it on the MPI layer instead of UCX. I still see 4059 chares that are calling updateLocation.

changa-output-2nodes-mpi-debug-dwf-728723.out.txt

Yeah, I'm not sure how that id:9199 showed up (I'm hoping it was just mangled output, but I can watch out for weird ids).

epmikida commented 4 years ago

I've attached the output from ChaNGa when running it on the MPI layer instead of UCX. I still see 4059 chares that are calling updateLocation.

changa-output-2nodes-mpi-debug-dwf-728723.out.txt

yeah, I'm not sure how that id:9199 showed up. (hoping that it was just mangled output, but I can watch out for weird ids).

That one also has a number of weird IDs at the end, and some duplicated chares. Another odd thing is that the duplication doesn't show up in the createLocal call, just the informHome call, and the two calls don't seem to agree on the homePe of the object.