nitbhat closed this issue 4 years ago.
From discussion with @nitbhat
We had attributed the problem entirely to the UCX layer. However, on some of my recent runs on Frontera, I saw the issue show up with the MPI layer as well (not with Intel MPI, but with MPICH and OMPI). So although it occurs more frequently with the UCX layer, it also shows up with the MPI layer. Since the pre-installed NAMD binaries were built using Intel MPI, they may not be hitting the issue (that's my guess).
Since the bug shows up with different applications (NAMD, Enzo-P, ChaNGa with a very large dataset), I'm guessing it is not an application bug. Since the bug shows up with different machine layers (UCX, MPI), I'm guessing it's not a machine layer bug. That leaves two major candidates:
• Charm++ RTS (?): maybe there is a rare underlying Charm++ bug that is triggered by Frontera's configuration.
• A common library specific to Frontera: to test this I ran the Frontera binaries on another machine (Thor, with similar hardware) and the bug didn't show up there.
With that hypothesis, what does ldd binary_name print out on the different systems?
On Frontera, the UCX binary for Enzo-P prints this:
login2.frontera(1005)$ ldd enzo-p-nonsmp-ucx
linux-vdso.so.1 => (0x00007ffdb80f8000)
/opt/apps/xalt/xalt/lib64/libxalt_init.so (0x00002ae9cfd96000)
libucp.so.0 => /opt/apps/ucx/1.6.1/lib/libucp.so.0 (0x00002ae9cffc3000)
libuct.so.0 => /opt/apps/ucx/1.6.1/lib/libuct.so.0 (0x00002ae9d020d000)
libucs.so.0 => /opt/apps/ucx/1.6.1/lib/libucs.so.0 (0x00002ae9d042b000)
libucm.so.0 => /opt/apps/ucx/1.6.1/lib/libucm.so.0 (0x00002ae9d0786000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ae9d099b000)
libhdf5.so.8 => /scratch1/03808/nbhat4/hdf5-1.8.13/build/lib/libhdf5.so.8 (0x00002ae9d0b9f000)
libz.so.1 => /lib64/libz.so.1 (0x00002ae9d0e81000)
libpng15.so.15 => /lib64/libpng15.so.15 (0x00002ae9d1097000)
libgfortran.so.5 => /opt/apps/gcc/9.1.0/lib64/libgfortran.so.5 (0x00002ae9d12c2000)
libboost_filesystem.so.1.69.0 => /opt/apps/gcc9_1/boost/1.69/lib/libboost_filesystem.so.1.69.0 (0x00002ae9cfbb8000)
libboost_system.so.1.69.0 => /opt/apps/gcc9_1/boost/1.69/lib/libboost_system.so.1.69.0 (0x00002ae9cfbd6000)
libm.so.6 => /lib64/libm.so.6 (0x00002ae9d1551000)
libstdc++.so.6 => /opt/apps/gcc/9.1.0/lib64/libstdc++.so.6 (0x00002ae9d1853000)
libgcc_s.so.1 => /opt/apps/gcc/9.1.0/lib64/libgcc_s.so.1 (0x00002ae9cfbdc000)
libc.so.6 => /lib64/libc.so.6 (0x00002ae9d1a33000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae9d1e00000)
librt.so.1 => /lib64/librt.so.1 (0x00002ae9d201c000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00002ae9d2224000)
/lib64/ld-linux-x86-64.so.2 (0x00002ae9cfb72000)
libquadmath.so.0 => /opt/apps/gcc/9.1.0/lib/../lib64/libquadmath.so.0 (0x00002ae9cfbf8000)
My next question is how reproducible is the failure (hang or crash), and how quickly does it occur? Is it always during application startup, or does NAMD start making simulation progress before failure?
For the NAMD failure, specifically, here are the details:
• The failure is a hang that occurs with the runZIKV-50M-atoms.namd dataset.
• With the UCX layer, it is consistently reproducible (happens for every run) on 16 nodes.
• IIRC, I was not able to reproduce it on fewer nodes. That's typical of the bugs we've seen on Frontera (the probability of a hang increases with the number of nodes).
• With the UCX layer, it seems to happen soon after initialization, during the simulation, specifically during minimization. (I'm attaching a sample NAMD output that hangs.)
namd-output-16nodes-ucx-680032.txt
• With the MPI layer (for all backends: IMPI built on UCX, MPICH built on UCX, OMPI built on UCX, MPICH not built on UCX, OMPI not built on UCX), I've not seen the hang so far. The application is slower but makes progress through minimization. That said, I've also never seen it complete, because it always exceeds the maximum execution time. I could try reducing the dataset to confirm that MPI-layer NAMD builds complete without any trouble.
Does the failure occur with smaller NAMD data sets? https://twitter.com/RommieAmaro/status/1241880936641220609 says the sims they're running are 1.6M atoms, including fully explicit solvent
@jcphill Can we get a copy of a data set that's involved in the current work? Need to see if we can get reliable runs on the actual target, or focus debugging efforts on the issues it encounters
I only ran with the smaller datasets a few times and didn't see the hang.
However, @jcphill just mentioned on the #namd channel (on PPL slack) that he did see occasional hangs for the small short jobs as well.
Does NAMD have issues with non-SMP UCX?
For ChaNGa, with the UCX layer on Frontera: hangs appear to always occur either at "Domain decomposition" (essentially an "all-to-all" between elements of a Chare array to get data into the right geometrical domain) or at load balancing, where nearly all elements of a Chare array are moved to a different PE.
With MPI, the main issue on Frontera is that it is very slow. Let me know if you want me to produce a "minimum data set" that reproduces the problem with ChaNGa.
Expected speedups if we can fix this
Does NAMD have issues with non-SMP UCX?
I haven't tested this so far. Let me do this right now.
However, Enzo-P which was seeing a similar hang (for 4 nodes and more consistently for 64 nodes) was using non-SMP UCX.
If the same apparent hang arises with non-SMP, that's probably easier to isolate. OTOH, it may be easier to get stacks once it's hung with CharmDebug on an SMP build, since the comm thread will likely be answering CCD messages regardless of what the workers are doing.
Do we have a stack snapshot for a hung run? Collected with STAT, CharmDebug, manual gdb, whatever?
I'll post here when I determine the results from the non-SMP runs.
I had got stack traces for Enzo-P's hang on 4 nodes (224 processes), but not for NAMD. I can try getting them today.
All of Frontera's queues are down right now (it happens very often) and there seems to be an issue with their work filesystem as well.
I am going to rebuild all my binaries again and wait for the queues to come back up to test them.
I can't really make sense of the stack trace files posted for Enzo-P in #2635. In the process 0 file, I see multiple threads interleaved, and several repetitions of each thread. Is that where you were 'polling' backtraces after the hang?
Yeah, IIRC I manually logged in to each of the nodes and iterated over all the running processes to attach gdb and dump the stack traces.
@PhilMiller I tried NAMD non-SMP with both UCX and MPI machine layers. Both crash soon after startup because of failing to allocate memory.
Info: Entering startup at 1.11557e+06 s, 16272.4 MB of memory in use
Info: Startup phase 0 took 0.042994 s, 16272.4 MB of memory in use
FATAL ERROR: Memory allocation failed on processor 404.
------------- Processor 404 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Memory allocation failed on processor 404.
FATAL ERROR: Memory allocation failed on processor 406.
------------- Processor 406 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR:
@PhilMiller The queues on Frontera are back and it seems like the work directory is active again. I am able to get the 16-node UCX-smp job to hang. What would be a good method to get the stack traces? I think CharmDebug currently doesn't support non-netlrts-based layers, so we can either do GDB or STAT (looks like I'll have to install that from scratch; I don't see it pre-installed).
With gdb, you probably want something like
ssh node 'gdb -p $(pgrep namd_binary_name) -ex "thread apply all bt" -ex quit' > output.log
You'll have to be a bit fancier with multiple processes per node, scripting it to loop over them or whatever.
The big question is whether there are any processes doing anything other than spinning in idle.
I spent some time figuring out how to get the list of nodes for a running job; I now have a script in place that takes a job id and dumps the stack traces from all the processes across the remote hosts. I was able to get a NAMD job to hang on 16 physical nodes (4 procs per host or physical node, ppn=13: 13 workers, 1 comm thread) and then collected the stack traces. I've printed out info like "Host:0, Process:0, ProcessID: 156951" before each gdb -p output.
This is a fairly large file with stack traces from 64 processes. I'm yet to go over it.
From the output it looks like Thread 3 is always the comm thread, and there is an extra thread, Thread 16, which looks like a UCX-reserved thread that keeps polling for messages (@brminich correct me if I'm wrong).
Stack trace from Thread 16 of one of the processes:
Thread 16 (Thread 0x2b0594406700 (LWP 166871)):
#0 0x00002b0592074603 in epoll_wait () from /lib64/libc.so.6
#1 0x00002b059114be2c in ucs_async_thread_func (arg=0x1bf4c90) at async/thread.c:93
#2 0x00002b0590a8ddd5 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b059207402d in clone () from /lib64/libc.so.6
In parallel, I'm also trying to see if I can get a smaller job to hang (in order to make it easier to debug).
Sorry, I accidentally closed this.
Any progress or updates on this?
No real progress; we've yet to determine the cause of the bug (i.e., the hang).
I tried running 200 iterations of smaller jobs (stmv and apoa1) on 2-4 nodes and couldn't get any of them to hang on Frontera (tried with ldBalancer none and commOnly on as well).
The Enzo-P hang seems to be more consistently reproducible on a smaller set of nodes (4 nodes). For that reason, I got the stack traces from the Enzo-P hang (https://github.com/UIUC-PPL/charm/issues/2635); going through the 224-process run's stack traces, it seems like all PEs are sitting in the Converse scheduler loop. From that, it looks like there might be a message that never made it to the right message handler or entry method, and that's causing the hang.
Currently, I'm writing some small tests (bombard test, kNeighbors and randomTTL) in order to try and get them to hang on Frontera, so that it becomes easier to debug them.
Additionally, @lvkale, @ericjbohm and I also discussed that it might be worthwhile to have some infrastructure in Charm++ that can confirm message delivery (an ack for every sent message, to catch messages that are never received).
A data point from ChaNGa: I can get a hang on 2 nodes (SMP) with the dwf1.6144 dataset in the initial domain decomposition. The hang goes away if I reduce the size of the chare arrays. But I also notice that for smaller chare arrays the domain decomposition time grows faster than the array size:

array size | DD time
1024 | 108s
2048 | 288s
4096 | hangs (10 minute timeout)
I tried gprof (not sure what this reports in SMP mode) to attempt to track down where the time is going, and I got this for the 2048 case:

Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative    self               self     total
 time   seconds    seconds   calls   Ts/call  Ts/call  name
30.33 539.18 539.18 CcdCallBacks
27.07 1020.49 481.31 gethrctime
7.82 1159.52 139.03 CcdRaiseCondition
7.14 1286.45 126.93 CsdNextMessage
4.40 1364.65 78.20 CsdScheduleForever
2.19 1403.67 39.02 CmiNotifyStillIdle(CmiIdleState*)
1.91 1437.63 33.96 CmiGetNonLocal
1.62 1466.42 28.79 CmiCpuTimer
1.62 1495.21 28.79 inithrc
1.60 1523.64 28.43 CsdSchedulerState_new
1.52 1550.64 27.00 CmiGetNonLocalNodeQ()
1.49 1577.16 26.52 CqsDequeue
1.13 1597.30 20.14 CmiWallTimer
1.09 1616.73 19.43 CqsLength
1.06 1635.51 18.78 CqsEnqueueFifo
Thanks @trquinn. That's great. I've not gotten anything to hang on fewer than 3 nodes. I can try reproducing this, getting stack traces for that run, and seeing if anything suspicious shows up.
From the gprof output, it looks like most of the time is spent in the idle scheduler loop. (A lot of those Ccd, Csd, and timer calls are made while the scheduler is just checking different queues for an incoming message.)
How would you characterize the messaging pattern during the domain decomposition? Is it small and bursty?
This is a lot of very small messages all at once; essentially an "all-to-all" among the chares. To reproduce:

export UCX_ZCOPY_THRESH=-1
ppn=55
ibrun ./ChaNGa.smp +useckloop +bench -wall 10 -v 1 -p 4096 -maxbal 10.0 ++ppn $ppn +setcpuaffinity +commap 0 +pemap 2-54:2,1-55:2 dwf1.6144.param >& DIAG.$SLURM_JOB_NUM_NODES.$qcount
Lowering the "-p 4096" will get it to run (but on other machines, I run this benchmark with "-p 32768").
I did some projections. Here is a timeline screenshot: TimelineScreenshot.pdf. Message sizes run from 880 bytes to a few tens of kbytes. Note there are latencies of order 300 seconds! Even when PEs are on the same node!
Updated: TimelineScreenshot.pdf
Here is a screenshot that includes comm tracing. Note that the comm threads are swamped. Something else that is peculiar: the pink entries are updateLocation messages. No load balancing has been done, so why is Charm++ unaware of the array element placement?
That second PDF is just showing up blank for me, but I'll look into the location messages. Does ChaNGa do anything with demand creation, or dynamic insertion? Those can also potentially trigger updateLocation messages.
Projections "Save all timelines to image" seems not to work. I've updated the PDF to be just visible PEs. (Maybe getting off topic here): The code for the chare array creation is:
CkArrayOptions opts(numTreePieces);
CProxy_BlockMap myMap = CProxy_BlockMap::ckNew();
opts.setMap(myMap);
CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(opts);
Updated: I forgot a key line: the declaration of "opts" which I just added. 'numTreePieces' is an integer read from the command line.
So based on this, your array is using dynamic insertion, since the array is initially created with 0 elements. The elements are created at some later point, either via demand creation or via insert calls. This can result in updateLocation calls when a PE first tries to message an element it doesn't know about. It needs to know the element's ID to pack into the message, but all it has is the index. So it sends a location request to the homePe, and the homePe replies with an updateLocation message containing the ID of the element.
For most applications, the fix for this is to set bounds on the array's index space. If bounds are set, then Charm++ can automatically get an ID from an index with a simple bit-packing scheme. However, if I recall correctly, ChaNGa uses a user-defined array index? We don't currently have a solution for that other than the default ID creation scheme, which requires these location request/reply messages.
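For illustration only (this is not the actual Charm++ ID layout), the bit-packing idea amounts to mixed-radix arithmetic over the bounded index space, so any PE can compute an element's ID locally with no location request/reply round trip:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: with known bounds by, bz on a 3D index space,
// (x, y, z) folds into a unique integer ID and back. No PE ever needs
// to ask the homePe what an element's ID is.
uint64_t packId(uint32_t x, uint32_t y, uint32_t z, uint32_t by, uint32_t bz) {
    return (uint64_t(x) * by + y) * bz + z;
}

void unpackId(uint64_t id, uint32_t by, uint32_t bz,
              uint32_t& x, uint32_t& y, uint32_t& z) {
    z = uint32_t(id % bz);
    id /= bz;
    y = uint32_t(id % by);
    x = uint32_t(id / by);
}
```

With a user-defined index (as in ChaNGa) there are no bounds to fold over, which is why the runtime falls back to the request/reply protocol described above.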
As a control, I ran the same simulation with the MPI SMP Charm++ build (Intel MPI). The resulting timelines are attached. The comm thread is still being hammered; the difference is that the total time is a factor of 10 smaller than with UCX! One hypothesis, therefore, is that message processing by the UCX comm thread is very computationally expensive, possibly exacerbated when there are a lot of messages to process.
Hmm, that's interesting. Without dynamic creation or migration, the only other thing I can think of that would generate updateLocation messages is array messages that take multiple bounces to get to their destination, which also shouldn't be happening here.
There's also a parameter, messageBufferingThreshold, which could cause large messages to trigger updateLocation messages, but AFAIK that parameter is off by default, unless UCX handles it differently?
UCX doesn't know anything about the "_messageBufferingThreshold" inside CkLocMgr.
If you can insert a breakpoint on CkLocMgr::multiHop() (defined in charm/src/ck-core/cklocation.C), that would tell us if there's something weird going on there.
Yeah I just talked to @nitbhat about this, he's going to try the example Tom sent above to see if he can reproduce.
@trquinn how many array elements (numTreePieces) are created in the example above?
@trquinn: these are the steps I followed:
Building charm
module load impi/19.0.5 && which mpicc && ./build ChaNGa ucx-linux-x86_64 --enable-error-checking --suffix=changa-prod --with-production --basedir=/work/03808/nbhat4/frontera/ucx/build_1.6.1 -j16
Building ChaNGa
export CHARM_DIR=/work/03808/nbhat4/frontera/charm/ucx-linux-x86_64-smp-changa-prod
./configure
make -j56
Job script
#!/bin/bash
#SBATCH -J changa_2nodes
#SBATCH -p normal
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH -n 2
#SBATCH --ntasks-per-node=1
#SBATCH -A ASC20007
#SBATCH -o ./results/changa-output-2nodes-ucx-prod-dwf-%A.out
cd /scratch1/03808/nbhat4/changa/
module load impi
which mpirun
export UCX_ZCOPY_THRESH=-1
ppn=54
echo "2 nodes, 2 ranks, 55 ppn, ucx"
ibrun ./ChaNGa.prod.smp +useckloop +bench -wall 10 -v 1 -p 4096 -maxbal 10.0 ++ppn $ppn +setcpuaffinity +commap 0 +pemap 2-54:2,1-55:2 dwf1.6144.param
echo "2 nodes, 2 ranks, 55 ppn, ucx"
With this, I saw that in 2 mins or so, the output reaches:

Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
And then, after about 15 mins, I see a crash due to running out of memory.
Loading particles ... trying Tipsy ... took 1.572218 seconds.
N: 51794908
Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
total 1126.44 seconds.
Initial load balancing ... Orb3dLB_notopo: Step 0
numActiveObjects: 4096, numInactiveObjects: 0
active PROC range: 0 to 107
Migrating all: numActiveObjects: 4096, numInactiveObjects: 0
[Orb3dLB_notopo] sorting
***
Orb3dLB_notopo stats: maxObjLoad 0.000795
Orb3dLB_notopo stats: minWall 0.521159 maxWall 0.540816 avgWall 0.522287 maxWall/avgWall 1.035475
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.506066 avgIdle 0.495555 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.007944 maxPred 0.011722 avgPred 0.010661 maxPred/avgPred 1.099506
Orb3dLB_notopo stats: minPiece 37.000000 maxPiece 39.000000 avgPiece 37.925926 maxPiece/avgPiece 1.028320
Orb3dLB_notopo stats: minBg 0.010114 maxBg 0.540137 avgBg 0.016071 maxBg/avgBg 33.608630
Orb3dLB_notopo stats: orb migrated 4080 refine migrated 0 objects
Building trees ... took 0.498366 seconds.
Maximum memory: -11 MB
Calculating gravity (tree bucket, theta = 0.700000) ... [1586382996.141156] [c191-011:425253:0] rcache.c:123 UCX ERROR mmmap(size=36864) failed: Cannot allocate memory
[1586382996.141649] [c191-011:425253:0] mpool.c:187 UCX ERROR Failed to allocate memory pool (name=rcache_inv_mp) chunk: Out of memory
[1586382996.141656] [c191-011:425253:0] rcache.c:336 UCX ERROR Failed to allocate invalidation entry for 0x2bcf40000000..0x2bcf44000000, data corruption may occur
[1586382996.141706] [c191-011:425253:0] rcache.c:123 UCX ERROR mmmap(size=36864) failed: Cannot allocate memory
So, I'm not able to see the hang, but rather a crash that looks like a hang because it prints nothing for 15 minutes or so. Let me know if I'm doing something wrong.
I'll try determining the cause of the `updateLocation` messages.
Also, I used ChaNGa's master branch. Let me know if you were using a specific branch.
It looks like 4096 tree pieces are created.
Note that @nitbhat took over 1000 seconds to do domain decomposition with 4096 pieces. My projections were done with 2048 pieces, and took only (?) 300 seconds. The MPI version took 30 seconds. Note that I put a CkExit() in after domain decomposition so I could focus on the DD problem. I have a hunch that the later memory problem is also due to the message bottleneck we are seeing with DD. If you also want to stop after DD, make this change:
diff --git a/ParallelGravity.cpp b/ParallelGravity.cpp
index 4d13d2a..3d045e0 100644
--- a/ParallelGravity.cpp
+++ b/ParallelGravity.cpp
@@ -1662,6 +1662,7 @@ void Main::domainDecomp(int iPhase)
timings[iPhase].tDD += tDD;
CkPrintf("total %g seconds.\n", tDD);
+ CkExit();
if(verbosity && !bDoDD)
CkPrintf("Skipped DD\n");
}
In my charm, I added the following prints
login4.frontera(1024)$ git diff
diff --git a/src/ck-core/cklocation.C b/src/ck-core/cklocation.C
index 1e38fbd45..20f92ff68 100644
--- a/src/ck-core/cklocation.C
+++ b/src/ck-core/cklocation.C
@@ -2403,6 +2403,7 @@ void CkLocMgr::requestLocation(CmiUInt8 id, const int peToTell,
void CkLocMgr::updateLocation(const CkArrayIndex &idx, CmiUInt8 id, int nowOnPe) {
DEBN(("%d updateLocation for %s on %d\n", CkMyPe(), idx2str(idx), nowOnPe));
+ CmiPrintf("[%d][%d][%d] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for %s on %d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), nowOnPe);
inform(idx, id, nowOnPe);
deliverAnyBufferedMsgs(id, bufferedRemoteMsgs);
}
@@ -2769,6 +2770,7 @@ void CkLocMgr::multiHop(CkArrayMessage *msg)
{
CK_MAGICNUMBER_CHECK
int srcPe=msg->array_getSrcPe();
+ CmiPrintf("[%d][%d][%d] ############################### multiHop and srcPe is %d\n", CkMyPe(), CkMyNode(), CkMyRank(), srcPe);
if (srcPe==CkMyPe())
DEB((AA "Odd routing: local element %u is %d hops away!\n" AB, msg->array_element_id(),msg->array_hops()));
else
@@ -3028,6 +3030,7 @@ void CkLocMgr::metaLBCallLB(CkLocRec *rec) {
*/
void CkLocMgr::immigrate(CkArrayElementMigrateMessage *msg)
{
+ CmiPrintf("[%d][%d][%d] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ immigrate\n", CkMyPe(), CkMyNode(), CkMyRank());
const CkArrayIndex &idx=msg->idx;
PUP::fromMem p(msg->packData);
In my output, I initially see a lot of updateLocation prints (I haven't included all of them here):
Charm++> Running on 2 hosts (2 sockets x 28 cores x 1 PUs = 56-way SMP)
Charm++> cpu topology info is gathered in 0.018 seconds.
WARNING: +useckloop is a command line argument beginning with a '+' but was not parsed by the RTS.
WARNING: +bench is a command line argument beginning with a '+' but was not parsed by the RTS.
If any of the above arguments were intended for the RTS you may need to recompile Charm++ with different options.
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
Using CkLoop 1
ChaNGa version 3.4, commit v3.4-57-g3f4947c
Running on 108 processors/ 2 nodes with 4096 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)
Created 4096 pieces of tree
[0][0][0] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for on 17
[6][0][6] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 6 on 0
[0][0][0] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 19 on 51
[17][0][17] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 17 on 0
But I don't see any multiHop messages until roughly 4000 lines of updateLocation prints later:
[21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 3909 on 102
[21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 2505 on 65
[21][0][21] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation for 3153 on 82
took 1.108604 seconds.
N: 51794908
Input file, Time:0.239443 Redshift:0.341011 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340548 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
[82][1][28] ############################### multiHop and srcPe is 88
[82][1][28] ############################### multiHop and srcPe is 86
[82][1][28] ############################### multiHop and srcPe is 67
So, it looks like multiHop is not involved in the code path involving updateLocation.
I suspect one flow in the UCX ML; let me prepare a patch to try.
On @jcphill's suggestion, I made the following changes to NAMD to trigger a QD before the minimization phase (where the hang is seen).
diff --git a/src/Controller.C b/src/Controller.C
index f5d8d172..bac79b4e 100644
--- a/src/Controller.C
+++ b/src/Controller.C
@@ -284,6 +284,7 @@ void Controller::algorithm(void)
int scriptSeq = 0;
BackEnd::awaken();
while ( (scriptTask = broadcast->scriptBarrier.get(scriptSeq++)) != SCRIPT_END) {
+ BackEnd::awaken();
switch ( scriptTask ) {
case SCRIPT_OUTPUT:
enqueueCollections(FILE_OUTPUT);
@@ -403,7 +404,6 @@ void Controller::algorithm(void)
default:
NAMD_bug("Unknown task in Controller::algorithm");
}
- BackEnd::awaken();
}
enqueueCollections(END_OF_RUN);
outputExtendedSystem(END_OF_RUN);
diff --git a/src/Node.C b/src/Node.C
index d4f70ac8..42e762ed 100644
--- a/src/Node.C
+++ b/src/Node.C
@@ -1335,11 +1335,13 @@ void Node::enableExitScheduler(void) {
if ( CkMyPe() ) {
sendEnableExitScheduler();
} else {
+ CmiPrintf("[%d][%d][%d] CkStartQd\n", CmiMyPe(), CmiMyNode(), CmiMyRank());
CkStartQD(CkIndex_Node::exitScheduler(), &thishandle);
}
}
void Node::exitScheduler(void) {
+ CmiPrintf("[%d][%d][%d] Node::exitScheduler\n", CmiMyPe(), CmiMyNode(), CmiMyRank());
//CmiPrintf("exitScheduler %d\n",CkMyPe());
CsdExitScheduler();
}
On running this without load balancing, I see that the hang is reached, but quiescence is not.
NAMD output: namd-output-16nodes-smp-ucx-prod-jim4-723992.out.txt -> look at the CkStartQd printed before minimization.
This strengthens the hypothesis of there being missing/undelivered message(s).
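For context on why an undelivered message would block quiescence: QD can only fire once every PE is idle and the global sent and received message counts match. A toy model of that counting invariant (not the actual Charm++ QD implementation):

```cpp
#include <cassert>
#include <vector>

// Toy model of the counting invariant behind quiescence detection.
// One lost message keeps sent > recvd forever, so the QD callback
// never fires -- consistent with CkStartQD being issued here but
// Node::exitScheduler never being reached.
struct PeCounters {
    long sent = 0;
    long recvd = 0;
    bool idle = true;
};

bool quiescent(const std::vector<PeCounters>& pes) {
    long sent = 0, recvd = 0;
    for (const PeCounters& p : pes) {
        if (!p.idle) return false;  // someone is still computing
        sent += p.sent;
        recvd += p.recvd;
    }
    return sent == recvd;  // unequal counts => a message is in flight or lost
}
```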
@epmikida I saw that CkLocMgr::informHome calls updateLocation in the beginning (before Initial Domain decomposition is printed) and CkLocMgr::multiHop calls updateLocation after Initial Domain decomposition is printed. Does that indicate anything weird going on?
Hmm... that's weird. informHome is called during migration and creation of elements, to let the homePe know where the element is. So in this case it's called via element creation. But the only time it sends updateLocation messages is when the element is not created on its homePe. I don't know why that would be the case here, since it is just doing static creation? Unless there's something weird in the map?
@trquinn, @nitbhat, Can you please try running with #2799 patch? I'd expect it to fix UCX ML slowness in certain cases (when lots of small messages are arriving)
Has anyone checked if there is a bug in PE mapping on Frontera? If all processes do not agree on where some PE is, you might get this sort of behavior.
@epmikida:
I added the following prints
Inside createLocal:
CmiPrintf("[%d][%d][%d] ***************** CkLocMgr::createLocal, calling informHome id:%s and now on pe:%d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), CkMyPe());
Inside updateLocation:
CmiPrintf("[%d][%d][%d] &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& updateLocation 1 for id:%s on %d\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), nowOnPe);
Inside informHome:
CmiPrintf("[%d][%d][%d] ***************** CkLocMgr::informHome, calling updateLocation 1 to move id:%s to %d (nowOnPe:%d)\n", CkMyPe(), CkMyNode(), CkMyRank(), idx2str(idx), home, nowOnPe);
And with that (and some other non-relevant prints), I see the following output: changa-output-2nodes-ucx-debug-dwf-725634.out.txt
It's the createLocal calls that invoke informHome and finally updateLocation.
From the output I see that 4059 of the 4096 chares are calling updateLocation from informHome (so 4059 chares meet the condition home!=CkMyPe() && home!=nowOnPe in informHome).
Yeah, this is really weird. Based on the output, 4059 chares think they are being created on a PE other than their home, so they are sending an updateLocation to their home so that it knows where the element currently is.
Possibilities for why this might happen:
After some manual inspection of the output, the homePe seems correct for each element, i.e., element 1 thinks its homePe is 1, element 2 thinks its homePe is 2, etc. So the homePe calculation of the BlockMap is probably not the culprit.
One other oddity in the output, though, is that there were duplicates of some elements (e.g., element 4049 appears to be created on two different PEs). There are also some elements created that shouldn't be. For example, the output has this line:
[38][0][38] ***************** CkLocMgr::informHome, calling updateLocation 1 to move id:9199 to 45 (nowOnPe:38)
These could possibly just be mangled output? But still weird.
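As a back-of-the-envelope check on the off-home-creation count, here is a toy model. Both the contiguous block-map formula and the round-robin creation order are assumptions, not ChaNGa's actual code; under them, with this run's 108 PEs and 4096 pieces, almost every element ends up created away from its home:

```cpp
#include <cassert>

// Hypothetical model: homes assigned in contiguous blocks of the index
// space, creation assumed round-robin over PEs. Each off-home creation
// would send an updateLocation to the home PE.
int blockHome(int idx, int numElements, int numPes) {
    int block = (numElements + numPes - 1) / numPes;  // ceil(numElements/numPes)
    return idx / block;
}

int countOffHome(int numElements, int numPes) {
    int off = 0;
    for (int i = 0; i < numElements; ++i) {
        int createdOn = i % numPes;  // assumed creation order
        if (blockHome(i, numElements, numPes) != createdOn) ++off;
    }
    return off;
}
```

Interestingly, countOffHome(4096, 108) evaluates to 4059, the same count as in the log, though that agreement could be coincidental given how much is assumed here.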
I've attached the output from ChaNGa when running it on the MPI layer instead of UCX. I still see 4059 chares that are calling updateLocation.
changa-output-2nodes-mpi-debug-dwf-728723.out.txt
Yeah, I'm not sure how that id:9199 showed up. (I'm hoping it was just mangled output, but I'll watch out for weird ids.)
That one also has a number of weird IDs at the end, and some duplicated chares. Another odd thing is that the duplication doesn't show up in the createLocal call, just the informHome call, and the two calls don't seem to agree on the homePe of the object.
Steps to reproduce:
• Download charm and build ucx-linux-x86_64-smp with charm-6.10.0
• Download NAMD and build Linux-x86_64-g++
• Run NAMD on 16 nodes with the following commands

Run command (from the script):
./Linux-x86_64-g++/charmrun +p 832 ./Linux-x86_64-g++/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 ./runZIKV-50M-atoms.namd ++mpiexec ++remote-shell ibrun
Job submit command:
sbatch --job-name=NAMD --nodes=16 --ntasks=64 --time=00:20:00 --partition=normal
This issue seems to be related to the other UCX issues on Frontera: https://github.com/UIUC-PPL/charm/issues/2635 https://github.com/UIUC-PPL/charm/issues/2636