nitbhat closed this issue 4 years ago
Has anyone checked if there is a bug in PE mapping on Frontera? If all processes do not agree on where some PE is, you might get this sort of behavior.
On Fri, Apr 10, 2020, Mikhail Brinskiy wrote: @trquinn, @nitbhat, can you please try running with the #2799 patch? I'd expect it to fix UCX ML slowness in certain cases (when lots of small messages are arriving).
@ericjbohm when you say PE-mapping, you're not referring to the +pemap argument, right? Which data structures can I probe to check if the PE-mapping is consistent across all PEs?
On @ericmikida's suggestion, I tried running ChaNGa with maps other than BlockMap, i.e., RRMap and DefaultMap, and I couldn't see any updateLocation messages in the logs (for both MPI and UCX). So it definitely looks like BlockMap is buggy.
With RRMap and DefaultMap, the application runs to completion and doesn't crash (I'm still CkExiting after DD). I haven't tried Mikhail's latest patch, but without it UCX takes much longer for DD:
MPI, DefaultMap: 14.172 seconds
MPI, RRMap: 4.81116 seconds
UCX, DefaultMap: 187.934 seconds
UCX, RRMap: 185.153 seconds
The bug we are seeing in ChaNGa looks to be due to a somewhat major bug in BlockMap. The populateInitial and procNum functions do not agree with each other, so the initial creation of elements puts the majority of them on PEs that are not considered their home according to the runtime. This shouldn't cause correctness issues, but it will cause the huge flood of updateLocation messages that is causing the slowdown here. It is not UCX/Frontera specific.
The BlockMap bug can absolutely cause correctness issues if it's even worse and populateInitial doesn't produce unique results, i.e., multiple PEs both insert the same element.
With off-line tests, it wouldn't be too hard to catch this sort of thing. With online tests, it's easy for an assert to catch the case where a PE with a particular element on it gets an updateLocation or informHome message saying that same element is elsewhere.
Actually, one can do better in a debug build: just have the code that calls/is-called-by populateInitial check that procNum of every object it inserts matches the PE it's running on!
Yes, that's true about populateInitial, although looking at it, it seems to be a perfectly valid populateInitial function, just not one that's compatible with the procNum.
For your second suggestion, the default populateInitial does just that: it creates an element on a PE iff procNum for that index returns that PE. The issue was that BlockMap decided to override populateInitial and not procNum. My best guess as to why is that populateInitial gets CkArrayOptions passed to it so you can calculate the bin size. But other maps use different methods to deal with this.
Overall, I think the mapping infrastructure needs a bit of an update. Its usage is pretty inconsistent.
I updated ucx to a point in the master branch (https://github.com/openucx/ucx/commit/b72cd117d474a3963640db72a1ac4de4e7442c81) and I see that NAMD doesn't hang anymore (as it did with ucx 1.8.0).
In my experiment, I ran about 30 16-node jobs with the ZIKV simulation with a reduced number of steps (minimize 200, run 1000). This result contrasts sharply with ucx 1.8.0, where, with the same input, NAMD hung in every run.
I also tried running 1stmv2fs.namd (on 4 and 16 nodes), 20stmv2fs.namd (on 4 and 16 nodes), and 210stmv2fs.namd (on 16 and 64 nodes). All of these except 210stmv2fs.namd on 16 nodes ran successfully to completion. The 16-node run with 210stmv2fs.namd crashed because it looks like it ran out of memory.
^ A similar result was seen for Enzo/P as mentioned here.
I have yet to git bisect the ucx repo to determine which commit(s) make the hang stop occurring.
On bisecting, I found that this commit (https://github.com/openucx/ucx/commit/7147812f5b9449dc1d8b3ebb0ef70d65dca4a8d9) seems to be the one that fixes the hang, and the commit message mentions fixing a 'deadlock'. Do you think this is applicable to what was seen on Frontera? If so, can the bug be characterized as the deadlock described in the commit message? @brminich
On testing NAMD with the latest UCX release candidate v1.8.1-rc1, I see that NAMD continues to hang (as it did previously with v1.6.1 and v1.8.0). However, on testing with the current master, NAMD doesn't hang. Similar behavior is seen with Enzo as well, i.e., Enzo hangs with ucx 1.8.1-rc1 but not with the current master.
It looks like UCX releases diverge quite a bit from their master branch. @brminich do you know of any bug fixes that didn't make it into v1.8.1?
@nitbhat, there are quite a lot of them. Can you please check whether the hang is gone when you set UCX_RC_FC_ENABLE=n?
BTW, does setting UCX_MAX_RNDV_RAILS=1 help with 1.8.1-rc1? This could help identify the exact issue.
@brminich: I tried both UCX_RC_FC_ENABLE=n and UCX_MAX_RNDV_RAILS=1 with 1.8.1-rc1. Enzo and NAMD hang in both cases, i.e., the bug is not fixed with either of those options.
I verified that this issue is solved when using UCX v1.9.0-rc1 (https://github.com/openucx/ucx/releases/tag/v1.9.0-rc1). The fix should be available in the upcoming UCX release.
Steps to reproduce:
1. Download charm-6.10.0 and build the ucx-linux-x86_64-smp build.
2. Download NAMD and build Linux-x86_64-g++.
3. Run NAMD on 16 nodes with the following commands.
Run command (from the script):
./Linux-x86_64-g++/charmrun +p 832 ./Linux-x86_64-g++/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 ./runZIKV-50M-atoms.namd ++mpiexec ++remote-shell ibrun
Job submit command:
sbatch --job-name=NAMD --nodes=16 --ntasks=64 --time=00:20:00 --partition=normal
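A hedged sketch of the build commands for the first two steps, assuming standard Charm++ and NAMD build conventions; the exact paths, the --with-production flag, and the -j8 parallelism are assumptions, and only the target names (ucx-linux-x86_64-smp, Linux-x86_64-g++) come from the steps above. These require the cluster toolchain and are not runnable elsewhere.

```shell
# Step 1: build charm-6.10.0 with the UCX SMP machine layer
# (paths and flags are assumptions for illustration).
cd charm-6.10.0
./build charm++ ucx-linux-x86_64 smp --with-production -j8

# Step 2: build NAMD against that Charm++ build; the --charm-arch
# value should match the build directory produced above.
cd ../NAMD
./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-smp
cd Linux-x86_64-g++ && make -j8
```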
This issue seems to be related to the other UCX issues on Frontera: https://github.com/UIUC-PPL/charm/issues/2635 https://github.com/UIUC-PPL/charm/issues/2636