nitbhat closed this issue 4 years ago
Has anyone checked if there is a bug in PE mapping on Frontera? If all processes do not agree on where some PE is, you might get this sort of behavior.
On Fri, Apr 10, 2020, Mikhail Brinskiy wrote: @trquinn, @nitbhat, can you please try running with the #2799 patch? I'd expect it to fix UCX ML slowness in certain cases (when lots of small messages are arriving).
@ericjbohm when you say PE-mapping, you're not referring to the +pemap argument, right? Which data structures can I probe to check if the PE-mapping is consistent across all PEs?
On @ericmikida's suggestion, I tried running ChaNGa with maps other than BlockMap, i.e., RRMap and DefaultMap, and I couldn't see any updateLocation messages in the logs (for both MPI and UCX). So it definitely looks like BlockMap is buggy.
With RRMap and DefaultMap, the application runs to completion and doesn't crash (I'm still CkExiting after DD). I haven't tried Mikhail's latest patch, but without it UCX takes much longer for DD:
MPI, DefaultMap: 14.172 seconds
MPI, RRMap: 4.81116 seconds
UCX, DefaultMap: 187.934 seconds
UCX, RRMap: 185.153 seconds
The bug we are seeing in ChaNGa looks to be due to a somewhat major bug in BlockMap. The populateInitial and procNum functions do not agree with each other, so the initial creation of elements puts the majority of them on PEs that are not considered their home according to the runtime. This shouldn't cause correctness issues, but it will cause the huge flood of updateLocation messages that is causing the slowdown here. It is not UCX/Frontera specific.
The BlockMap bug can absolutely cause correctness issues if it's even worse and populateInitial doesn't produce unique results, i.e., multiple PEs both insert the same element.
With off-line tests, it wouldn't be too hard to catch this sort of thing. With online tests, it's easy for an assert to catch the case where a PE with a particular element on it gets an updateLocation or informHome message saying that same element is elsewhere.
Actually, one can do better in a debug build: just have the code that calls/is-called-by populateInitial check that procNum of every object it inserts matches the PE it's running on!
Yes, that's true about populateInitial, although looking at it, it seems to be a perfectly valid populateInitial function, just not one that's compatible with the procNum.
For your second suggestion, the default populateInitial does just that: it creates an element on a PE iff procNum for that index returns that PE. The issue was that BlockMap decided to override populateInitial and not procNum. My best guess as to why is that populateInitial gets CkArrayOptions passed to it so you can calculate the bin size. But other maps use different methods to deal with this.
Overall, I think the mapping infrastructure needs a bit of an update. Its usage is pretty inconsistent.
I updated ucx to a point in the master branch (https://github.com/openucx/ucx/commit/b72cd117d474a3963640db72a1ac4de4e7442c81) and I see that NAMD doesn't hang anymore (as it did with ucx 1.8.0).
In my experiment, I ran about 30 16-node jobs with the ZIKV simulation with a reduced number of steps (minimize 200, run 1000). This result contrasts sharply with ucx 1.8.0, where, with the same input, NAMD hung in every run.
I also tried running 1stmv2fs.namd (on 4 and 16 nodes), 20stmv2fs.namd (on 4 and 16 nodes), and 210stmv2fs.namd (on 16 and 64 nodes). All of these except 210stmv2fs.namd on 16 nodes ran successfully to completion. The 16-node run with 210stmv2fs.namd crashed because it looks like it ran out of memory.
^ A similar result was seen for Enzo/P as mentioned here.
I have yet to git bisect the ucx repo to determine which commit(s) make the hang stop occurring.
On bisecting, I found that this commit (https://github.com/openucx/ucx/commit/7147812f5b9449dc1d8b3ebb0ef70d65dca4a8d9) seems to be the one that fixes the hang, and the commit message mentions fixing a 'deadlock'. Do you think this is applicable to what was seen on Frontera? If so, can the bug be characterized as the deadlock described in the commit message? @brminich
On testing NAMD with the latest UCX release candidate v1.8.1-rc1, I see that NAMD continues to hang (as it did previously with v1.6.1 and v1.8.0). However, on testing with the current master, NAMD doesn't hang. Similar behavior is seen with Enzo as well, i.e., Enzo hangs with ucx 1.8.1-rc1 but not with the current master.
It looks like UCX releases diverge quite a bit from their master branch. @brminich do you know of any bug fixes that didn't make it into v1.8.1?
@nitbhat, there are quite a lot of them. Can you please check whether the hang is gone when you set UCX_RC_FC_ENABLE=n?
BTW, does setting UCX_MAX_RNDV_RAILS=1 help with 1.8.1-rc1? This could help identify the exact issue.
@brminich: I tried both UCX_RC_FC_ENABLE=n and UCX_MAX_RNDV_RAILS=1 with 1.8.1-rc1. Enzo and NAMD hang in both cases, i.e., the bug is not fixed with either of those options.
I verified that this issue is solved when using UCX v1.9.0-rc1 (https://github.com/openucx/ucx/releases/tag/v1.9.0-rc1). The fix should be available in the upcoming UCX release.
Steps to reproduce:
1. Download charm-6.10.0 and build the ucx-linux-x86_64-smp build.
2. Download NAMD and build Linux-x86_64-g++.
3. Run NAMD on 16 nodes with the following commands.
Run command (from the script):
./Linux-x86_64-g++/charmrun +p 832 ./Linux-x86_64-g++/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 ./runZIKV-50M-atoms.namd ++mpiexec ++remote-shell ibrun
Job submit command:
sbatch --job-name=NAMD --nodes=16 --ntasks=64 --time=00:20:00 --partition=normal
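A hedged sketch of the build commands for the first two steps, assuming standard Charm++ and NAMD build conventions; the exact paths, the --with-production flag, and the -j8 parallelism are assumptions, and only the target names (ucx-linux-x86_64-smp, Linux-x86_64-g++) come from the steps above. These require the cluster toolchain and are not runnable elsewhere.

```shell
# Step 1: build charm-6.10.0 with the UCX SMP machine layer
# (paths and flags are assumptions for illustration).
cd charm-6.10.0
./build charm++ ucx-linux-x86_64 smp --with-production -j8

# Step 2: build NAMD against that Charm++ build; the --charm-arch
# value should match the build directory produced above.
cd ../NAMD
./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-smp
cd Linux-x86_64-g++ && make -j8
```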
This issue seems to be related to the other UCX issues on Frontera: https://github.com/UIUC-PPL/charm/issues/2635 https://github.com/UIUC-PPL/charm/issues/2636