Open stwhite91 opened 6 years ago
Original date: 2019-03-14 19:30:33
Michael, make at least some update (test with a simple program on a couple of machines) by next week.
Original date: 2019-03-28 17:25:17
Tried to replicate using fib on various platforms and machines:
Original date: 2019-03-28 17:30:54
What do you mean by it wouldn't build for netlrts and mpi? Charm didn't build, or the example didn't build? Can you post the output?
Original date: 2019-03-28 17:41:04
In both cases, charm failed to build. In the first (netlrts) case adding a fixed width priority (e.g. int) enabled charm to build. -I don't have the output but I can recreate it and post it here.-
Here's the build line from comet:
./build charm++ mpi-linux-x86_64 smp --enable-randomized-msgq -j8
And the build error:
checking "whether C++ compiler supports C++11 with '-h std=c++11'"... "no"
Charm++ requires C++11 support, but doesn't know the flag to enable it
For Intel's compiler please see
https://charm.cs.illinois.edu/redmine/issues/1560
about making a suitable version of gcc/g++/libstdc++ available
For Blue Gene/Q please use the Clang compiler
*** Please find detailed output in tmp/charmconfig.out ***
gmake[1]: Leaving directory `/home/mprobson/charm/mpi-linux-x86_64-smp/tmp'
gmake: *** [headers] Error 2
-------------------------------------------------
Charm++ NOT BUILT. Either cd into mpi-linux-x86_64-smp/tmp and try
to resolve the problems yourself, visit
http://charm.cs.illinois.edu/
for more information. Otherwise, email the developers at charm`cs.illinois.edu
Turns out for courage it was also dying on the priotype incompatibility. Changing it from the default of bitvec to int fixes the build and replicates the hang.
Original date: 2019-03-28 21:25:20
With some further testing, this actually appears to be an error due to the combination of SMP mode and non-bitvec/fixed length priorities, which we are forced to use with randomized queues.
Adding some detail here. The hang also shows up in the (currently in review) example program for within node broadcasts: https://charm.cs.illinois.edu/gerrit/c/charm/+/5068
The hang can be replicated on my linux machine using the following charm build:
./build LIBS netlrts-linux-x86_64 smp --enable-randomized-msgq --with-prio-type=int
In this case though, the node group seems to get created just fine. The printout in the nodegroup constructor prints, and control returns to the main c-tor, which completes, and we hit the runTests method in main, which means charm initialization was able to fully complete. The hang occurs when sending any messages to the nodegroup.
I tried both broadcasts and changing the tests to send p2p messages, and the hang still occurs. The prints in the nodegroup methods never appear, and even adding QD doesn't side-step the hang. So the messages are clearly being sent, and Charm++ is aware it is waiting for them, but for some reason they never get received.
It replicates with +p8 ++ppn8, +p8 ++ppn4, and +p1 ++ppn1 (as well as probably others), so it doesn't matter if you have multiple nodes, a single node, or even a single core.
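One way to check several launch configurations without waiting indefinitely on a hung job is to wrap each run in a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is available. The `check_hang` helper, the 60-second limit, and the `./charmrun ... ./fib 20` invocation in the comments are hypothetical and not taken from this thread.

```shell
#!/bin/sh
# check_hang LIMIT CMD [ARGS...]
# Runs CMD under a time limit and reports whether it finished,
# failed, or had to be killed. Hypothetical helper, not part of Charm++.
check_hang() {
  limit="$1"
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with status 124 when the limit is reached
    echo "hang: $*"
  elif [ "$rc" -eq 0 ]; then
    echo "ok: $*"
  else
    echo "failed($rc): $*"
  fi
}

# Example use (assumed paths and arguments; adjust to the local build):
# for args in "+p8 ++ppn8" "+p8 ++ppn4" "+p1 ++ppn1"; do
#   check_hang 60 ./charmrun $args ./fib 20
# done
```

A run reported as `hang:` only means the time limit was reached, so the limit should be well above the normal runtime of the example.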
Per discussion with @epmikida and @stwhite91 on Slack, I implemented randomized queues to work in the default (no priority) case. The sample application fib works fine, i.e. does not hang, when compiled with smp and randomized queues but no priority. So it is pretty clear that this is a bad interaction between smp and fixed length priorities, e.g. double and int. This includes building charm with -DCMK_NO_MSG_PRIOS=1 passed directly as part of the ./build line (thanks @evan-charmworks).
Can you make a pull request to merge that change in? I see no reason to limit randomized queues to work only with priorities.
Sure, but I need to figure out how to set up configure correctly. Right now it requires prio != bitvec, but I explicitly disable priorities by passing -DCMK_NO_MSG_PRIOS=1. Unfortunately, getting rid of the check in configure isn't the right thing. Maybe adding a new flag where prio=no or something? Suggestions?
Interestingly, fib hangs in singleton chare creation even if I move it outside the main constructor, e.g. main -> 1d array size 1 -> singleton hangs at the singleton step, but they are all still nested. Splitting construction into two parts, main -> 1d array size 1 and main -> array entry method -> singleton constructor, also hangs. Finally, calling main -> 1d array constructor, main -> 1d entry -> 1d entry -> singleton constructor hangs as well.
Any recent work on this to report?
Original issue: https://charm.cs.illinois.edu/redmine/issues/1940
examples/charm++/fib hangs in SMP mode when using randomized queues.
This issue and the nodegroup one can also be reproduced here: https://github.com/yuchenp/smp-rq-problem