charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Singleton chare and nodegroup creation hangs with non-bitvec queues in SMP mode #1940

Open stwhite91 opened 6 years ago

stwhite91 commented 6 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1940


examples/charm++/fib hangs in SMP mode when using randomized queues.

This issue and the nodegroup one can also be reproduced here: https://github.com/yuchenp/smp-rq-problem

lvkale commented 5 years ago

Original date: 2019-03-14 19:30:33


Michael, make at least some update (test with a simple program on a couple of machines) by next week.

mprobson commented 5 years ago

Original date: 2019-03-28 17:25:17


Tried to replicate using fib on various platforms and machines:

stwhite91 commented 5 years ago

Original date: 2019-03-28 17:30:54


What do you mean by it wouldn't build for netlrts and mpi? Charm didn't build, or the example didn't build? Can you post the output?

mprobson commented 5 years ago

Original date: 2019-03-28 17:41:04


In both cases, charm failed to build. In the first (netlrts) case, adding a fixed-width priority type (e.g. int) enabled charm to build. -I don't have the output but I can recreate it and post it here.-

Here's the build line from comet:

./build charm++ mpi-linux-x86_64 smp --enable-randomized-msgq -j8

And the build error:

checking "whether C++ compiler supports C++11 with '-h std=c++11'"... "no"
Charm++ requires C++11 support, but doesn't know the flag to enable it

For Intel's compiler please see
https://charm.cs.illinois.edu/redmine/issues/1560
about making a suitable version of gcc/g++/libstdc++ available

For Blue Gene/Q please use the Clang compiler
*** Please find detailed output in tmp/charmconfig.out ***
gmake[1]: Leaving directory `/home/mprobson/charm/mpi-linux-x86_64-smp/tmp'
gmake: *** [headers] Error 2
-------------------------------------------------
Charm++ NOT BUILT. Either cd into mpi-linux-x86_64-smp/tmp and try
to resolve the problems yourself, visit
http://charm.cs.illinois.edu/
for more information. Otherwise, email the developers at charm@cs.illinois.edu

Turns out for courage it was also dying on the prio-type incompatibility. Changing the priority type from the default of bitvec to int fixes the build problem and replicates the hang.
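To make the constraint above concrete, here is a minimal sketch of assembling a build line that pairs randomized queues with a fixed-width priority type. It only uses the two flags quoted in this thread (`--enable-randomized-msgq` and `--with-prio-type=`). The `build_line` helper and its default of `int` are assumptions made for illustration and are not part of Charm++. The script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: build_line is a hypothetical helper, not a Charm++ tool.
# Usage: build_line TARGET ARCH [PRIO_TYPE]
build_line() {
    target=$1
    arch=$2
    prio=${3:-int}   # assumed default for this sketch; Charm++ itself defaults to bitvec
    if [ "$prio" = bitvec ]; then
        # Per this thread, randomized queues need a non-bitvec priority type.
        echo "randomized queues need a fixed-width priority type, not bitvec" >&2
        return 1
    fi
    echo "./build $target $arch smp --enable-randomized-msgq --with-prio-type=$prio"
}

# Print the two build lines discussed in this issue.
build_line charm++ mpi-linux-x86_64
build_line LIBS netlrts-linux-x86_64 int
```

Rejecting `bitvec` up front mirrors the configure-time check mentioned later in the thread, so a bad combination is caught before any compilation starts.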

mprobson commented 5 years ago

Original date: 2019-03-28 21:25:20


With some further testing, this actually appears to be an error due to the combination of SMP mode and non-bitvec/fixed length priorities, which we are forced to use with randomized queues.

epmikida commented 5 years ago

Adding some detail here. The hang also shows up in the (currently in review) example program for within node broadcasts: https://charm.cs.illinois.edu/gerrit/c/charm/+/5068

The hang can be replicated on my Linux machine using the following charm build:

./build LIBS netlrts-linux-x86_64 smp --enable-randomized-msgq --with-prio-type=int

In this case, though, the nodegroup seems to get created just fine. The printout in the nodegroup constructor appears, and control returns to the main c-tor, which completes. We then hit the runTests method in main, which means charm initialization was able to fully complete. The hang occurs when sending any messages to the nodegroup.

I tried both broadcasts and changing the tests to send p2p messages, and the hang still occurs. The prints in the nodegroup methods never appear, and even adding QD doesn't side-step the hang. So the messages are clearly being sent, and Charm++ is aware it is waiting for them, but for some reason they never get received.

It replicates with +p8 ++ppn8, +p8 ++ppn4, +p1 ++ppn1 (as well as probably others), so it doesn't matter if you have multiple nodes, a single node, or even a single core.
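A minimal sketch of checking those three configurations for a hang, assuming a timeout is an acceptable stand-in for "never finishes". The program name `./repro`, the 60-second limit, and the `cmd_for`, `try_config`, and `run_all` helpers are all hypothetical. `++ppn` is written with a space here, and depending on the setup charmrun may need further options. By default the script only prints the commands (dry run).

```shell
#!/bin/sh
# Sketch only: names and limits below are assumptions, not from the issue.
BINARY="${BINARY:-./repro}"   # hypothetical reproducer binary
LIMIT="${LIMIT:-60}"          # seconds before a run is treated as hung (assumed)
DRY_RUN="${DRY_RUN:-1}"       # 1 = print commands only, 0 = actually run them

# Print the charmrun line for a total PE count ($1) and PEs per node ($2).
cmd_for() {
    echo "./charmrun $BINARY +p$1 ++ppn $2"
}

# Run (or print) one configuration; timeout(1) exits with 124 on expiry.
try_config() {
    line=$(cmd_for "$1" "$2")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$line"
        return 0
    fi
    timeout "$LIMIT" $line
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: $line"
    fi
    return "$status"
}

# The three +p / ++ppn combinations reported in this comment.
run_all() {
    try_config 8 8
    try_config 8 4
    try_config 1 1
}

run_all
```

Setting `DRY_RUN=0` runs each configuration for real. A `HANG:` line for every combination would match the observation that node and core counts do not matter.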

mprobson commented 5 years ago

Per discussion with @epmikida and @stwhite91 on Slack, I implemented randomized queues to work in the default (no priority) case. The sample application fib works fine, i.e. does not hang, when compiled with smp and randomized queues but no priority. So it is pretty clear that this is a bad interaction between smp and fixed-length priorities, e.g. double and int. The no-priority case includes building charm with -DCMK_NO_MSG_PRIOS=1 passed directly as part of the ./build line (thanks @evan-charmworks).

epmikida commented 5 years ago

Can you make a pull request to merge that change in? I see no reason to limit randomized queues to work only with priorities.

mprobson commented 5 years ago

Sure, but I need to figure out how to set up configure correctly. Right now it requires prio != bitvec, but I explicitly disable priorities by passing -DCMK_NO_MSG_PRIOS=1. Unfortunately, getting rid of the check in configure isn't the right thing. Maybe adding a new flag where prio=no or something? Suggestions?

mprobson commented 5 years ago

Interestingly, fib hangs in singleton chare creation even if I move it outside the main constructor, e.g. main -> 1d array size 1 -> singleton hangs at the singleton step, but they are all still nested. Splitting construction into two parts, main -> 1d array size 1 and main -> array entry method -> singleton constructor, also hangs. Finally, calling main -> 1d array constructor, main -> 1d entry -> 1d entry -> singleton constructor hangs as well.

ericjbohm commented 4 years ago

Any recent work on this to report?