charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Bugs exposed by use of randomized Q #259

Open nikhil-jain opened 10 years ago

nikhil-jain commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/259


Using randomized queues from branch ramv/randomized-msgq, autobuild suffers from occasional crashes and hangs in several places. The ones I was able to note are listed here.

I suspect there are more such issues, but given the enormity of the task, I was not able to capture them all. The bugs may be in the example programs or inside Charm itself.

ericjbohm commented 5 years ago

Original date: 2013-08-05 22:26:34


Which build is producing these failures?

git pull --rebase
git checkout remotes/origin/ramv/randomized-msgq
./build charm++ net-linux-x86_64 -j16 --suffix randq-debug -O0 -g

I get no megatest failures from that on intellect at +p4, +p8, +p10

Is this perhaps an -O2 only issue?

PhilMiller commented 5 years ago

Original date: 2013-08-07 19:23:40


Eric, per Ram's commit message on that branch, you need to build with --enable-randomized-msgq. This apparently requires a priotype other than bitvec as well:

build: --enable-randomized-msgq should now aid debugging / correctness

This commit introduces support for randomizing the scheduler queue. Message priorities will not be respected when picking the next charm message for execution. This is intended to facilitate application debugging and should help detect:

  • applications / protocols that erroneously assume a message ordering between any sender / receiver pair
  • applications which erroneously depend on priorities for correctness
  • race conditions in the application

Requirements:

  • Currently, a charm build which has the STL-based msg queue enabled.
  • This requires that charm be built --with-prio-type != bitvec
  • Supported prio types can be any datatype that can be used as a template parameter
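
To illustrate the idea in the commit message, here is a minimal sketch of a queue that ignores priorities and hands back a uniformly random pending message. This is not the Charm++ implementation: the class name `RandomizedQueue`, its methods, and the seeding scheme are all hypothetical, chosen only to show why code that depends on ordering or priorities breaks under such a queue.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is not Charm++'s scheduler queue.
template <typename Msg>
class RandomizedQueue {
public:
    explicit RandomizedQueue(unsigned seed) : rng_(seed) {}

    // The priority is accepted (any type works as a template parameter)
    // but deliberately ignored when choosing the next message.
    template <typename Prio>
    void enqueue(Msg msg, const Prio& /*prio*/) {
        msgs_.push_back(std::move(msg));
    }

    bool empty() const { return msgs_.empty(); }
    std::size_t size() const { return msgs_.size(); }

    // Remove and return a uniformly chosen pending message.
    Msg dequeue() {
        assert(!msgs_.empty());
        const std::size_t last = msgs_.size() - 1;
        std::uniform_int_distribution<std::size_t> pick(0, last);
        const std::size_t i = pick(rng_);
        if (i != last) {
            std::swap(msgs_[i], msgs_[last]);  // move the chosen message to the back
        }
        Msg out = std::move(msgs_.back());
        msgs_.pop_back();
        return out;
    }

private:
    std::vector<Msg> msgs_;
    std::mt19937 rng_;
};
```

Every enqueued message is still delivered exactly once; only the delivery order changes, so any failure seen under such a queue points at an ordering or priority assumption in the caller.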

PhilMiller commented 5 years ago

Original date: 2013-08-07 20:00:00


A trick for the people working on the megatest failures: it comes with a command line flag -repeat that makes multiple runs very easy, since it will keep going until the process aborts or gets killed.

ericjbohm commented 5 years ago

Original date: 2013-08-07 22:12:25


multisectiontest had a race condition in its setup, which implicitly assumed an ordering between two reductions. It now (commit 165d1f63ea6b3e4ad9c881910dac68554d3785e0) properly waits for both reductions to complete before triggering the actual test.
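
The general shape of a fix like this can be sketched as a small completion counter that fires a continuation only after every expected event has arrived, regardless of arrival order. The `SetupGate` class below is a hypothetical helper for illustration; it is not taken from the referenced commit or from Charm++, and it assumes all calls happen on one thread, as they would inside a single chare's entry methods.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper, not part of Charm++ or multisectiontest.
class SetupGate {
public:
    SetupGate(int expected, std::function<void()> onReady)
        : remaining_(expected), onReady_(std::move(onReady)) {
        assert(expected > 0);
    }

    // Call once from each completion callback, in whatever order they arrive.
    // The continuation runs only when the last expected event has arrived.
    void arrive() {
        assert(remaining_ > 0);
        if (--remaining_ == 0) {
            onReady_();
        }
    }

private:
    int remaining_;
    std::function<void()> onReady_;
};
```

With a gate constructed for two events, each reduction's callback would call `arrive()`, and the test body would start from the continuation rather than from whichever callback the author expected to run last.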

Megatest is now crashing for me (after many runs) in varraystest, which must be a different bug.

ericjbohm commented 5 years ago

Original date: 2013-08-08 19:03:08


The various var tests all share a logical flaw: they assume exact equality of floating point numbers. This can create occasional false positive errors from round-off. This is not really related to randq per se, but after fixing these comparisons I have megatest running thousands of tests with -repeat in the randomized queue build without fault. Megatest with -repeat does have a memory leak (probably many), which we should address at some point.
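
For reference, a tolerance-based comparison of the kind that avoids such round-off failures might look like the sketch below. The function name `nearlyEqual` and the default tolerances are assumptions for illustration; the actual change made to the var tests is not shown in this thread.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper; the tolerances are illustrative assumptions,
// not values taken from the megatest sources.
inline bool nearlyEqual(double a, double b,
                        double relTol = 1e-9, double absTol = 1e-12) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    // Accept if the difference is small in absolute terms (near zero)
    // or small relative to the magnitude of the operands.
    return diff <= std::max(absTol, relTol * scale);
}
```

A check written as `a == b` would then become `nearlyEqual(a, b)`, which tolerates results that differ only because the operations were applied in a different order.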

lifflander commented 5 years ago

Original date: 2013-08-13 01:26:09


Can someone give the exact steps for building/running with the randomized queue?

It seems this is not sufficient:

./build charm++ net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j12 -O3 -g

PhilMiller commented 5 years ago

Original date: 2013-08-28 19:02:56


Please retest after recent fixes, and open distinct child issues for distinct detected problems.

epmikida commented 5 years ago

Original date: 2015-02-11 22:51:01


I am re-running tests, and breaking out individual bugs for failing tests. This branch has been merged into charm, so the process I'm using for most of the tests is the following:

checkout the latest charm branch
./build charm++ net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j16 --suffix randq-debug -O3 -g
export TESTOPTS="++local"
(For some tests I also added a +p option)

I'm then running each of the tests in the tests directory a few dozen times on panache. Below are the tests that either crash or hang. A child bug will be made for each one. There may be a bug in randomized queues, or the test may simply not be written to take randomized queues into account. I ignored test directories without test targets (e.g. io, jacobi), as well as tests not included in the master Makefile (e.g. hello_crosscoruption, ping). I also verified that all of these tests pass when not using randomized queues.

  • charm++/alignment: Crashes, failed assertion.
  • charm++/communication_overhead: Hangs
  • charm++/delegation/multicast: Hangs
  • charm++/queue: Hangs (despite being aware of randomized queueing via an #if)
  • charm++/sdag/migration: Hangs

All converse tests pass.

./build AMPI net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j16 --suffix randq-debug -O3 -g

ampi/megampi: Crashes rarely due to a failed assertion: Broadcast integer from master> expected 123, got 4!