Open nikhil-jain opened 10 years ago
Original date: 2013-08-05 22:26:34
Which build is producing these failures?
git pull --rebase
git checkout remotes/origin/ramv/randomized-msgq
./build charm++ net-linux-x86_64 -j16 --suffix randq-debug -O0 -g
I get no megatest failures from that on intellect at +p4, +p8, +p10
Is this perhaps an -O2 only issue?
Original date: 2013-08-07 19:23:40
Eric, per Ram's commit message on that branch, you need to build with --enable-randomized-msgq. This apparently requires a prio type other than bitvec as well:
build: --enable-randomized-msgq should now aid debugging / correctness
This commit introduces support for randomizing the scheduler queue. Message priorities will not be respected when picking the next charm message for execution. This is intended to facilitate application debugging and should help detect:
- applications / protocols that erroneously assume a message ordering between any sender / receiver pair
- applications which erroneously depend on priorities for correctness
- race conditions in the application
Requirements:
- Currently, a charm build which has the STL-based msg queue enabled.
- This requires that charm be built --with-prio-type != bitvec
- Supported prio types can be any datatype that can be used as a template parameter
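To make the idea in the commit message concrete, here is a minimal sketch of a queue that hands back pending messages in random order. It is an illustration only, not Charm++'s actual scheduler queue: the class name `RandomizedQueue`, its interface, and the seeded `std::mt19937` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is NOT Charm++'s scheduler queue.
template <typename Msg>
class RandomizedQueue {
public:
    // A fixed seed makes a failing message order reproducible.
    explicit RandomizedQueue(unsigned seed) : rng_(seed) {}

    void push(Msg m) { items_.push_back(std::move(m)); }
    bool empty() const { return items_.empty(); }
    std::size_t size() const { return items_.size(); }

    // Remove and return a uniformly chosen pending message, ignoring
    // both insertion order and any priority the message carries.
    Msg pop() {
        assert(!items_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, items_.size() - 1);
        // Swap the chosen element to the back so removal is O(1).
        std::swap(items_[pick(rng_)], items_.back());
        Msg m = std::move(items_.back());
        items_.pop_back();
        return m;
    }

private:
    std::vector<Msg> items_;
    std::mt19937 rng_;
};
```

Code that is only correct when messages come out in the order they went in will eventually fail against a queue like this, which is the kind of bug the build option is meant to expose.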
Original date: 2013-08-07 20:00:00
A trick for the people working on the megatest failures: megatest comes with a command line flag, -repeat, that makes multiple runs very easy, since it will keep going until the process aborts or gets killed.
Original date: 2013-08-07 22:12:25
multisectiontest had a race condition in its setup, which implicitly assumed an ordering between two reductions. It now (commit 165d1f63ea6b3e4ad9c881910dac68554d3785e0) properly waits for both reductions to complete before triggering the actual test.
Megatest is now crashing for me (after many runs) in varraystest, which must be a different bug.
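For anyone fixing similar setup races, the pattern behind the multisectiontest fix can be sketched roughly as below. This is plain C++ rather than Charm++ code, and it is not the code from that commit: `TwoPhaseGate`, `firstDone`, and `secondDone` are hypothetical names standing in for the two reduction callbacks.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper; names are illustrative and not taken from multisectiontest.
class TwoPhaseGate {
public:
    explicit TwoPhaseGate(std::function<void()> on_ready)
        : on_ready_(std::move(on_ready)) {}

    // Each completion may arrive at most once, in either order.
    void firstDone()  { assert(!first_);  first_ = true;  maybeFire(); }
    void secondDone() { assert(!second_); second_ = true; maybeFire(); }

private:
    // Start the real work only after both completions have been seen.
    void maybeFire() {
        if (first_ && second_) on_ready_();
    }

    std::function<void()> on_ready_;
    bool first_ = false;
    bool second_ = false;
};
```

The point of the design is that neither callback starts the test on its own, so the result no longer depends on which reduction finishes first.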
Original date: 2013-08-08 19:03:08
The various var tests all share a logical flaw: they assume exact equality of floating-point numbers. This can create occasional false-positive errors from round-off. Not really related to randq per se, but after fixing these comparisons I have megatest running thousands of tests with -repeat in the randomized-queue build without fault. Megatest with -repeat does have a memory leak (probably many), which we should address at some point.
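One common way to avoid the floating-point comparison flaw described above is a tolerance-based check, sketched below. This is not the actual change made to the var tests: the helper name `approxEqual` and the default tolerances are assumptions for illustration, and suitable tolerances depend on the computation being checked.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper; the tolerances are illustrative defaults,
// not values taken from megatest.
inline bool approxEqual(double a, double b,
                        double rel_tol = 1e-9, double abs_tol = 1e-12) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (a == b) return true;  // exact match, including equal infinities
    // NaN or a lone infinity never compares approximately equal.
    if (!std::isfinite(a) || !std::isfinite(b)) return false;
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    // The relative tolerance handles large values; the absolute
    // tolerance handles values near zero.
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

A test would then check `approxEqual(result, expected)` instead of `result == expected`.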
Original date: 2013-08-13 01:26:09
Can someone give the exact steps for building/running with the randomized queue?
It seems this is not sufficient:
./build charm++ net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j12 -O3 -g
Original date: 2013-08-28 19:02:56
Please retest after recent fixes, and open distinct child issues for distinct detected problems.
Original date: 2015-02-11 22:51:01
I am re-running tests, and breaking out individual bugs for failing tests. This branch has been merged into charm, so the process I'm using for most of the tests is the following:
checkout the latest charm branch
./build charm++ net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j16 --suffix randq-debug -O3 -g
export TESTOPTS="++local" (For some tests I also added a +p option)
I'm then running each of the tests in the tests directory a few dozen times on panache. Below are tests that either crash or hang. A child bug will be made for each one. There may be a bug in randomized queues, or it could just be that the test is not written to take randomized queues into account. I ignored test directories without test targets (e.g., io, jacobi), as well as tests not included in the master Makefile (e.g., hello_crosscoruption, ping). I also verified that all of these tests pass when not using randomized queues.
charm++/alignment: Crashes, failed assertion.
charm++/communication_overhead: Hangs
charm++/delegation/multicast: Hangs
charm++/queue: Hangs (despite being aware of randomized queueing via an #if)
charm++/sdag/migration: Hangs
All converse tests pass.
./build AMPI net-linux-x86_64 --with-prio-type=int --enable-randomized-msgq -j16 --suffix randq-debug -O3 -g
ampi/megampi: Crashes rarely due to a failed assertion: Broadcast integer from master> expected 123, got 4!
Original issue: https://charm.cs.illinois.edu/redmine/issues/259
Using randomized queues from branch ramv/randomized-msgq, autobuild suffers from crashes/hangs in several places. Some that I was able to note are listed here (each occurs once in a while).
./charmrun ./pgm +p10 ++local
I suspect there are more such issues, but given the enormity of the task, I was not able to capture them all. The bugs may be in the example programs, or inside Charm.