dtr-org / unit-e

A digital currency for a new era of decentralized trust
https://unit-e.io
MIT License
45 stars 15 forks source link

waiter_tests.cpp is floating #933

Closed kostyantyn closed 5 years ago

kostyantyn commented 5 years ago

This test seems to fail randomly. Here is the log: waiter_test.log

frolosofsky commented 5 years ago

I run this test zillion times and didn't reproduce. My best guess is a clock glitch happened during that failed try, but it is extremely unlikely. Anyway I fixed possible glitch in #942 and extended test coverage.

cmihai commented 5 years ago

This test usually runs fine when executed with test_unite, but fails for me when run under contrib/devtools/run-unit-tests.sh.

scravy commented 5 years ago

Makes sense somehow, run unit tests runs with crazy parallelism ;-)

frolosofsky commented 5 years ago

Closing the issue as it seems to be fixed but very difficult to reproduce. According to @cmihai, who is facing with this occasionally, it doesn't happen anymore: https://github.com/dtr-org/unit-e/pull/942#pullrequestreview-225584586.

frolosofsky commented 5 years ago

Happened again.

Failure: waiter_tests from test/waiter_tests.cpp
Running 1 test case...
Test cases order is shuffled using seed: 1108117134
Entering test module "Unit-e Test Suite"
test/waiter_tests.cpp(14): Entering test suite "waiter_tests"
test/waiter_tests.cpp(29): Entering test case "wait_and_wake_test"
test/waiter_tests.cpp(22): fatal error: in "waiter_tests/wait_and_wake_test": critical check now < wait_until has failed
terminate called without an active exception
unknown location(0): fatal error: in "waiter_tests/wait_and_wake_test": signal: SIGABRT (application abort requested)
test/waiter_tests.cpp(22): last checkpoint
test/waiter_tests.cpp(29): Leaving test case "wait_and_wake_test"; testing time: 20075710us
test/waiter_tests.cpp(14): Leaving test suite "waiter_tests"; testing time: 20075723us
cmihai commented 5 years ago

According to https://en.cppreference.com/w/cpp/thread/condition_variable/notify_all :

The notifying thread does not need to hold the lock on the same mutex as the one held by the waiting thread(s); in fact doing so is a pessimization, since the notified thread would immediately block again, waiting for the notifying thread to release the lock.

This is the case in our code: https://github.com/dtr-org/unit-e/blob/ca68df2878645c9f968d2520a5459e5718e48721/src/proposer/waiter.cpp#L20 . Could this be the root of the issue?

scravy commented 5 years ago

It happened again to me:

Failure: waiter_tests from test/waiter_tests.cpp
Running 1 test case...
Entering test module "Unit-e Test Suite"
test/waiter_tests.cpp(14): Entering test suite "waiter_tests"
test/waiter_tests.cpp(29): Entering test case "wait_and_wake_test"
test/waiter_tests.cpp(22): fatal error: in "waiter_tests/wait_and_wake_test": critical check now < wait_until has failed
terminate called without an active exception
unknown location(0): fatal error: in "waiter_tests/wait_and_wake_test": signal: SIGABRT (application abort requested)
test/waiter_tests.cpp(22): last checkpoint
test/waiter_tests.cpp(29): Leaving test case "wait_and_wake_test"; testing time: 28876094us
test/waiter_tests.cpp(14): Leaving test suite "waiter_tests"; testing time: 28876108us
Leaving test module "Unit-e Test Suite"; testing time: 28876198us

*** 2 failures are detected in the test module "Unit-e Test Suite"
make[3]: *** [Makefile:11901: test/waiter_tests.cpp.test] Error 1
frolosofsky commented 5 years ago

Fixed by erasing :)