Closed charliegracie closed 6 years ago
One of the travis jobs for the #2115 just had the same hang. Sadly we can not get any diagnostic information from the failing travis build. Interestingly enough though none of the new asserts triggered so that gives some information.
This problem may not be limited to Z. For reference, see the two failures here in ThreadAbortTest on x86 Linux:
https://travis-ci.org/eclipse/omr/builds/316573235?utm_source=github_status&utm_medium=notification
This reproduced in the Linux 390 CI build yesterday and I was able to gather the following output/data:
Test Output:
15:31:33 ./omrthreadtest
15:31:33 [==========] Running 79 tests from 10 test cases.
15:31:33 [----------] 1 test from ThreadAbortDeathTest
15:31:33 [----------] 1 test from ThreadAbortDeathTest (87 ms total)
15:31:33
15:31:33 [----------] 3 tests from ThreadAbortTest
15:31:33 abortTest.cpp:256: Failure
15:31:33 Expected: 0
15:31:33 To be equal to: omrthread_monitor_enter(testdata->exitSync)
15:31:33 Which is: 9
15:31:33 abortTest.cpp:258: Failure
15:31:33 Expected: 0
15:31:33 To be equal to: omrthread_monitor_notify(testdata->exitSync)
15:31:33 Which is: 1
Info from gdb attached to the process
(gdb) f 4
#4 0x000000008001c9ec in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:210
(gdb) p testdata
$1 = {exitSync = 0x85027c58, waitSync = 0x85027ce8, waiting = 1, aborted = 1, rc = 9}
(gdb) p *testdata.exitSync
$3 = {count = 0, owner = 0x0, waiting = 0x8502ae20, flags = 270532608, userData = 0, tracing = 0x0, name = 0x800bcca6 "&testdata.exitSync", pinCount = 0, spinlockState = 0,
spinCount1 = 256, spinCount2 = 32, spinCount3 = 45, blocking = 0x0, sampleCounter = 0, customSpinOptions = 0x0, spinThreads = 0, mutex = 0x85031468, notifyAllWaiting = 0x0}
(gdb) p *testdata.waitSync
$4 = {count = 0, owner = 0x0, waiting = 0x0. , flags = 270532608, userData = 0, tracing = 0x0, name = 0x800bccf2 "&testdata.waitSync", pinCount = 0, spinlockState = 0,
spinCount1 = 256, spinCount2 = 32, spinCount3 = 45, blocking = 0x0, sampleCounter = 0, customSpinOptions = 0x0, spinThreads = 0, mutex = 0x85031578, notifyAllWaiting = 0x0}
(gdb) p *t
$5 = {library = 0xfffffb50800f5c78, attachcount = 0, priority = 5, monitor = 0x0, next = 0x0, prev = 0x0, tls = {0x0 <repeats 124 times>},
entrypoint = 0x8001d358 <waitingMain(void*)>, entryarg = 0x3ffc267e858, flags = 4196384, tid = 36507, interrupter = 0x0, category = 1, effective_category = 1,
lastCategorySwitchTime = 0, tracing = 0x0, waitNumber = 0, lockedmonitorcount = 0, os_errno = -1, handle = 4396289370384, condition = 0x85030758, mutex = 0x85030c98,
stacksize = 32768, jumpBuffer = 0x0, destroyed_monitor_head = 0x0, destroyed_monitor_tail = 0x0, key_deletion_attempts = 0}
* 1 Thread 0x3ff97478720 (LWP 36502) "omrthreadtest" 0x000003ff9728d7d4 in __pthread_cond_wait (cond=0x8502cf48, mutex=0x85031468) at pthread_cond_wait.c:186
#0 0x000003ff9728d7d4 in __pthread_cond_wait (cond=0x8502cf48, mutex=0x85031468) at pthread_cond_wait.c:186
#1 0x0000000080085e80 in monitor_wait_original (interruptible=0, nanos=0, millis=0, monitor=0x85027c58, self=0x8502ae20) at ./common/omrthread.c:4435
#2 monitor_wait (interruptible=0, nanos=0, millis=0, monitor=0x85027c58) at ./common/omrthread.c:4290
#3 omrthread_monitor_wait (monitor=0x85027c58) at ./common/omrthread.c:4160
#4 0x000000008001c9ec in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:210
#5 0x000000008007c11a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x800c3cda "the test body", method=<optimized out>,
object=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:2402
#6 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x85030480, method=<optimized out>, location=0x800c3cda "the test body")
at ../../third_party/gtest-1.8.0/src/gtest.cc:2438
#7 0x000000008006c1d0 in testing::Test::Run (this=0x85030480) at ../../third_party/gtest-1.8.0/src/gtest.cc:2475
#8 0x000000008006c2aa in testing::TestInfo::Run (this=0x8501e880) at ../../third_party/gtest-1.8.0/src/gtest.cc:2656
#9 0x000000008006c402 in testing::TestCase::Run (this=0x8501e520) at ../../third_party/gtest-1.8.0/src/gtest.cc:2774
#10 0x000000008006fbf6 in testing::internal::UnitTestImpl::RunAllTests (this=0x8501ddb0) at ../../third_party/gtest-1.8.0/src/gtest.cc:4651
#11 0x0000000080070002 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=<optimized out>, method=<optimized out>,
object=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:2402
#12 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x800c400e "auxiliary test code (environments or event listeners)",
method=<optimized out>, object=0x8501ddb0) at ../../third_party/gtest-1.8.0/src/gtest.cc:2438
#13 testing::UnitTest::Run (this=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:4262
#14 0x0000000080031a88 in RUN_ALL_TESTS () at ../../third_party/gtest-1.8.0/include/gtest/gtest.h:2233
#15 testMain (argc=1, argv=<optimized out>, envp=<optimized out>) at main.cpp:35
#16 0x000003ff96e22ece in __libc_start_main (main=0x80016a98 <main(int, char**, char**)>, argc=<optimized out>, argv=0x3ffc267ef98, init=<optimized out>,
fini=0x800bcb80 <__libc_csu_fini>, rtld_fini=0x3ff97411210 <_dl_fini>, stack_end=0x3ffc267eee0) at libc-start.c:291
#17 0x000000008001b966 in _start ()
Though thread tests run fine (don't crash on gdb), one of the thread tests ( death test / abort test) crashes on Valgrind. If I get it right, It should exit but not crash.
valgrind --main-stacksize=10000000000 --track-origins=yes --leak-check=full --log-fd=1 ./omrthreadtest | less -R
==3749== Invalid read of size 8
==3749== at 0x1881C6: threadInterrupt (omrthread.c:2697)
==3749== by 0x187F26: omrthread_abort (omrthread.c:2554)
==3749== by 0x119568: abortNull() (abortTest.cpp:60)
==3749== by 0x1192A4: ThreadAbortDeathTest_Nullpointer_Test::TestBody() (abortTest.cpp:53)
==3749== by 0x175A4A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)()
, char const*) (gtest.cc:2402)
==3749== by 0x16FBB0: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), c
har const*) (gtest.cc:2438)
==3749== by 0x153767: testing::Test::Run() (gtest.cc:2474)
==3749== by 0x1540A1: testing::TestInfo::Run() (gtest.cc:2656)
==3749== by 0x15473E: testing::TestCase::Run() (gtest.cc:2774)
==3749== by 0x15B669: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4651)
==3749== by 0x176C10: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::Unit
TestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2402)
==3749== by 0x1709EC: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTes
tImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2438)
==3749== Address 0x478 is not stack'd, malloc'd or (recently) free'd
==3749==
==3749==
==3749== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3749== Access not within mapped region at address 0x478
==3749== at 0x1881C6: threadInterrupt (omrthread.c:2697)
==3749== by 0x187F26: omrthread_abort (omrthread.c:2554)
==3749== by 0x119568: abortNull() (abortTest.cpp:60)
==3749== by 0x1192A4: ThreadAbortDeathTest_Nullpointer_Test::TestBody() (abortTest.cpp:53)
==3749== by 0x175A4A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2402)
==3749== by 0x16FBB0: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2438)
==3749== by 0x153767: testing::Test::Run() (gtest.cc:2474)
==3749== by 0x1540A1: testing::TestInfo::Run() (gtest.cc:2656)
==3749== by 0x15473E: testing::TestCase::Run() (gtest.cc:2774)
==3749== by 0x15B669: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4651)
==3749== by 0x176C10: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2402)
==3749== by 0x1709EC: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2438)
==3749== If you believe this happened as a result of a stack
==3749== overflow in your program's main thread (unlikely but
==3749== possible), you can try to increase the size of the
==3749== main thread stack using the --main-stacksize= flag.
==3749== The main thread stack size used in this run was 10000003072.
omrthread.c:2697
is THREAD_LOCK(thread, CALLER_INTERRUPT_THREAD);
in function threadInterrupt(omrthread_t thread, uintptr_t interruptFlag)
@Varun-garg is that failure 100% reproducible?
@charliegracie I will check and let you know shortly.
@charliegracie, yup I got the same error with default configuration on Ubuntu 16.04 (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)
). Code is also up-to-date with origin.
Ok upon further review that is failing in a slightly different test case than I am investigating. The valgrind is pointing out issues in the abortNull test case. ~I will see if I can fix this issue at the same time.~ Valgrind is pointing out that the code is dereferencing NULL, which is the expected behaviour of the test. The test is verifying that we crash if NULL is passed in. I am not sure the test is truly valid but it is currently testing that so I think the valgrind error is a red herring.
I believe I have figured out the issue and come up with a solution.
The issue is related to the case where a thread->monitor field is set during a call monitor_enter, which means the thread has been blocked on it, after a thread has been aborted. Threads need to be able to continue to call omrthread_monitor_enter, omrthread_monitor_notify* and omrthread_monitor_exit after they have been aborted so that they can clean up / release and resources. There is a special case in omrthread.c::monitor_enter_three_tier near the end that cleans up the thread->monitor field and handles the aborted thread case. The aborted thread case is not properly wrapped with the if (SET_ABORTABLE == isAbortable) {
check like everywhere else so the call to monitor enter fails with an interrupted state. This is the incorrect behaviour which seems to be resolved by adding the proper if (SET_ABORTABLE == isAbortable) {
around the check to see if the thread has been aborted.
Thanks for explaining @charliegracie .
Occasionally the omrthreadtest hangs when running merge or pull request builds. The test always hangs in the ThreadAbortTest.
Today I was able to log into the machine and get a backtrace for the process while it was hung. There was only 1 thread in the system. Here is the backtrace.