charliegracie commented 6 years ago

Occasionally the omrthreadtest hangs when running merge or pull request builds. The test always hangs in the ThreadAbortTest.

Today I was able to log into the machine and get a backtrace for the process while it was hung. There was only 1 thread in the system. Here is the backtrace.

#0  0x000003ffaac8d7d4 in __pthread_cond_wait (cond=0x2aa31849ea8, mutex=0x2aa3184e4d8) at pthread_cond_wait.c:186
#1  0x000002aa30504de8 in monitor_wait_original (interruptible=0, nanos=0, millis=0, monitor=0x2aa31844be8, self=0x2aa31847d80) at ./common/omrthread.c:4435
#2  monitor_wait (interruptible=0, nanos=0, millis=0, monitor=0x2aa31844be8) at ./common/omrthread.c:4290
#3  omrthread_monitor_wait (monitor=0x2aa31844be8) at ./common/omrthread.c:4160
#4  0x000002aa304a31d4 in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:207
#5  0x000002aa304fc3fe in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x2aa305439e2 "the test body", method=<optimized out>, 
    object=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:2402
#6  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x2aa3184d3c0, 
    method=(void (testing::Test::*)(testing::Test * const)) 0x2aa304a3108 <ThreadAbortTest_Waiting_Test::TestBody()>, location=0x2aa305439e2 "the test body")
    at ../../third_party/gtest-1.8.0/src/gtest.cc:2438

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x3ffaae72720 (LWP 63510) "omrthreadtest" 0x000002aa304a31d4 in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:207

charliegracie commented 6 years ago

One of the travis jobs for the #2115 just had the same hang. Sadly we can not get any diagnostic information from the failing travis build. Interestingly enough though none of the new asserts triggered so that gives some information.

charliegracie commented 6 years ago

2106 also had the same hang in one of the Travis jobs as well but it did not contain the extra asserts.

0xdaryl commented 6 years ago

This problem may not be limited to Z. For reference, see the two failures here in ThreadAbortTest on x86 Linux:

https://travis-ci.org/eclipse/omr/builds/316573235?utm_source=github_status&utm_medium=notification

charliegracie commented 6 years ago

This reproduced in the Linux 390 CI build yesterday and I was able to gather the following output/data:

Test Output:
15:31:33 ./omrthreadtest
15:31:33 [==========] Running 79 tests from 10 test cases.
15:31:33 [----------] 1 test from ThreadAbortDeathTest
15:31:33 [----------] 1 test from ThreadAbortDeathTest (87 ms total)
15:31:33 
15:31:33 [----------] 3 tests from ThreadAbortTest
15:31:33 abortTest.cpp:256: Failure
15:31:33       Expected: 0
15:31:33 To be equal to: omrthread_monitor_enter(testdata->exitSync)
15:31:33       Which is: 9
15:31:33 abortTest.cpp:258: Failure
15:31:33       Expected: 0
15:31:33 To be equal to: omrthread_monitor_notify(testdata->exitSync)
15:31:33       Which is: 1

Info from gdb attached to the process
(gdb) f 4
#4  0x000000008001c9ec in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:210
(gdb) p testdata
$1 = {exitSync = 0x85027c58, waitSync = 0x85027ce8, waiting = 1, aborted = 1, rc = 9}

(gdb) p *testdata.exitSync
$3 = {count = 0, owner = 0x0, waiting = 0x8502ae20, flags = 270532608, userData = 0, tracing = 0x0, name = 0x800bcca6 "&testdata.exitSync", pinCount = 0, spinlockState = 0, 
  spinCount1 = 256, spinCount2 = 32, spinCount3 = 45, blocking = 0x0, sampleCounter = 0, customSpinOptions = 0x0, spinThreads = 0, mutex = 0x85031468, notifyAllWaiting = 0x0}
(gdb) p *testdata.waitSync
$4 = {count = 0, owner = 0x0, waiting = 0x0.              , flags = 270532608, userData = 0, tracing = 0x0, name = 0x800bccf2 "&testdata.waitSync", pinCount = 0, spinlockState = 0, 
  spinCount1 = 256, spinCount2 = 32, spinCount3 = 45, blocking = 0x0, sampleCounter = 0, customSpinOptions = 0x0, spinThreads = 0, mutex = 0x85031578, notifyAllWaiting = 0x0}

(gdb) p *t
$5 = {library = 0xfffffb50800f5c78, attachcount = 0, priority = 5, monitor = 0x0, next = 0x0, prev = 0x0, tls = {0x0 <repeats 124 times>}, 
  entrypoint = 0x8001d358 <waitingMain(void*)>, entryarg = 0x3ffc267e858, flags = 4196384, tid = 36507, interrupter = 0x0, category = 1, effective_category = 1, 
  lastCategorySwitchTime = 0, tracing = 0x0, waitNumber = 0, lockedmonitorcount = 0, os_errno = -1, handle = 4396289370384, condition = 0x85030758, mutex = 0x85030c98, 
  stacksize = 32768, jumpBuffer = 0x0, destroyed_monitor_head = 0x0, destroyed_monitor_tail = 0x0, key_deletion_attempts = 0}

* 1    Thread 0x3ff97478720 (LWP 36502) "omrthreadtest" 0x000003ff9728d7d4 in __pthread_cond_wait (cond=0x8502cf48, mutex=0x85031468) at pthread_cond_wait.c:186

#0  0x000003ff9728d7d4 in __pthread_cond_wait (cond=0x8502cf48, mutex=0x85031468) at pthread_cond_wait.c:186
#1  0x0000000080085e80 in monitor_wait_original (interruptible=0, nanos=0, millis=0, monitor=0x85027c58, self=0x8502ae20) at ./common/omrthread.c:4435
#2  monitor_wait (interruptible=0, nanos=0, millis=0, monitor=0x85027c58) at ./common/omrthread.c:4290
#3  omrthread_monitor_wait (monitor=0x85027c58) at ./common/omrthread.c:4160
#4  0x000000008001c9ec in ThreadAbortTest_Waiting_Test::TestBody (this=<optimized out>) at abortTest.cpp:210
#5  0x000000008007c11a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x800c3cda "the test body", method=<optimized out>, 
    object=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:2402
#6  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x85030480, method=<optimized out>, location=0x800c3cda "the test body")
    at ../../third_party/gtest-1.8.0/src/gtest.cc:2438
#7  0x000000008006c1d0 in testing::Test::Run (this=0x85030480) at ../../third_party/gtest-1.8.0/src/gtest.cc:2475
#8  0x000000008006c2aa in testing::TestInfo::Run (this=0x8501e880) at ../../third_party/gtest-1.8.0/src/gtest.cc:2656
#9  0x000000008006c402 in testing::TestCase::Run (this=0x8501e520) at ../../third_party/gtest-1.8.0/src/gtest.cc:2774
#10 0x000000008006fbf6 in testing::internal::UnitTestImpl::RunAllTests (this=0x8501ddb0) at ../../third_party/gtest-1.8.0/src/gtest.cc:4651
#11 0x0000000080070002 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=<optimized out>, method=<optimized out>, 
    object=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:2402
#12 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x800c400e "auxiliary test code (environments or event listeners)", 
    method=<optimized out>, object=0x8501ddb0) at ../../third_party/gtest-1.8.0/src/gtest.cc:2438
#13 testing::UnitTest::Run (this=<optimized out>) at ../../third_party/gtest-1.8.0/src/gtest.cc:4262
#14 0x0000000080031a88 in RUN_ALL_TESTS () at ../../third_party/gtest-1.8.0/include/gtest/gtest.h:2233
#15 testMain (argc=1, argv=<optimized out>, envp=<optimized out>) at main.cpp:35
#16 0x000003ff96e22ece in __libc_start_main (main=0x80016a98 <main(int, char**, char**)>, argc=<optimized out>, argv=0x3ffc267ef98, init=<optimized out>, 
    fini=0x800bcb80 <__libc_csu_fini>, rtld_fini=0x3ff97411210 <_dl_fini>, stack_end=0x3ffc267eee0) at libc-start.c:291
#17 0x000000008001b966 in _start ()

Varun-garg commented 6 years ago

Though thread tests run fine (don't crash on gdb), one of the thread tests ( death test / abort test) crashes on Valgrind. If I get it right, It should exit but not crash.

valgrind --main-stacksize=10000000000 --track-origins=yes --leak-check=full --log-fd=1 ./omrthreadtest | less -R

==3749== Invalid read of size 8
==3749==    at 0x1881C6: threadInterrupt (omrthread.c:2697)
==3749==    by 0x187F26: omrthread_abort (omrthread.c:2554)
==3749==    by 0x119568: abortNull() (abortTest.cpp:60)
==3749==    by 0x1192A4: ThreadAbortDeathTest_Nullpointer_Test::TestBody() (abortTest.cpp:53)
==3749==    by 0x175A4A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)()
, char const*) (gtest.cc:2402)
==3749==    by 0x16FBB0: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), c
har const*) (gtest.cc:2438)
==3749==    by 0x153767: testing::Test::Run() (gtest.cc:2474)
==3749==    by 0x1540A1: testing::TestInfo::Run() (gtest.cc:2656)
==3749==    by 0x15473E: testing::TestCase::Run() (gtest.cc:2774)
==3749==    by 0x15B669: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4651)
==3749==    by 0x176C10: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::Unit
TestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2402)
==3749==    by 0x1709EC: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTes
tImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2438)
==3749==  Address 0x478 is not stack'd, malloc'd or (recently) free'd
==3749== 
==3749== 
==3749== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3749==  Access not within mapped region at address 0x478
==3749==    at 0x1881C6: threadInterrupt (omrthread.c:2697)
==3749==    by 0x187F26: omrthread_abort (omrthread.c:2554)
==3749==    by 0x119568: abortNull() (abortTest.cpp:60)
==3749==    by 0x1192A4: ThreadAbortDeathTest_Nullpointer_Test::TestBody() (abortTest.cpp:53)
==3749==    by 0x175A4A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2402)
==3749==    by 0x16FBB0: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2438)
==3749==    by 0x153767: testing::Test::Run() (gtest.cc:2474)
==3749==    by 0x1540A1: testing::TestInfo::Run() (gtest.cc:2656)
==3749==    by 0x15473E: testing::TestCase::Run() (gtest.cc:2774)
==3749==    by 0x15B669: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:4651)
==3749==    by 0x176C10: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2402)
==3749==    by 0x1709EC: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2438)
==3749==  If you believe this happened as a result of a stack
==3749==  overflow in your program's main thread (unlikely but
==3749==  possible), you can try to increase the size of the
==3749==  main thread stack using the --main-stacksize= flag.
==3749==  The main thread stack size used in this run was 10000003072.

omrthread.c:2697 is THREAD_LOCK(thread, CALLER_INTERRUPT_THREAD); in function threadInterrupt(omrthread_t thread, uintptr_t interruptFlag)

charliegracie commented 6 years ago

@Varun-garg is that failure 100% reproducible?

Varun-garg commented 6 years ago

@charliegracie I will check and let you know shortly.

Varun-garg commented 6 years ago

@charliegracie, yup I got the same error with default configuration on Ubuntu 16.04 (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)). Code is also up-to-date with origin.

charliegracie commented 6 years ago

Ok upon further review that is failing in a slightly different test case than I am investigating. The valgrind is pointing out issues in the abortNull test case. ~I will see if I can fix this issue at the same time.~ Valgrind is pointing out that the code is dereferencing NULL, which is the expected behaviour of the test. The test is verifying that we crash if NULL is passed in. I am not sure the test is truly valid but it is currently testing that so I think the valgrind error is a red herring.

charliegracie commented 6 years ago

I believe I have figured out the issue and come up with a solution.

The issue is related to the case where a thread->monitor field is set during a call monitor_enter, which means the thread has been blocked on it, after a thread has been aborted. Threads need to be able to continue to call omrthread_monitor_enter, omrthread_monitor_notify* and omrthread_monitor_exit after they have been aborted so that they can clean up / release and resources. There is a special case in omrthread.c::monitor_enter_three_tier near the end that cleans up the thread->monitor field and handles the aborted thread case. The aborted thread case is not properly wrapped with the if (SET_ABORTABLE == isAbortable) { check like everywhere else so the call to monitor enter fails with an interrupted state. This is the incorrect behaviour which seems to be resolved by adding the proper if (SET_ABORTABLE == isAbortable) { around the check to see if the thread has been aborted.

Varun-garg commented 6 years ago

Thanks for explaining @charliegracie .

eclipse / omr

Occasional hangs in omrthreadtest on linux #2113

2106 also had the same hang in one of the Travis jobs as well but it did not contain the extra asserts.