cmdLineTester_GCRegressionTests_2 assert

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.functional_ppc64le_linux_aot_Personal_testList_0/118 cmdLineTester_GCRegressionTests_2 -Xshareclasses:name=test_aot -Xscmx400M -Xscmaxaot256m -XX:+UseCompressedOops -Xjit -Xgcpolicy:balanced

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.functional_ppc64le_linux_aot_Personal_testList_0/118/functional_test_output.tar.gz

=== AOT ITERATION 2 ===

Testing: Ensure no core files have been produced by the preceding tests
Test start time: 2023/04/11 12:30:30 Coordinated Universal Time
Running command: sh -c ls
Time spent starting: 4 milliseconds
Time spent executing: 12 milliseconds
Test result: FAILED
Output from test:
 [OUT] core.20230411.122821.2809655.0001.dmp
 [OUT] foo.%001.log
 [OUT] foo.001.log
 [OUT] foo.002.log
 [OUT] foo.#.log
 [OUT] foo.log
 [OUT] foo.log.001
 [OUT] foo.log.002
 [OUT] foo.log.003
 [OUT] javacore.20230411.122821.2809655.0002.txt
 [OUT] Snap.20230411.122821.2809655.0003.trc
 [OUT] static_temp.txt
 [OUT] verbosegc.20230411.122502.2809278.txt
 [OUT] verbosegc.20230411.122511.2809497.txt

1XECTHTYPE     Current thread history (J9VMThread:0x0000000000217800)
3XEHSTTYPE     12:28:21:539248870 GMT j9mm.107 - * ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK17_ppc64le_linux_aot_Personal/omr/gc/base/ParallelDispatcher.cpp:176: ((false && ((worker_status_reserved == _statusTable[workerID]) || ((0 == _threadsToReserve) && (worker_status_dying == _statusTable[workerID])))))

2CIENVVAR      OPENJ9_JAVA_COMMAND_LINE=/home/jenkins/jenkins-agent/workspace/Test_openjdk17_j9_sanity.functional_ppc64le_linux_aot_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java -Xshareclasses:name=test_aot -Xscmx400M -Xscmaxaot256m -XX:+UseCompressedOops -Xjit -Xgcpolicy:balanced -Dcom.ibm.tools.attach.enable=no -Xdump:system:events=systhrow,filter=java/lang/OutOfMemoryError -Xmx8m -Xms8m -Xalwaysclassgc -Xdisableexcessivegc -cp /home/jenkins/jenkins-agent/workspace/Test_openjdk17_j9_sanity.functional_ppc64le_linux_aot_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/gcRegressionTests/gcRegressionTests.jar com.ibm.tests.garbagecollector.TestClassLoaderMain - - -

1XMCURTHDINFO  Current thread
3XMTHREADINFO      "GC Worker" J9VMThread:0x0000000000217800, omrthread_t:0x0000724A742F3060, java/lang/Thread:0x00000000FF9AA018, state:R, prio=5

4XENATIVESTACK               raiseAssertion+0x14c (0x0000724A78B61B8C [libj9trc29.so+0x11b8c])
4XENATIVESTACK               doTracePoint+0x4a4 (0x0000724A78B660E4 [libj9trc29.so+0x160e4])
4XENATIVESTACK               javaTrace+0x4c (0x0000724A78B5C6DC [libj9trc29.so+0xc6dc])
4XENATIVESTACK               _ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x3bc (0x0000724A78870B6C [libj9gc29.so+0x180b6c])
4XENATIVESTACK               _Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x160 (0x0000724A7886FE60 [libj9gc29.so+0x17fe60])

pshipton commented 1 year ago

@dmitripivkine

dmitripivkine commented 1 year ago

             * In which case, _workerThreadsReservedForGC is set but the task runs single threaded and exiting threads are dying instead of reserved. */
            Assert_MM_true((worker_status_reserved == _statusTable[workerID]) || ((0 == _threadsToReserve) && (worker_status_dying == _statusTable[workerID])));

> !mm_paralleldispatcher 0x0000724A74069950
MM_ParallelDispatcher at 0x724a74069950 {
  Fields for MM_Base:
  Fields for MM_BaseVirtual:
    0x8: const U8* _typeId = !j9x 0x0000724A7898F8C8 // "MM_ParallelDispatcher"
  Fields for MM_ParallelDispatcher:
    0x10: class MM_Task* _task = !mm_concurrentglobalmarktask 0x0000724A5A28BE58
    0x18: class MM_GCExtensionsBase* _extensions = !mm_gcextensions 0x0000724A740557C0
    0x20: U64 _threadShutdownCount = 0x0000000000000001 (1)
    0x28: struct J9Thread** _threadTable = !j9x 0x0000724A74069A30
    0x30: U64* _statusTable = !j9x 0x0000724A74069AA0
    0x38: void** _taskTable = !j9x 0x0000724A74069B10
    0x40: struct J9ThreadMonitor* _workerThreadMutex = !j9threadmonitor 0x0000724A7401BBC8
    0x48: struct J9ThreadMonitor* _dispatcherMonitor = !j9threadmonitor 0x0000724A7401BC78
    0x50: struct J9ThreadMonitor* _synchronizeMutex = !j9threadmonitor 0x0000724A7401BD28
    0x58: bool _workerThreadsReservedForGC = true
    0x59: bool _inShutdown = true
    0x60: U64 _threadCountMaximum = 0x0000000000000004 (4)
    0x68: U64 _threadCount = 0x0000000000000001 (1)
    0x70: U64 _activeThreadCount = 0x0000000000000004 (4)
    0x78: U64 _threadsToReserve = 0x0000000000000003 (3) <-----
    0x80: void* _handler = !j9x 0x0000724A7925FD80
    0x88: void* _handler_arg = !j9x 0x0000724A74028400
    0x90: U64 _defaultOSStackSize = 0x0000000000080000 (524288)
}

0x10: U64 _workerID = 0x0000000000000002 (2) <--------------

_statusTable = !j9x 0x0000724A74069AA0
0x724A74069AA0 :  0000000000000003 0000000000000004 [ ................ ]
0x724A74069AB0 :  0000000000000004 0000000000000004 [ ................ ]

    enum {
        worker_status_inactive = 0, /* Must be 0 - set at initialization time by memset */
        worker_status_waiting,
        worker_status_reserved, <--- 2
        worker_status_active,
        worker_status_dying <--- 4
    };

_statusTable[workerID] is worker_status_dying (not worker_status_reserved) _threadsToReserve is 3 So, code expects _threadsToReserve to be 0 but it is not

@RSalman FYI

RSalman commented 1 year ago

This is similar to an issue encountered in the past with a race between thread shutdown and task dispatch, this has been extensively detailed https://github.com/eclipse-openj9/openj9/issues/10491#issuecomment-686057277. I've revisited the recent dispatcher changes and I can confirm that this isn't a regression, I believe its been around since Dispatcher changes back in 2020.

This failure exposes a timing hole where thread shutdown and task dispatch overlap and result in unexpected Dispatcher state, where task dispatch attempts to reserve the same set of threads that are being shutdown. The exact scenario is detailed in the fix: https://github.com/eclipse/omr/pull/6954.

The following holds true for this issue as well:

Reproducing this scenario is a challenge, to have GC during shutdown + ensuring that main thread acquires mutex before worker threads are all woken up and terminated. With 1000s of iterations on multiple platforms/machines + varying heap configuration, I wasn't able to reproduce this issue. Luckily the core dump/trace points is sufficient to figure out what's happening.

I don't think its worth the effort to reproduce this issue, its not trivial to do.

RSalman commented 1 year ago

I was able to reproduce the issue and verify the fix by

Using concurrent scavenger
Delaying shutdown by 10ms
- Gives opportunity for task dispatch
Delaying task dispatch during shutdown by 10ms
- Gives opportunity for Shutdown to continue
Running the following configuration with vich: java -Xgc:concurrentscavenge -Xmn4M -cp /salman.rana/vich.jar vich.runner.Menu 14

Reproducibility is ~ 1/10. Verified fix with ~1000 itrs

eclipse-openj9 / openj9

cmdLineTester_GCRegressionTests_2 assert #17155