RSalman closed this issue 11 months ago
@dmitripivkine @amicic @tajila fyi
It would be nice to know what state the thread that has the NULL pointer is in (native stack trace, or at least the line of code). But then, I know gdb will only give the stack trace of the asserting thread...
Would you please upload the system core somewhere, `/team` for example? Javacore and Snap traces might be useful too.
OK, never mind, please. I have uploaded the original core from your machine.
To reproduce, the code snippet above can be added to the end of MM_Configuration::reinitializeForRestore(MM_EnvironmentBase *env) and the test script can be run with the modified VM.
Alternatively, the patch can be taken from https://github.com/eclipse/omr/pull/7038
The problem is in allocateVMThread(). The sequence is:
@singh264 Please take a look at this
checkpointJVMImpl solves the issue; if not, we will need to do something else.

@tajila I reproduced the issue, and adding a System.gc() call before checkpointJVMImpl seems to solve the issue, as I do not observe the segmentation error after running the test script for 1000 iterations.
Okay, please prepare a PR with that change
An intermittent failure (segmentation fault) was found during development, which seems to have a larger underlying cause. The following code was added locally (invoked during JVM restore) to iterate over all the JVM threads and fetch their GC env:
This pattern of iterating threads is typical and found throughout the codebase. An iterated thread has always been assumed to have a valid GC environment (i.e., a NULL check has not been needed). However, during restore, this iteration may result in a crash, as a thread may return a NULL GC environment. It seems like a thread is in an inconsistent state. I can't tell if there is an actual concern to be addressed; I'm not aware that this type of thread has caused issues elsewhere, outside of this GC env iteration. There is a workaround of just doing a NULL check, but this issue may be worth investigating as it may cause other problems later on.
This issue was initially discovered while running the CRIU sanity tests on the farm. Reproducing it locally has been strange; the script below is used. 1000 iterations show no failure; however, a failure shows up when the test is interrupted at random points (SIGINT/Ctrl+C). Further investigation is needed to reconcile how it is produced locally vs. the failure on the farm.
If an assertion is added to verify the env: