Open ninja- opened 2 years ago
@tajila
To get full debug symbols, there is a debug image that can be overlaid. The bin and lib directories in the debug image need to be recursively (including sub-directories) copied to the bin and lib directories of the JDK.
gdb is going to do a better job of printing a stack trace: you can overlay the debug image, open the core in gdb, and use the where command to get the stack trace.
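The overlay and gdb steps described above could be sketched roughly as follows. This is a minimal sketch, not an official procedure: the function names and every path passed to them are hypothetical, and it assumes the debug image has already been unpacked next to the JDK.

```shell
#!/usr/bin/env bash
# Sketch: overlay a debug image onto a JDK, then print a stack trace with gdb.
# Function names and all paths passed in are hypothetical placeholders.

overlay_debug_image() {
    local debug_image="$1"
    local jdk_home="$2"
    local dir
    for dir in bin lib; do
        # Skip a directory the debug image does not contain.
        [ -d "$debug_image/$dir" ] || continue
        mkdir -p "$jdk_home/$dir"
        # "src/." copies the directory contents, sub-directories included,
        # into the existing destination instead of nesting bin/bin.
        cp -R "$debug_image/$dir/." "$jdk_home/$dir/"
    done
}

print_core_stack() {
    local jdk_home="$1"
    local core_file="$2"
    # -batch runs the -ex commands and exits; "where" prints the backtrace.
    gdb -batch -ex where "$jdk_home/bin/java" "$core_file"
}
```

Usage would look like `overlay_debug_image /path/to/debug-image /path/to/jdk` followed by `print_core_stack /path/to/jdk /path/to/core`, with the placeholder paths replaced by real ones.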
Can the core file or test case to recreate the problem be shared?
@ninja- do you have a sample application that we can use to try to reproduce the problem?
Hi
Sorry, the classpath is pretty complex and the main jar that triggers the bug is a 3rd party obfuscated component, though without any natives inside to my knowledge.
Sadly, the core dump was not captured because the app is running in a Docker container, so it's gone after the crash. I plan to update the tooling to capture core dumps and continue investigating this next week.
In the meantime, maybe it's worth double checking everything in this func around JAVA_SPEC_VERSION >= 16? Could this commit on master have fixed this issue silently? https://github.com/eclipse-openj9/openj9/commit/8635b0422fc8513dcf2e51390027cd8a4ee00cd5
@ninja- are you able to provide a core file? Unless we have more info, we won't be able to address this for the next release
We can't diagnose this problem with the available information, moving it forward.
I appear to have a new instance of this in JRuby's OpenJ9 CI run here: https://github.com/jruby/jruby/actions/runs/11525329378/job/32088572723?pr=8386
This just started crashing with a segmentation error in the past few days. I'm not sure what changed in our environment (the PR was just adding non-OpenJ9 JDK 23 jobs to our build matrix).
I have captured the dump files from that crash and uploaded them here:
https://drive.google.com/file/d/1OrsBTG5w6HRsQmPi-6xmVRYIusz186ty/view?usp=sharing
Please let me know when you have downloaded this file so I can delete it.
I will attempt to switch to a different build (using the IBM Semeru dist) for now.
Switching to the "semeru" distribution from setup-java does not appear to have helped, so I'll be removing OpenJ9 from our CI for now.
@headius Thanks for the diagnostics. If possible, could you please reproduce it and run jpackcore? This will give us more info to address the issue.
In the meantime we will take a look at what you've sent us.
@babsingh Please take a look. You can use nativedecoder to find the line numbers in the stack trace.
> Please let me know when you have downloaded this file so I can delete it.
@headius, I have also uploaded your zip file here:
If anyone else needs access to it, but cannot open that link, please let me know.
@tajila I will give it a shot.
Based on this output from the crash, it would seem it is not generating the right core file, am I right?
If so, how do I make it successfully dump core?
If not, which of these files should I run jpackcore on?
JVMDUMP039I Processing dump event "gpf", detail "" at 2024/10/26 13:50:53 - please wait.
JVMDUMP032I JVM requested System dump using '/home/runner/work/jruby/jruby/core.20241026.135053.2237.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program. Attempting to rename either core or core.2264. Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.
JVMPORT049I The core file created by child process with pid = 2264 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.
JVMDUMP012E Error in System dump: /home/runner/work/jruby/jruby/core.20241026.135053.2237.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/runner/work/jruby/jruby/javacore.20241026.135053.2237.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/runner/work/jruby/jruby/javacore.20241026.135053.2237.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/runner/work/jruby/jruby/Snap.20241026.135053.2237.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/runner/work/jruby/jruby/Snap.20241026.135053.2237.0003.trc
JVMDUMP032I JVM requested JIT dump using '/home/runner/work/jruby/jruby/jitdump.20241026.135053.2237.0004.dmp' in response to an event
JVMDUMP051I JIT dump occurred in 'main' thread 0x0000000000017000
JVMDUMP053I JIT dump is recompiling java/lang/invoke/MemberName$Factory.resolve(BLjava/lang/invoke/MemberName;Ljava/lang/Class;IZ)Ljava/lang/invoke/MemberName;
JVMDUMP053I JIT dump is recompiling java/lang/invoke/MemberName$Factory.resolveOrFail(BLjava/lang/invoke/MemberName;Ljava/lang/Class;ILjava/lang/Class;)Ljava/lang/invoke/MemberName;
JVMDUMP010I JIT dump written to /home/runner/work/jruby/jruby/jitdump.20241026.135053.2237.0004.dmp
JVMDUMP013I Processed dump event "gpf", detail "".
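One way to read a log like the one above is by message ID: JVMDUMP010I marks a dump that was written and JVMDUMP012E one that failed. In this log the System dump (the core file that jpackcore works on) is reported as failed, while the Java, Snap and JIT dumps were written. A minimal sketch that pulls this out of a saved log follows; the function names are hypothetical and the patterns only assume the two message shapes visible in the log.

```shell
#!/usr/bin/env bash
# Sketch: summarise which dumps a JVMDUMP log reports as written or failed.
# Both functions read the log on stdin; their names are hypothetical.

list_written_dumps() {
    # Matches: JVMDUMP010I <Type> dump written to <path>
    sed -n 's/^JVMDUMP010I \(.*\) dump written to \(.*\)$/\1: \2/p'
}

list_failed_dumps() {
    # Matches: JVMDUMP012E Error in <Type> dump: <path>
    sed -n 's/^JVMDUMP012E Error in \(.*\) dump: \(.*\)$/\1: \2/p'
}
```

For example, `list_failed_dumps < crash.log` (with `crash.log` being a hypothetical saved copy of the console output) prints one `Type: path` line per failed dump.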
@hzongaro Thank you! I will delete my copy of those files and let y'all know if I can upload a packed core thingy.
re https://github.com/eclipse-openj9/openj9/issues/14191#issuecomment-2442054525:
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program. Attempting to rename either core or core.2264. Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it. JVMPORT049I The core file created by child process with pid = 2264 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.
The below instructions will allow the machine to correctly produce core files.
echo "core.%p" > /proc/sys/kernel/core_pattern
echo "kernel.core_pattern=core.%p" >> /etc/sysctl.conf
# systemctl stop apport
# systemctl status apport # confirm it's inactive
# systemctl disable apport
# systemctl is-enabled apport # confirm it's disabled
disabled
ii. Prevent Apport from being upgraded (which re-enables it)
# apt-mark hold apport
# apt-mark showhold # confirm it's marked
apport
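Before and after applying the steps above, it can help to confirm whether the kernel is still piping cores to an external program. On Linux, a core_pattern whose first character is `|` is handed to a program (Apport in the warning quoted earlier) instead of being written as a file. A minimal sketch of that check, with a hypothetical function name and an optional file argument that exists only so the check can be tried against a test file:

```shell
#!/usr/bin/env bash
# Sketch: report whether core dumps are piped to an external program.
# The function name is hypothetical; the optional argument overrides the
# file to inspect and exists only to make the sketch easy to try out.

core_pattern_is_piped() {
    local pattern_file="${1:-/proc/sys/kernel/core_pattern}"
    local pattern=""
    # Read the first line; an unreadable file leaves the pattern empty.
    IFS= read -r pattern < "$pattern_file"
    case "$pattern" in
        '|'*) return 0 ;;  # leading pipe: cores go to a helper program
        *)    return 1 ;;  # anything else: cores are written as files
    esac
}
```

A call such as `core_pattern_is_piped && echo "cores are piped to a helper"` prints the message only while a pipe pattern is still active.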
Even with sudo, it seems that the core_pattern file can't be written to on this system (a standard GitHub Actions runner on Ubuntu):
/home/runner/work/_temp/477d41c2-bd5b-45a8-ab3b-b29ab49cac66.sh: line 1: /proc/sys/kernel/core_pattern: Permission denied
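One common cause of this exact message, assuming the step ran something like `sudo echo ... > /proc/sys/kernel/core_pattern`, is that the redirection is opened by the unprivileged shell before sudo starts, so only echo runs as root. The thread does not show the actual command, so this is a guess. A minimal sketch of the usual workaround is to let a privileged tee open the file. The function name and the SUDO override are hypothetical, and whether the runner permits the write at all is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch: write a core pattern with the file opened by a privileged process.
# The function name is hypothetical. SUDO may be set to an empty string to
# skip sudo, and the second argument overrides the target file; both hooks
# exist only so the sketch can be tried without root.

set_core_pattern() {
    local pattern="$1"
    local target="${2:-/proc/sys/kernel/core_pattern}"
    # tee opens the target itself, so it is tee, not the calling shell,
    # that needs the privilege. SUDO is left unquoted on purpose so an
    # empty value expands to nothing.
    printf '%s\n' "$pattern" | ${SUDO-sudo} tee "$target" > /dev/null
}
```

Calling `set_core_pattern 'core.%p'` would then attempt the same change as the echo line in the instructions. `sudo sysctl -w kernel.core_pattern=core.%p` is another way to make the same write.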
FWIW it should be possible for you to take a branch from this JRuby commit and reproduce the failure based on how it runs in CI. I do not have a local reproduction, so my only way to reproduce is on GHA:
https://github.com/jruby/jruby/pull/8386/commits/d64634e9d5734f811fb2ae6d4a42f1f5c72b1676
Here is a javacore, which has line numbers for the thread's native stack: nstack.20241028083039.javacore.20241026.134413.2239.0002.txt
The crash happens here:
4XENATIVESTACK Java_java_lang_invoke_MethodHandleNatives_resolve+0x8eb (0x00007FAE71B5D9FB [libjclse29.so+0x539fb]) : 0x539fb <Java_java_lang_invoke_MethodHandleNatives_resolve(JNIEnv*, jclass, jobject, jclass, jint, jboolean)+2283> [/home/jenkins/workspace/build-scripts/jobs/jdk17u/jdk17u-linux-x64-openj9/workspace/build/src/openj9/runtime/jcl/common/java_lang_invoke_MethodHandleNatives.cpp:1113]
The associated code: https://github.com/eclipse-openj9/openj9/blob/4760d5d3202149479ba7beca189fcbdbe2e0e79b/runtime/jcl/common/java_lang_invoke_MethodHandleNatives.cpp#L1113
verifyData (J9JavaVM->bytecodeVerificationData) is NULL. We will need to add a null check to fix this.
In the GitHub CI job, I also see the following message: JVMJ9VM193W Since Java 13 -Xverify:none and -noverify were deprecated for removal and may not be accepted options in the future. @headius Can we run the job without -Xverify:none? -Xverify:none is probably setting bytecodeVerificationData to NULL.
@babsingh Can do!
Perhaps unsurprisingly, it no longer crashes with -Xverify:none removed from the command-line flags.
@headius Thanks for confirming. https://github.com/eclipse-openj9/openj9/pull/20422 should fix the segfault that occurs with -Xverify:none. I will get this PR reviewed and merged.
To verify that the issue is resolved, I tried to run the CI commands locally with JDK17 with the fix. My local environment is not correctly set up for the CI commands, so I am encountering unrelated errors.
@headius Can you try JDK17 with the fix on your end to see if the issue is resolved?
@headius Were you able to try the above JDK?
Oops, sorry about that, I forgot about this one. I'll verify over the weekend and let you know!
After upgrading from AdoptOpenJDK OpenJ9 16 to the latest Semeru build, my app is crashing a few minutes after start with this crash:
If there is some procedure to easily get debug symbols for Semeru, please let me know. I remember we discussed including them by default here on GitHub a few months ago, but it looks like nothing has changed on that.