SIGSEGV in PhaseIdealLoop::build_loop_late_post_work

tivervac commented 2 years ago

Summary

We run an Eclipse-based product obfuscated using ZKM. Running our tests on CI has been causing frequent SIGSEGVs.

Steps to reproduce

The error is "rare": usually one in 10 builds, though on some of our branches it's 3 in 4 and on others 1 in 20.

See our hs_err_pid, replay_pid and core dump (2.3 GB)

We're determined to help you help us. If there's anything more we can do, please let us know. We're trying to minimize this to a reproducible example, but that will take time, and definitely won't be easy due to the extreme flakiness of the failure.

Expected results

No crash

Actual results

Random SIGSEGVs, likely heavily influenced by code layout and timing.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff7086a270b, pid=27153, tid=27169
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (17.0.3+7, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xac870b]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x13b
#
# Core dump will be written. Default location: /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/core.27153
#
# An error report file with more information is saved as:
# /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/hs_err_pid27153.log
#
# Compiler replay data is saved as:
# /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/replay_pid27153.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

Triaging info

Java version:

OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)

What is your operating system and platform?

Amazon Linux release 2 (Karoo) on x86-64

How did you install Java?

Binary archive, tar.gz.

Did it work before?

We've been having this issue for months. Since it depends on timing and code layout, there are periods in which we have many failures, then weeks without any.

Did you test with other Java versions?

We've been having this since before Java 17. We haven't tried other VMs such as Graal or OpenJ9.

We've been faithfully upgrading to the latest Temurin version since Java 11: through 12, 13, 14, 15, and now 17 (we skipped 16).

jerboaa commented 2 years ago

Could be https://bugs.openjdk.org/browse/JDK-8283386

tivervac commented 2 years ago

We use neither Lucene nor JavaFX, but that is also the only link I found to PhaseIdealLoop::build_loop_late_post_work.

github-actions[bot] commented 2 years ago

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

tivervac commented 2 years ago

In the meantime, the linked bug has been split in two. The first appears irrelevant to us: we still encounter the issue with JustJ 17.0.4. The second is probably the one we need fixing.

karianna commented 2 years ago

17.0.4.1 should have https://bugs.openjdk.org/browse/JDK-8275610 fixed (the first)

tellison commented 1 year ago

Pinged about this at EclipseCon by @tivervac

karianna commented 1 year ago

Maintainers are commenting on the upstream issue, but it's still open for now - https://bugs.openjdk.org/browse/JDK-8285835.

uschindler commented 1 year ago

It looks like there is a PR now: https://github.com/openjdk/jdk/pull/10894

I analyzed it and figured out where the issue happens with our Lucene code. The test case is hard to understand but it seems to happen if you have a loop over some code dereferencing object instances through multiple layers (A wraps B wraps C).
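
For readers who want a picture of the loop shape described above (A wraps B wraps C), here is a minimal sketch. It is only an illustration of the pattern, not a reproducer for the crash and not the Lucene code in question; all class and method names are hypothetical.

```java
// Hypothetical illustration of "A wraps B wraps C"; NOT a reproducer for the crash.
final class WrappedLoopSketch {
    // Innermost layer: holds the actual data.
    static final class C {
        final int[] values;
        C(int[] values) { this.values = values; }
    }

    // Middle layer: wraps C.
    static final class B {
        final C inner;
        B(C inner) { this.inner = inner; }
    }

    // Outer layer: wraps B.
    static final class A {
        final B inner;
        A(B inner) { this.inner = inner; }
    }

    // Every iteration dereferences a.inner.inner.values, i.e. three layers deep.
    static long sum(A a) {
        long total = 0;
        for (int i = 0; i < a.inner.inner.values.length; i++) {
            total += a.inner.inner.values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        A a = new A(new B(new C(new int[] {1, 2, 3, 4})));
        long result = 0;
        // Call sum() many times so the JIT has a chance to compile it.
        // Whether and when C2 compiles it depends on the JVM and its settings.
        for (int i = 0; i < 20_000; i++) {
            result = sum(a);
        }
        System.out.println(result); // prints 10
    }
}
```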

The same issue seems to also affect Ben Manes' Caffeine library: https://github.com/ben-manes/caffeine/issues/797

tivervac commented 1 year ago

@uschindler That's quite possible. By now we've been able to (temporarily?) work around this issue by not obfuscating one of our classes. We obfuscate using ZKM. That obfuscator likes to split off code and wrap it in other classes.

uschindler commented 1 year ago

So you don't have the source code of this method?

C2: 25179 18398 4 com.sigasi.hdt.vhdl.effectanalysis.d::visitIdentifierPathElement (97 bytes)

Too bad. Maybe a disassembly of the bytecode?

tivervac commented 1 year ago

Sadly, I had this info at the time of writing, but not anymore at this point.

I'll see whether I can reproduce it again.

vans239 commented 1 year ago

Am I correct that the upstream bug should be resolved in 19.0.2+7? I think we hit the same problem using 19.0.2+7.

karianna commented 1 year ago

No, the fix version is Java 20.

vans239 commented 1 year ago

@karianna How can I check it? I thought that https://bugs.openjdk.org/browse/JDK-8297510 corresponds to the issue above and should be resolved in 19.0.2 build 7.

karianna commented 1 year ago

https://bugs.openjdk.org/browse/JDK-8285835

vans239 commented 1 year ago

The link shows that the issue was backported.

The issue is also listed in the release notes for 19.0.2+7: https://www.oracle.com/java/technologies/javase/19all-relnotes.html

karianna commented 1 year ago

Fair point, should be fixed then. If you have a new crash log I can post it.

karianna commented 1 year ago

@vans239 Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?

tivervac commented 1 year ago

Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7

vans239 commented 1 year ago

> Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?

We were still observing issues with 19.0.2+7. I am currently trying 20.0.1+9 and have not been able to reproduce it so far. I will have more info next week, when we roll it out fully to prod.

karianna commented 1 year ago

> Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7

I would try 17.0.8, just in case.

karniemi commented 1 year ago

Maybe related: https://github.com/openjdk/jdk/pull/15399 ... "SIGSEGV in PhaseIdealLoop::build_loop_late_post_work" ... but for a different reason. Backporting has not even been discussed on that one yet.

tivervac commented 1 year ago

Definitely possible, thanks for the link!

lhotari commented 11 months ago

Here are 2 crashes from Apache Pulsar, with full hs_err_pid*.log files: https://gist.github.com/lhotari/53b72683ad4f339dfbcfd8b9b97062b9

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f927e8d5113, pid=3924, tid=4012
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.8.1+1 (17.0.8.1+1) (build 17.0.8.1+1)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (17.0.8.1+1, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xad5113]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3
#

This happens in 17.0.8.1. The Apache Pulsar issue is https://github.com/apache/pulsar/issues/19307. Any help is appreciated.
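
When comparing crash reports like the two headers in this thread, it can help to pull out the "Problematic frame" symbol mechanically. The following is a rough sketch that assumes the header layout shown in the logs above; the class and method names are hypothetical. Note that a matching symbol only says the crash surfaced in the same function, not that the root cause is the same.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the hs_err header layout seen in this thread.
final class ProblematicFrame {
    // Matches header lines such as:
    // # V  [libjvm.so+0xac870b]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x13b
    // Group 1 is the library name, group 2 is the symbol without its offset.
    private static final Pattern FRAME = Pattern.compile(
        "^#\\s+\\S\\s+\\[(\\S+?)\\+0x[0-9a-fA-F]+\\]\\s+(.+?)\\+0x[0-9a-fA-F]+\\s*$");

    // Returns the symbol on the line that follows "# Problematic frame:", if it parses.
    static Optional<String> symbol(List<String> headerLines) {
        boolean afterMarker = false;
        for (String line : headerLines) {
            if (line.startsWith("# Problematic frame:")) {
                afterMarker = true;
                continue;
            }
            if (afterMarker) {
                Matcher m = FRAME.matcher(line);
                return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Header lines copied from the two crash reports in this thread.
        List<String> first = List.of(
            "# Problematic frame:",
            "# V  [libjvm.so+0xac870b]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x13b");
        List<String> second = List.of(
            "# Problematic frame:",
            "# V  [libjvm.so+0xad5113]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3");

        Optional<String> a = symbol(first);
        Optional<String> b = symbol(second);
        System.out.println(a.orElse("<unparsed>"));
        System.out.println(b.orElse("<unparsed>"));
        System.out.println("same symbol: " + (a.isPresent() && a.equals(b)));
    }
}
```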

lhotari commented 11 months ago

Looks like the previously posted GH issue links to https://bugs.openjdk.org/browse/JDK-8314024, which will be backported to 17.0.10.
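
To check whether a given 17.x build is at or past the release named above, the standard `Runtime.Version` API can compare version strings. This is a small sketch that takes the 17.0.10 target from this comment as an assumption; the class and method names are hypothetical. The comparison is only meaningful within the 17 update line, since a newer feature release would compare as greater regardless of whether it carries the fix.

```java
// Hypothetical helper; the 17.0.10 threshold is an assumption taken from the comment above.
final class FixVersionCheck {
    static final Runtime.Version FIRST_17U_WITH_BACKPORT = Runtime.Version.parse("17.0.10");

    // True if the given version string is 17.0.10 or later.
    // Only meaningful for 17.x version strings.
    static boolean isAtLeastFixVersion(String versionString) {
        Runtime.Version v = Runtime.Version.parse(versionString);
        return v.compareToIgnoreOptional(FIRST_17U_WITH_BACKPORT) >= 0;
    }

    public static void main(String[] args) {
        // Versions reported in this thread.
        System.out.println("17.0.3+7   -> " + isAtLeastFixVersion("17.0.3+7"));
        System.out.println("17.0.8.1+1 -> " + isAtLeastFixVersion("17.0.8.1+1"));
        // The running JVM, for reference.
        System.out.println("current JVM: " + Runtime.version());
    }
}
```

Both versions from the thread compare as older than 17.0.10, which is consistent with the crashes still being reported on them.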
