Open tivervac opened 2 years ago
We use neither Lucene nor JavaFX, but that is the only link I found that also mentions PhaseIdealLoop::build_loop_late_post_work.
We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.
In the meantime, the linked bug has been split in two. The first appears irrelevant to us; we still encounter the issue with JustJ 17.0.4. The second is probably the one we need fixed.
17.0.4.1 should have https://bugs.openjdk.org/browse/JDK-8275610 fixed (the first)
Pinged about this at EclipseCon by @tivervac
Maintainers are commenting on the upstream issue, but it's still open for now - https://bugs.openjdk.org/browse/JDK-8285835.
It looks like there is a PR now: https://github.com/openjdk/jdk/pull/10894
I analyzed it and figured out where the issue happens with our Lucene code. The test case is hard to understand but it seems to happen if you have a loop over some code dereferencing object instances through multiple layers (A wraps B wraps C).
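To make the "A wraps B wraps C" shape concrete, here is a minimal sketch of what such code could look like. All class and method names are hypothetical, and this is only an illustration of the pattern described above: it is not the upstream test case and is not known to trigger the crash.

```java
// Illustration only: hypothetical names, not a known reproducer.
final class WrapperChainSketch {
    // C holds the actual data.
    static final class C {
        final int[] values;
        C(int[] values) { this.values = values; }
    }

    // B wraps C.
    static final class B {
        final C inner;
        B(C inner) { this.inner = inner; }
    }

    // A wraps B.
    static final class A {
        final B inner;
        A(B inner) { this.inner = inner; }
    }

    // Every iteration dereferences a -> b -> c -> values inside the loop.
    static long sum(A a) {
        long total = 0;
        for (int i = 0; i < a.inner.inner.values.length; i++) {
            total += a.inner.inner.values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        A a = new A(new B(new C(new int[] {1, 2, 3, 4})));
        long result = 0;
        // Call the method repeatedly so the JIT has a reason to compile it;
        // the actual compilation thresholds depend on the JVM and its flags.
        for (int round = 0; round < 20_000; round++) {
            result = sum(a);
        }
        System.out.println(result);
    }
}
```

An obfuscator that splits code off into extra wrapper classes would add more layers of this kind, which may be why the obfuscated build is affected.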
The same issue seems to also affect Ben Manes' Caffeine library: https://github.com/ben-manes/caffeine/issues/797
@uschindler That's quite possible. By now we've been able to (temporarily?) work around this issue by not obfuscating one of our classes. We obfuscate using ZKM. That obfuscator likes to split off code and wrap it in other classes.
So you don't have the source code of C2: 25179 18398 4 com.sigasi.hdt.vhdl.effectanalysis.d::visitIdentifierPathElement (97 bytes)? Too bad. Maybe a disassembly of the bytecode?
Sadly, I had this info at the time of writing, but not anymore at this point.
I'll see whether I can reproduce it again
Am I correct that the upstream bug should be resolved in 19.0.2+7? I think we hit the same problem using 19.0.2+7.
No, the fix version is Java 20.
@karianna How can I check it? I thought that https://bugs.openjdk.org/browse/JDK-8297510 corresponds to the issue above and should be resolved in 19.0.2 build 7
The link shows that the issue was backported
The issue is also shown in release notes for 19.0.2+7 https://www.oracle.com/java/technologies/javase/19all-relnotes.html
Fair point, should be fixed then. If you have a new crash log I can post it.
@vans239 Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?
Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7
> Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?
We were still observing issues with 19.0.2+7. I am currently trying 20.0.1+9 and have not been able to reproduce it so far. I will have more info next week when we roll it out fully to prod.
> Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7
I would try 17.0.8 JIC.
Maybe related: https://github.com/openjdk/jdk/pull/15399 ... "SIGSEGV in PhaseIdealLoop::build_loop_late_post_work" ... but for a different reason. Backporting has not even been discussed on that one yet.
Definitely possible, thanks for the link!
Here are 2 crashes from Apache Pulsar, full hs_err_pid*.log files: https://gist.github.com/lhotari/53b72683ad4f339dfbcfd8b9b97062b9 .
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f927e8d5113, pid=3924, tid=4012
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.8.1+1 (17.0.8.1+1) (build 17.0.8.1+1)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (17.0.8.1+1, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xad5113] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3
#
This happens in 17.0.8.1. The Apache Pulsar issue is https://github.com/apache/pulsar/issues/19307 . Any help is appreciated.
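For anyone sorting through many hs_err files from CI, the following is a rough sketch of how the "Problematic frame" line could be pulled out of a log header like the one above and checked for this particular frame. The class and method names are hypothetical, and the parsing assumes the header layout shown in this thread (a `# Problematic frame:` line followed by the frame on the next line); other JVM versions may format the header differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, assuming the hs_err header layout quoted in this thread.
final class HsErrFrame {
    // Returns the line that follows "# Problematic frame:", without the leading "#".
    static Optional<String> problematicFrame(List<String> lines) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (lines.get(i).trim().equals("# Problematic frame:")) {
                String next = lines.get(i + 1).trim();
                return Optional.of(next.startsWith("#") ? next.substring(1).trim() : next);
            }
        }
        return Optional.empty();
    }

    // True if the crash happened in the C2 function discussed in this issue.
    static boolean isLoopOptCrash(List<String> lines) {
        return problematicFrame(lines)
                .map(frame -> frame.contains("PhaseIdealLoop::build_loop_late_post_work"))
                .orElse(false);
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read that hs_err file; otherwise use the header from this thread.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Path.of(args[0]))
                : List.of(
                        "# SIGSEGV (0xb) at pc=0x00007f927e8d5113, pid=3924, tid=4012",
                        "# Problematic frame:",
                        "# V [libjvm.so+0xad5113] PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3");
        System.out.println(problematicFrame(lines).orElse("(no problematic frame found)"));
        System.out.println("loop-opts crash: " + isLoopOptCrash(lines));
    }
}
```

A matching frame only says the crash is in the same C2 function; as noted earlier in the thread, different upstream bugs can crash there for different reasons.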
Looks like the previously posted GH issue links to https://bugs.openjdk.org/browse/JDK-8314024, which will be backported to 17.0.10.
Summary
We run an Eclipse-based product obfuscated using ZKM. Running our tests on CI has been causing frequent SIGSEGVs.
Steps to reproduce
The error is "rare" (usually one in 10 builds; on some of our branches it's 3/4, on others 1/20).
See our hs_err_pid, replay_pid and core dump (2.3 GB)
We're determined to help you help us. If there's anything more we can do, please let us know. We're trying to minimize this to a reproducible example, but that will take time, and definitely won't be easy due to the extreme flakiness of the failure.
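Because the failure is this flaky, a few green builds after a JDK upgrade say little about whether the crash is gone. As a rough sketch, the number of consecutive clean builds needed before a streak is unlikely to be luck can be estimated from the failure rates quoted above. This assumes builds fail independently with a fixed probability, which is a simplification given the clustering of failures we see, and the 5% threshold is an arbitrary choice for illustration. The class name is hypothetical.

```java
// Sketch under stated assumptions: independent builds, fixed failure rate.
final class CleanBuildsSketch {
    // Smallest n such that (1 - failureRate)^n <= alpha, i.e. a streak of n
    // clean builds would occur by chance with probability at most alpha.
    static int buildsNeeded(double failureRate, double alpha) {
        if (!(failureRate > 0 && failureRate < 1) || !(alpha > 0 && alpha < 1)) {
            throw new IllegalArgumentException("failureRate and alpha must be in (0, 1)");
        }
        return (int) Math.ceil(Math.log(alpha) / Math.log(1.0 - failureRate));
    }

    public static void main(String[] args) {
        double alpha = 0.05; // assumed threshold, not from the issue
        double[] rates = {3.0 / 4, 1.0 / 10, 1.0 / 20}; // rates quoted above
        for (double rate : rates) {
            System.out.printf("failure rate %.2f -> %d clean builds%n",
                    rate, buildsNeeded(rate, alpha));
        }
    }
}
```

Under these assumptions, the 3/4 branches need 3 clean builds, the usual 1/10 rate needs 29, and the 1/20 branches need 59.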
Expected results
No crash
Actual results
Random SIGSEGVs, likely heavily influenced by code layout and timing.
Triaging info
Java version:
OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)
What is your operating system and platform?
Amazon Linux release 2 (Karoo) on x86-64
How did you install Java?
Binary archive, tar.gz.
Did it work before?
We've been having this issue for months. Since it depends on timing and code layout, there are periods in which we have many failures, then weeks without any.
Did you test with other Java versions?
We've been having this since before Java 17. We haven't tried other VMs such as Graal or OpenJ9.
We've been faithfully upgrading to the latest Temurin version since Java 11, through 12, 13, 14, 15 and now 17 (we skipped 16).