corretto / corretto-17

Amazon Corretto 17 is a no-cost, multi-platform, production-ready distribution of OpenJDK 17
GNU General Public License v2.0

JVM crash with SIGSEGV #57

Open aablsk opened 2 years ago

aablsk commented 2 years ago

Describe the bug

What: After updating to amazoncorretto:17, we've seen irregular JVM crashes for a workload, with the log below. The crash usually happens within the first 5 minutes after starting the workload. Up until the crash, the workload works as expected.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.2.8.1 (17.0.2+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# An error report file with more information is saved as:
# //hs_err_pid1.log
#
# Compiler replay data is saved as:
# //replay_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fd7072cb23b]

How often: twice, with a period of 7 days in between
Where: workload runs as an ECS Fargate task
Dumps: none, as the dumps were so far only written to ephemeral storage (if that worked as expected)

To Reproduce

No reliable reproduction as this happens very rarely.

Expected behavior

JVM does not crash. When JVM crashes, it is able to report the error correctly.

Platform information

OS: Amazon Linux 2
Version: Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS) (see log above)
Base-image: public.ecr.aws/amazoncorretto/amazoncorretto:17

For VM crashes, please attach the error report file. By default the file name is hs_err_pid<pid>.log, where <pid> is the process ID of the process. --> Unfortunately not available currently, as this has only been written to the ephemeral storage of the Fargate task container.

Thank you for considering this report! If there is additional information I can provide to help with resolving this, please do not hesitate to reach out!

earthling-amzn commented 2 years ago

This will be tough to troubleshoot without a core file or an hs_err log. Is there no way to get these artifacts from the Fargate container, perhaps using ECS Exec? Could the container mount some durable storage, perhaps using EFS with ECS?

Are you able to set the command line flags for the java process? Could you try running with -XX:+ErrorFileToStdout?
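For readers in the same situation, a minimal sketch of wiring this up in a container entrypoint (the jar name and script shape are placeholders, not from this issue):

```shell
#!/bin/sh
# Sketch of a container entrypoint; "app.jar" is a placeholder.
# -XX:+ErrorFileToStdout redirects the hs_err crash report to stdout,
# so the log driver captures it even when container storage is ephemeral.
JVM_CRASH_FLAGS="-XX:+ErrorFileToStdout"
LAUNCH_CMD="java $JVM_CRASH_FLAGS -jar app.jar"
# A real entrypoint would run: exec $LAUNCH_CMD
echo "$LAUNCH_CMD"
```

With this in place, the crash report would land in the task's log stream alongside normal application output.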

aablsk commented 2 years ago

Thanks for the quick reply, @earthling-amzn!

I'll set up tooling to be prepared for the next crash and report back. Due to the irregularity of the crashes it might take a few days until I have more data. Thank you for your patience and understanding!

aablsk commented 2 years ago

@earthling-amzn Good news! We've been able to observe another crash and your proposed option with -XX:+ErrorFileToStdout resulted in an error log (see below). Please note that I have removed some information and marked it with {REDACTED}.

With my limited understanding, it seems to be related to our use of Kotlin Co-Routine Flows, specifically the collect() method (at least this instance of the issue)?

Please do not hesitate to reach out, if I can support the process!

Thank you for your time and effort!

error_log_jvm_crash_corretto_17.log

earthling-amzn commented 2 years ago

Thank you for sharing the crash log. To me, it looks like an issue with C2. I'm not very familiar with Kotlin Co-Routine Flows, so it would be helpful if you had a small bit of code to reproduce the crash. Do you know of any public projects that might use Flows? I could look there for benchmarks or tests to reproduce the crash.

earthling-amzn commented 2 years ago

It would be helpful to have the replay log from the compiler; could you have the JVM write that file out to persistent storage with -XX:ReplayDataFile=<path>? Are you able to exercise this code outside of a container? If we gave you a fastdebug build of the JVM (i.e., one with assertions enabled), would you be able to run that in your container?
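A sketch of pointing the replay file at durable storage (the /mnt/efs mount path is an assumption for illustration; any volume that outlives the task would do):

```shell
#!/bin/sh
# Hypothetical: an EFS volume mounted at /mnt/efs via the ECS task definition.
# %p expands to the JVM's process id, so successive crashes don't overwrite
# each other's replay data.
REPLAY_FLAG="-XX:ReplayDataFile=/mnt/efs/replay_pid%p.log"
echo "java $REPLAY_FLAG -jar app.jar"
```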

The DataDog agent also does a fair amount of bytecode instrumentation, which could also confuse the JIT compiler. You might want to explore options there to disable instrumentation.

aablsk commented 2 years ago

@earthling-amzn thanks again for the quick response!

Reproduction Unfortunately we still have not found a reliable way to reproduce the issue, which makes it very hard to build a limited-scope reproduction code example. We have not been able to reproduce the issue locally either, which might be bad luck or some difference in environment (OSX + ARM locally vs. Linux + x64 in our deployments). As soon as we find a reliable way to reproduce, I will build a minimal reproduction example and share it with you.

Public projects I'm not aware of any projects whose usage is akin to ours, which in this case consists of spring-reactor + kotlin co-routines. I'll do some research on this topic over the weekend and share my findings.

Compiler replay log We've added the requested flag and are waiting for another occurrence of the issue. I will report back as soon as I have more data.

Exercise code outside of a container Yes, we're able to do this, but as mentioned before, we have not been able to reproduce the crash outside of a container running in Fargate.

Fastdebug build We should be able to run with a fastdebug build of the JVM in our staging environment, if you could provide it either as an AL2 docker image that we can build upon (preferred, as it is closer to our usage) or as binaries upon which we could build our own AL2+Corretto base image.

DataDog agent I will have a look at this, thanks for the advice!

Thanks again for your hard work and support on this issue!

simonis commented 2 years ago

I just want to clarify that we don't want to blame the DataDog agent for the crash. It's just that, through instrumentation, the agent might create unusual bytecode patterns which the JIT compiler might not be prepared for. Excluding (or not) the DD agent as a cause of this crash might help to isolate the problem and potentially create a reproducer.

Thanks for your support, Volker

aablsk commented 2 years ago

Thanks for the clarification, Volker!

I'd like to ensure that I'm able to provide individual data for each change I'm making. Since the crashes are highly infrequent, it will probably take some time until I've been able to gather data on the different scenarios.

Scenario 1 (currently waiting for crash): no changes, capture compiler log
Scenario 2: exclude DataDog agent
Scenario 3: include fastdebug JVM build(?)

earthling-amzn commented 2 years ago

Here is a link to download a fastdebug build. The link will expire in 7 days (Feb 28th, 2022). Please note that although the fastdebug build is an optimized build, it has asserts enabled, so it will run somewhat slower than the release build. The hope is that an assert will catch the condition leading to the crash before the crash itself, and terminate the VM with a helpful message.

aablsk commented 2 years ago

@earthling-amzn Thank you for providing the fastdebug build! Unfortunately I get a ExpiredToken error when trying to access the link. Could you please re-generate the link?

Thanks in advance!

earthling-amzn commented 2 years ago

Sorry about that. Try this one.

earthling-amzn commented 2 years ago

Have you seen this crash in earlier versions of the JDK?

aablsk commented 2 years ago

Thank you, the second link worked. I'll probably set it up tomorrow (due to meetings today), and a teammate of mine will be in touch soon.

We've only seen this issue on JDK 17; we recently upgraded from Corretto 11 to Corretto 17. We've also only seen it happen in this specific service, although the setup for our services is pretty similar (Spring Boot + Kotlin + DataDog Agent on ECS).

aablsk commented 2 years ago

Unfortunately we've not been able to capture the compiler replay log with -XX:ReplayDataFile=, as the process seems to be terminated before the file can be written.

We've integrated the fastdebug build in one of our environments and will report back with more information on the next occurrence of the issue.

Please note that a colleague will continue the communication with you as I will be leaving the team. Thank you for your understanding!

fknrio commented 2 years ago

@earthling-amzn It's been a while, but we had to try out a few things... We excluded the Datadog agent and let the service run for a while with the fastdebug build. Now we have reproduced the crash once with your provided fastdebug build.

Find the log file here (some information has been anonymized): jvm-crash-2022-04-11.log

You're probably mainly interested in the following:

#  Internal Error (/home/jenkins/node/workspace/Corretto17/generic_linux/x64/build/Corretto17Src/installers/linux/universal/tar/corretto-build/buildRoot/src/hotspot/share/c1/c1_Instruction.cpp:848), pid=1, tid=22
#  assert(existing_value == new_state->local_at(index) || (existing_value->as_Phi() != __null && existing_value->as_Phi()->as_Phi()->block() == this)) failed: phi function required

Hope this helps! Let me know in case of further questions, as I'll take up the communication from @aablsk.

earthling-amzn commented 2 years ago

That's very interesting and helps narrow the search. I don't suppose you have the compilation replay file given by -XX:ReplayDataFile=./replay_pid1.log?

earthling-amzn commented 2 years ago

This crash sure looks like: Hotspot C1 compiler crashes on Kotlin suspend fun with loop, which is patched in the 17.0.3 release. 17.0.3 is scheduled for release on April 19th, 2022.

This is all good news, but I'm a little concerned that the original crash in this issue was in C2. You might want to disable tiered compilation with -XX:-TieredCompilation. This effectively disables the C1 compiler (where this latest crash occurred) and has all code compiled by C2 (where the crash in the original report occurred). Maybe just disable tiered compilation where you are running the fastdebug build?
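The suggestion above amounts to a one-flag change; a sketch (the jar name is a placeholder):

```shell
#!/bin/sh
# -XX:-TieredCompilation disables the tiered pipeline: C1 is skipped and hot
# methods compile directly with C2, so any further crash clearly implicates
# C2 rather than C1.
TIER_FLAG="-XX:-TieredCompilation"
echo "java $TIER_FLAG -jar app.jar"
```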

fknrio commented 2 years ago

Thanks for the hint. And sorry, no I don't have the replay file.

I disabled tiered compilation when running the fastdebug build and will monitor if the crash occurs again.

fknrio commented 2 years ago

Since running the fastdebug build with tiered compilation disabled and without the Datadog agent, the crash has not occurred again on our development system. The system is not under high load, though.

However, with JRE 17.0.3, the JVM still crashes in production:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.3.6.1 (17.0.3+6) (build 17.0.3+6-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1 (17.0.3+6-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
---------------  S U M M A R Y ------------
Command Line: -XX:MaxRAMPercentage=70 -XX:+ErrorFileToStdout -XX:ReplayDataFile=./replay_pid1.log -javaagent:./dd-java-agent.jar cloud.rio.marketplace.productactivation.ProductActivationApplicationKt
Host: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz, 2 cores, 1G, Amazon Linux release 2 (Karoo)
Time: Thu Apr 28 07:50:45 2022 UTC elapsed time: 79.554398 seconds (0d 0h 1m 19s)
---------------  T H R E A D  ---------------
Current thread (0x00007f79c806db50):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=14, stack(0x00007f799beff000,0x00007f799c000000)]
Current CompileTask:
C2:  79554 21820   !   4       kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)
Stack: [0x00007f799beff000,0x00007f799c000000],  sp=0x00007f799bffba68,  free space=1010k
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
...
earthling-amzn commented 2 years ago

Do you have the rest of that crash report? The replay file would also be very helpful to root cause the issue.

fknrio commented 2 years ago

Find the full crash report here.

I don't have the replay file unfortunately, because the service is running on AWS Fargate without a persistent volume.

navyxliu commented 2 years ago

This is the same error as in your original report: C2 fails to compile 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)'.

With -XX:ReplayDataFile=./replay_pid1.log, it's very likely we can reproduce this error. Is it possible for you to write it somewhere with persistent storage?

fknrio commented 2 years ago

I implemented persisting the replay data file and will let you know when it is available.

fknrio commented 2 years ago

I now have a replay file at hand. Should I upload it here? Is it fine if I anonymize some information (i.e. replace the package names)? Otherwise, how can I provide you safely with this file? Or is there anything else I need to do?

earthling-amzn commented 2 years ago

You may anonymize the file and upload it here and we'll see how far we get with it. We'll also look into ways to better exchange confidential files.

fknrio commented 2 years ago

Here it is: 2022-05-05_replay_anonymized.log

I hope you can get some value out of it. Let me know if you need anything else.

navyxliu commented 2 years ago

hi, @fknrio I tried to reproduce from your replay file. One blocker is that your compilation unit contains 2 lambda classes.

 6 16 reactor/core/publisher/Mono$$Lambda$2661+0x0000000801a70570 <init> 
 6 16 reactor/core/publisher/Flux$$Lambda$2755+0x0000000801ab0450 <init> 

Those classes are generated dynamically. I don't have the class files, so we can't trigger the compilation on my side. Here is one workaround: you can pass the following option to java. 'DUMP_CLASS_FILES' is a directory, and you need to create it before executing. This will force java to dump all lambda classes to 'DUMP_CLASS_FILES'.

-Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES

Can you try that? Or do you have a simple reproducer (either source code or a jar file) so we can step into it?
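The workaround above could look like this in a shell (directory name as given; the jar is a placeholder):

```shell
#!/bin/sh
# The dump directory must exist before the JVM starts, otherwise nothing
# is written there.
mkdir -p DUMP_CLASS_FILES
# Every dynamically generated lambda proxy class gets written under
# DUMP_CLASS_FILES/ as a .class file while the app runs.
DUMP_FLAG="-Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES"
echo "java $DUMP_FLAG -jar app.jar"
```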

navyxliu commented 2 years ago

hi, @fknrio, It's also possible to recover the missing classes from a corefile. If it's difficult to reproduce this problem from source code, how about sharing the coredump file with us?

fknrio commented 2 years ago

Hi @navyxliu, I configured the appropriate options and will share the respective files once the crash occurs the next time.

Unfortunately, I don't have a simple reproducer, because for us, too, it only happens in a single service (although others are built very similarly).

fknrio commented 2 years ago

Hi @navyxliu, I dumped the proxy classes (and excluded the com.example package): class_dump.tar.gz together with this replay.log

Hope this already helps?

navyxliu commented 2 years ago

In your replay file, line 14130, this should be an entry of VirtualCallData; however, 0x7f6a5841c470 isn't a valid ciKlass. It's not even in the mapped area:

0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 0x7f6a5841c470

Could you try it with -XX:TypeProfileWidth=0? This option disables type profiling for virtual methods. If this tweak lets you escape the crash, then we can locate the culprit.

So far, our theory is that the JVM fails to retain this class in metaspace, or unloads it prematurely. C2 crashes later when it attempts to use the class as a speculative receiver type.

edeesis commented 2 years ago

We recently ran into this error as well.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=1, tid=8
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.3.6.1 (17.0.3+6) (build 17.0.3+6-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1 (17.0.3+6-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000003fd6
#
# Core dump will be written. Default location: /home/app/core.1
#
# An error report file with more information is saved as:
# /home/app/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Using:

Micronaut, R2DBC (with Kotlin Coroutines). This was using amazoncorretto:17-alpine as my base image.

It also happened when using eclipse-temurin:17-alpine

It didn't happen when I switched to eclipse-temurin:17 or amazoncorretto:17, so I wonder if it's happening because of an incompatibility between Alpine Linux and Kotlin Coroutines.

Strangely, when running the container on my local machine (MacOS running Docker for Mac), it doesn't seem to have the same problem.

fknrio commented 2 years ago

Hi @navyxliu, I set -XX:TypeProfileWidth=0 and it crashed again: error_log_anonymized.log, replay_anonymized.log and class_dump.tar.gz

navyxliu commented 2 years ago

@edeesis , the most distinguishing feature of Alpine is musl libc; GNU/Linux distributions typically use glibc.

I am not sure your case is the same as @fknrio's here. He uses containerized Amazon Linux 2, and he got a crash in a C2 compiler thread.

Could you upload hs_err_pid1.log or even the coredump file?

ghost commented 2 years ago

We're having the same crash on Corretto-17.0.3.6.1 from the amazoncorretto:17 image (not the alpine one).

Current thread (0x00007f06380e40b0):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=22, stack(0x00007f061c87d000,0x00007f061c97e000)]

Current CompileTask:
C2:30514161 10686   !   4       kotlinx.coroutines.flow.AbstractFlow::collect (189 bytes)
navyxliu commented 2 years ago

@fknrio , thank you for your patience. This is a really tricky problem. I analyzed your crash report and replay file. First of all, it looks like this time your app lasted longer (47m). The replay file is still broken; I filed a JBS issue about it: JDK-8287046

@robert-csdisco's case looks very similar to yours. His app crashed at AbstractFlow::collect; close, but not exactly the same. If you have a way to reproduce this problem, that would be super helpful.

I will try to build the debuginfo of 17.0.3.6.1 and see if I can understand RSP[0] better.

fknrio commented 2 years ago

@navyxliu Thank you for looking into it and filing an upstream issue. I have the coredump at hand, but cannot share it due to confidential information. If you require some information from it, just let me know (with details on how to get it).

navyxliu commented 2 years ago

hi, @elizarov,

Some customers report that they observe the PC become zero or near zero in a C2 CompilerThread. C2 has trouble compiling kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl, or "collect".

The closure of classfiles is something like this, from @fknrio's report. Have you seen this before? I wonder if you have a reproducer for this in your bug database.

-cp ./kotlin-stdlib-1.6.21.jar:kotlinx-coroutines-reactor-1.5.2.jar:kotlinx-coroutines-reactive-1.5.2.jar:kotlinx-coroutines-core-jvm-1.5.2.jar:kotlin-stdlib-jdk8-1.6.21.jar

Thank you. --lx

elizarov commented 2 years ago

@navyxliu I have not seen this particular one before.

navyxliu commented 2 years ago

hi, @fknrio ,

I think I am stuck here. So far, all I know is that the argument sub_t of SubTypeCheckNode::sub() is neither a klass_ptr nor an oop_ptr; it's an any_ptr (see details here). I don't know how this happened; maybe kotlinc generates different code.

I can't process your file with sensitive data. If I share the debuginfo file of the JVM with you, would you be able to use gdb to load the coredump and give us stacktraces? Or could you work on a coredump without sensitive data?

thanks, --lx

fknrio commented 2 years ago

Hi @navyxliu, thanks for your analysis. If you guide me, I can share stacktraces. Or we could do an online session together? (Which might mean less back and forth.)

navyxliu commented 2 years ago

hi, @fknrio , We will post a wiki page about how to load a coredump file in gdb and resolve symbols using Corretto debuginfo. Stay tuned.

--lx

fknrio commented 2 years ago

Hi @navyxliu, any update on this? We still face these crashes.

navyxliu commented 2 years ago

hi, @fknrio , I need a reproducer, or at least the stacktrace of the crashing thread, to debug this issue. Here I wrote a quick note on how to parse the coredump along with the executable. Can you try that in the same docker image?

Start with gdb

What do you need to analyze a coredump? Essentially, you need only the executable and the coredump. Corretto binaries ship with symbols, which will help you decode stacktraces. Debuginfo files are optional; they provide DWARF information and will help you understand optimized code and frames.

Theoretically, it's possible to do coredump analysis on a different platform, but that would require extra care with system libraries; at a minimum, you need to prepare the symbols and debuginfo of libc. For simplicity, we assume you are using the exact same Linux system which generated the coredump file.

To parse the coredump file, you need the exact same Java executable; otherwise the symbols and their offsets will not match precisely, and you may not be able to parse the coredump correctly. In this example, we are using Corretto 17.0.3.6.1 for Linux x86_64. Here is the command.

gdb ./amazon-corretto-17.0.3.6.1-linux-x64/bin/java /tmp/core.107410.107411

After gdb loads it, you can dump the stacktrace of any thread. To switch to another thread, use 'thread <id>'. Use 'info threads' to list all threads.

[Current thread is 1 (Thread 0x7f0a83e58700 (LWP 107411))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f0a83297148 in __GI_abort () at abort.c:79
#2  0x00007f0a828820f5 in os::abort(bool, void*, void const*) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#3  0x00007f0a82bb8c60 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#4  0x00007f0a82bb973b in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#5  0x00007f0a82bb976e in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#6  0x00007f0a82a597ee in JVM_handle_linux_signal () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#7  <signal handler called>
#8  0x00007f0a82b6aceb in Unsafe_PutInt () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#9  0x00007f0a6549d53a in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)

In this example, we can see that Unsafe_PutInt of libjvm.so triggered a segmentation fault. If we were lucky, we would know what was wrong. If not, we need to resort to DWARF info and inspect individual frames, which would require you to obtain the debuginfo of libjvm.so first. We are preparing debuginfo files and will start shipping them in the next release. Contact me if you need the debuginfo files of the current release.
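For containers where an interactive session is awkward, the same steps can be scripted with gdb's batch mode (paths as in the example above):

```shell
#!/bin/sh
# Build a non-interactive gdb invocation: -batch exits after running the -ex
# commands, which here list all threads and print the backtrace of the
# current (crashing) thread.
GDB_CMD="gdb -batch -ex 'info threads' -ex bt ./amazon-corretto-17.0.3.6.1-linux-x64/bin/java /tmp/core.107410.107411"
echo "$GDB_CMD"
```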

navyxliu commented 2 years ago

If this bug is critical for you, you can work around it using the following option, which disables the very compilation that triggered the crash in your application.

-XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect
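As a sketch of the full invocation (the jar name is a placeholder; the method to exclude is whichever one appears in your "Current CompileTask" line):

```shell
#!/bin/sh
# Exclude the crashing method from JIT compilation entirely; it will run
# interpreted, trading some throughput for stability.
EXCLUDE_FLAG="-XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect"
echo "java $EXCLUDE_FLAG -jar app.jar"
```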

fknrio commented 2 years ago

Hi @navyxliu, thank you for the description.

I tried the flag -XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect but the JVM still crashed. Shouldn't it be kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl instead?

For analysis: I ran a shell in the exact docker image that crashed, installed gdb, and used the generated coredump file; here is the stacktrace of the crashing thread:

(gdb) bt
#0  0x00007fb477b5eca0 in raise () from /lib64/libc.so.6
#1  0x00007fb477b60148 in abort () from /lib64/libc.so.6
#2  0x00007fb47714b0f5 in os::abort(bool, void*, void const*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#3  0x00007fb477481c60 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#4  0x00007fb47748273b in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#5  0x00007fb47748276e in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#6  0x00007fb4773227ee in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#7  <signal handler called>
#8  0x0000000000000000 in ?? ()
#9  0x00007fb47738c5dd in SubTypeCheckNode::sub(Type const*, Type const*) const () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#10 0x00007fb476d1a0e6 in split_if(IfNode*, PhaseIterGVN*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#11 0x00007fb476d20e2a in IfNode::Ideal(PhaseGVN*, bool) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#12 0x00007fb47718bdb9 in PhaseIterGVN::transform_old(Node*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#13 0x00007fb477188856 in PhaseIterGVN::optimize() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#14 0x00007fb476aded48 in Compile::Optimize() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#15 0x00007fb476ae082d in Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#16 0x00007fb476a11aea in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#17 0x00007fb476aea6ac in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#18 0x00007fb476aeb398 in CompileBroker::compiler_thread_loop() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#19 0x00007fb477402ece in JavaThread::run() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#20 0x00007fb477405f72 in Thread::call_run() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#21 0x00007fb47713f601 in thread_native_entry(Thread*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#22 0x00007fb4780e144b in start_thread () from /lib64/libpthread.so.0
#23 0x00007fb477c1840f in clone () from /lib64/libc.so.6

Does this help? Otherwise it seems I'm missing the corresponding debuginfo (at least gdb complains about "Missing separate debuginfo for /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so", and more). Can you provide the debuginfo of the current release (17.0.3.6.1)?

navyxliu commented 2 years ago

@fknrio , In your case, C2 has difficulty compiling 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl'. You can skip its compilation using -XX:CompileCommand=exclude,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl. That's a workaround.

The stacktrace looks reasonable to me; it's very similar to what we found before.

The problem happens in the ideal graph, which is the intermediate representation of C2; it depends on your code and profiling info. Without a reproducer, I can't get the ideal graph or reason about why it has trouble in SubTypeCheckNode::sub, in particular what the IfNode looks like at frame 11.

If you can confirm, using the 'exclude' workaround above, that kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl is the only source of this problem, try to record its compilation log:

-XX:+LogCompilation -XX:LogFile=broken_compilation.log -XX:CompileCommand=log,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl

Make sure that you only fetch broken_compilation.log after the java process has terminated; compiler logs are only serialized in the termination phase.
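Put together, the logging run might look like this sketch. Note that -XX:+LogCompilation and -XX:LogFile are diagnostic VM options, so in stock builds they generally need -XX:+UnlockDiagnosticVMOptions first; the jar name is a placeholder:

```shell
#!/bin/sh
# LogCompilation/LogFile are diagnostic options and usually require unlocking.
LOG_FLAGS="-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation"
LOG_FLAGS="$LOG_FLAGS -XX:LogFile=broken_compilation.log"
# Restrict detailed compilation logging to the suspect method only.
LOG_FLAGS="$LOG_FLAGS -XX:CompileCommand=log,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl"
echo "java $LOG_FLAGS -jar app.jar"
```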

fknrio commented 2 years ago

I implemented the workaround with PublisherAsFlow::collectImpl and will check if the crash still occurs with the exclusion defined. In parallel, I added the LogCompilation options to a second instance of the service without the exclusion.

Unfortunately it is not straightforward to create a reproducer without confidential data.

Will keep you posted.

fknrio commented 2 years ago

@navyxliu So far the workaround (-XX:CompileCommand=exclude,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl) works fine.

I recorded the compilation log of a crash as you suggested and attach it [here]() in anonymized form (please download; the link expires). Does this help?

karla-barraza commented 1 year ago

We are also seeing this issue on Linux and Windows. We will assess if we can try the workarounds mentioned above.

We've been able to reproduce this consistently with our production code. We have a specific integration test that fails on Linux and Windows during JIT compilation (approximately 50% of the time).

OS: Linux, version 5.15.0-1022-aws, and also Windows Server 2012 R2, version 6.3
Version: Corretto-17.0.4.9.1