Open · andreasparteli opened this issue 3 months ago
Hi, thanks for reporting the issue, but with the given info we still cannot locate where the problem is. The debug-level log shows everything was working as expected before the crash.
Firstly, if you can provide some steps to reproduce, that will be very helpful.
2024-06-24 12:17:55.643,# Core dump will be written. Default location: /app/core.1
2024-06-24 12:17:55.643,#
2024-06-24 12:17:55.643,# An error report file with more information is saved as:
2024-06-24 12:17:55.643,# /app/hs_err_pid1.log
Secondly, the two files /app/core.1 and /app/hs_err_pid1.log will also be very helpful for us to track down the bug.
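For what it's worth, here is a minimal sketch of how those files can be captured from a container (paths are illustrative; setting core_pattern this way typically requires a privileged container, and on Fargate the files would additionally need to be shipped off the task, e.g. to a mounted volume, before it stops):

```sh
# Allow core dumps inside the container (core_pattern is a host-kernel
# setting, so writing it usually requires a privileged container)
ulimit -c unlimited
echo '/app/core.%p' > /proc/sys/kernel/core_pattern

# Ask HotSpot to write its fatal error report to a known location
java -XX:ErrorFile=/app/hs_err_pid%p.log -jar app.jar
```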
Lastly, if possible, can you enable trace-level logging? It looks like you have debug-level logs, which don't contain enough information about the lifetime issues that may be causing the crash. The trace log will likely not pinpoint the bug either, but it can provide a bit more information about what happened.
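For anyone wondering how to turn that on, here is a minimal sketch using aws-crt-java's Log class (the log file path is illustrative):

```java
import software.amazon.awssdk.crt.Log;

public class CrtTraceLogging {
    public static void main(String[] args) {
        // Must run before any other CRT call so the native logger
        // is initialized at trace level from the start.
        Log.initLoggingToFile(Log.LogLevel.Trace, "/app/crt-trace.log");

        // ... start the application code that uses the CRT client ...
    }
}
```

If I recall correctly, the same can be done without code changes via system properties, e.g. -Daws.crt.log.level=Trace -Daws.crt.log.destination=File -Daws.crt.log.filename=/app/crt-trace.log.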
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.
We have the same issue in a Docker container. Did you ever find a solution for this? OpenJDK 11, eclipse-temurin:11-jre-alpine, CRT client version 2.25.31.
No, unfortunately we didn't find a solution. And since it was running in a Fargate task, accessing the error log files wasn't really feasible, so we ended up switching to Netty.
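In case it helps others hitting this: the switch is a one-line change on the SDK client builder, plus swapping the aws-crt-client dependency for netty-nio-client. A minimal sketch, assuming an S3 async client (the service here is an assumption; any async SDK client accepts the same builder option):

```java
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class NettyFallback {
    public static void main(String[] args) {
        // Build the client on Netty instead of the CRT-based HTTP client.
        try (S3AsyncClient s3 = S3AsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder())
                .build()) {
            // ... issue async requests as before ...
        }
    }
}
```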
@ZAlex1988 can you provide more details about your Docker container? A minimal reproducible example with a Dockerfile and code snippet would help us track this bug down.
Some details for the container where the failures were observed: eclipse-temurin:11-jre-alpine, with the application running on a WildFly server. I cannot provide a sample reproducible configuration, as the error happens seemingly at random. For now we have changed the container type and updated the AWS SDK version (from 2.25.31 to 2.26.25); I will update here if we see this error again. It's hard to reproduce.
Similar to @ZAlex1988 - running a WildFly server on Alpine Linux with Java 11. I have attached the hs_err_pid log file. The crash seems to occur in low-memory situations once the application has been running for a few minutes under moderate load. dump_local.txt
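If it really is memory pressure, it may be worth confirming whether native (off-heap) allocations are what's growing; heap exhaustion alone usually surfaces as an OutOfMemoryError rather than a SIGSEGV. A sketch using HotSpot's standard native memory tracking (note that jcmd ships with JDK images, not the -jre variants):

```sh
# Start the JVM with native memory tracking enabled (small runtime overhead)
java -XX:NativeMemoryTracking=summary -jar app.jar

# Then, inside the container, break native allocations down by category
jcmd <pid> VM.native_memory summary
```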
We have enabled core dumps. The dump file is large; please see it at https://drive.google.com/file/d/1Z76Cx0zHElyi6XhD8Q7t1wBa3lHzYTAY/view?usp=sharing
Potentially our issue was caused by a memory leak in another library. Could that be an explanation?
After replacing AWS CRT with Netty IO, we found and fixed a memory leak. Since I was curious whether the lack of memory could have led to the SIGSEGV, we switched back to AWS CRT, and it has been running fine for a week now.
Describe the bug
At a seemingly random time, our Java process dies due to a SIGSEGV in AWS CRT. It's hard to reproduce, as the affected service runs for hours at a time executing the same code that later causes the crash.
This was already described in https://github.com/awslabs/aws-crt-java/issues/763. I managed to get a CRT debug log and am still investigating further.
Expected Behavior
No crash
Current Behavior
Reproduction Steps
Not reproducible deterministically right now
Possible Solution
No response
Additional Information/Context
Runs in a Fargate task for us
aws-crt-java version used
2.26.7 (happened for 2.21.29 as well)
Java version used
OpenJDK Runtime Environment Corretto-11.0.23.9.1
Operating System and version
amazoncorretto:11-alpine