Open nicolas-vivot opened 3 years ago
Hi @nicolas-vivot, thanks for the detailed report. Currently we test the agent on lots of JVMs, but not on any GraalVM. We will look into what's required to run correctly on GraalVM. Since this seems to be originating in one of our dependencies, the resolution may be easy, but I can't promise anything until we're testing against GraalVM.
Hi @nicolas-vivot. I took a quick look at creating a Graal Native image for an application that uses dd-trace-ot and dd-trace-api. I can reproduce the "warnings" during the generation of a native application. I believe those warnings are likely the cause of the segmentation faults you are seeing. I have not been able to reproduce a segmentation fault at this point but I may not be stressing the right area in my simple test case. Do you have a reproducible test case? What version of GraalVM are you using?
Hi @charliegracie
I'm using GraalVM 21.0.0 (Java 11).
I don't have a reproducer project yet. Let me set one up today and I will post it here.
@charliegracie
You can find a reproducer here: https://github.com/nicolas-vivot/datadog-trace-java-graalvm-segfault-reproducer
Hi @nicolas-vivot. I was not able to reproduce the crash with the test case, even after running the test with Postman hundreds of thousands of times. My assumption is that the crash depends on timing or architecture. The warnings printed about JCTools code during Graal Native generation could cause this type of issue. I forked the reproducer and created a test patch to verify that the crash goes away once the warnings in JCTools are resolved. Would you be able to test it?
My repo is here: https://github.com/charliegracie/datadog-trace-java-graalvm-segfault-reproducer
The branch is: crash_fix
This test patch is not complete, but it should make the program functionally correct. If it resolves the crash, I will work out the proper way to fix this.
Hi @charliegracie
It's strange that you could not reproduce it. I can reproduce it 100% of the time within only a few tests (usually within the first 100 requests). If I remember correctly, the problem happens when or after processing the second queue. Maybe you do not reach that case when trying to reproduce it?
Anyway, your substitution fixes the problem. With your version I cannot reproduce it anymore (I ran tens of thousands of requests).
Good to know that you will fix it directly in the sources. I'm also going to include the substitution in my Quarkus dd-trace-java extension until the fix is released.
Thank you very much for your help!
Hi team, is there any progress on running datadog agents with native images?
I recently moved from Spring Boot to Quarkus, using dd-trace-api and dd-trace-ot to get Datadog tracing inside my application without the Java agent (I'm using the instrumentation package from the OpenTracing community instead).
When I run in JVM mode, I have no problem. But in native mode on GraalVM, I get segmentation faults after a while. This happens when managing the traces to send to the Datadog agent.
Here is the stack trace:
Have you ever compiled and run the Datadog libraries (the core and ot modules) on GraalVM? Do you support it? Do you have any idea where the problem is?
In addition, at build time, Quarkus complains about the usage of these classes (which seem to be the cause according to the stack trace) and says that it cannot substitute them.
Any help would be appreciated. This is the last problem I have to solve to get dd-trace working with Quarkus on GraalVM. If everything works well, I was thinking of creating a Quarkus extension afterwards and offering it to the community. You would definitely benefit from that as well, since I'm not the only one interested in migrating to Quarkus and using Datadog for tracing.
PS: I know that I could also go through OpenTelemetry, but that project is still young, and there is no official Quarkus extension for it yet. In addition, your Datadog exporter for the OpenTelemetry agent cannot handle logs yet. That would force us to run both the Datadog agent and the OpenTelemetry collector on our Kubernetes clusters, a waste of resources.