DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
583 stars, 290 forks

SegFault when running dd-trace-java on GraalVM (native) #2463

Open nicolas-vivot opened 3 years ago

nicolas-vivot commented 3 years ago

I recently moved from Spring Boot to Quarkus, using dd-trace-api and dd-trace-ot to get Datadog tracing inside my application without the Java agent. (I'm using the instrumentation packages from the OpenTracing community instead.)

When I run in JVM mode, I have no problem. But in native mode on GraalVM, I hit segmentation faults after a while. This happens while traces are being prepared for sending to the Datadog agent.

Here is the stack trace: [image]

Have you ever compiled and run the Datadog libraries (the core and ot modules) on GraalVM? Do you support it? Do you have any idea where the problem is?

In addition, at build time, Quarkus complains about the usage of these classes (which seem to be the cause, according to the stack trace) and says that it cannot substitute them: [image]

Any help would be appreciated. This is the last problem I need to solve to have dd-trace working with Quarkus on GraalVM. If everything works well after that, I was thinking of creating a Quarkus extension to offer this to the community. You would benefit from that as well, since I'm not the only one interested in migrating to Quarkus and using Datadog for tracing.

PS: I know that I could also go through OpenTelemetry, but that project is still young, and on Quarkus in particular there is no official extension for it yet. In addition, your Datadog exporter for the OpenTelemetry agent cannot handle logs yet, which would force us to run both the Datadog agent and the OpenTelemetry collector on our Kubernetes clusters, which would be a waste of resources.

richardstartin commented 3 years ago

Hi @nicolas-vivot, thanks for the detailed report. Currently we test the agent on lots of JVMs, but not on any GraalVM. We will look into what's required to run correctly on GraalVM. Since this seems to be originating in one of our dependencies, the resolution may be easy, but I can't promise anything until we're testing against GraalVM.

charliegracie commented 3 years ago

Hi @nicolas-vivot. I took a quick look at creating a Graal Native image for an application that uses dd-trace-ot and dd-trace-api. I can reproduce the "warnings" during the generation of a native application. I believe those warnings are likely the cause of the segmentation faults you are seeing. I have not been able to reproduce a segmentation fault at this point but I may not be stressing the right area in my simple test case. Do you have a reproducible test case? What version of GraalVM are you using?

nicolas-vivot commented 3 years ago

Hi @charliegracie

I'm using GraalVM 21.0.0 (Java 11).

I don't have a reproducer project yet. Let me set one up today and I will provide it here.

nicolas-vivot commented 3 years ago

@charliegracie

You can find a reproducer here: https://github.com/nicolas-vivot/datadog-trace-java-graalvm-segfault-reproducer

charliegracie commented 3 years ago

Hi @nicolas-vivot. I was not able to reproduce the crash with the test case. I ran the test with Postman hundreds of thousands of times. My assumption is that the crash depends on timing or architecture. The warnings printed about JCTools code during Graal native image generation could cause this type of issue. I forked the reproducer and created a test patch to verify that the crash goes away once the JCTools warnings are resolved. Would you be able to test it?

My repo is here: https://github.com/charliegracie/datadog-trace-java-graalvm-segfault-reproducer

The branch is: crash_fix

This test patch is not complete, but it should make the program functionally correct. If it resolves the issue, I will work out the proper way to fix it.
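The screenshots of the build warnings are not preserved in this thread, so the exact messages are unknown. One plausible mechanism, offered here as an assumption rather than a diagnosis, is that queue code which accesses arrays through `sun.misc.Unsafe` caches an array base offset and element scale in static fields. If those values are captured while the native image is being built and are not recomputed for the image's own object layout, every later slot address is calculated from stale numbers. The sketch below only illustrates that arithmetic; the class name, method name, and layout values are hypothetical and are not taken from JCTools or dd-trace-java.

```java
public final class OffsetMismatchSketch {

    /**
     * Byte offset of slot {@code index}, using the base + (index << shift)
     * form that Unsafe-based array access commonly relies on.
     */
    public static long elementOffset(long arrayBase, int indexScale, long index) {
        // indexScale is assumed to be a power of two, so a shift replaces the multiply.
        int shift = Integer.numberOfTrailingZeros(indexScale);
        return arrayBase + (index << shift);
    }

    public static void main(String[] args) {
        // Hypothetical layouts, chosen only to make the arithmetic visible.
        long cachedAtBuildTime = elementOffset(16, 4, 3); // scale seen by the build JVM
        long validAtRunTime = elementOffset(16, 8, 3);    // scale used by the running image

        // Prints: cached=28 actual=40
        System.out.println("cached=" + cachedAtBuildTime + " actual=" + validAtRunTime);
    }
}
```

With these made-up numbers, a read of slot 3 through the cached values would land 12 bytes away from the real slot. An access of that kind would be consistent with an intermittent segmentation fault, and with a substitution that recomputes the values making the crash disappear.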

nicolas-vivot commented 3 years ago

Hi @charliegracie

It's strange that you could not reproduce it. I can reproduce it 100% of the time within only a few tests (usually within the first 100 requests). If I remember correctly, the problem happens when or after processing the second queue. Perhaps you are not reaching that case when trying to reproduce it?
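Since the crash appears to depend on timing, sending requests concurrently may exercise the trace queues differently from sequential Postman runs. The following is a minimal sketch of such a load driver using only the JDK (Java 11+). The endpoint URL, request count, and thread count are assumptions to be replaced with whatever the reproducer actually exposes, and nothing here is confirmed to trigger the fault.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

public final class LoadDriver {

    /** Runs {@code total} calls of {@code request} on {@code threads} workers; returns how many returned true. */
    public static int run(int total, int threads, IntPredicate request) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                final int n = i;
                Callable<Boolean> task = () -> request.test(n);
                results.add(pool.submit(task));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // The request threw; count it as unsuccessful.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return succeeded;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint: pass the real route of the application under test as the first argument.
        URI target = URI.create(args.length > 0 ? args[0] : "http://localhost:8080/hello");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(target).GET().build();

        int succeeded = run(1_000, 16, i -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
            } catch (Exception e) {
                return false; // e.g. connection refused after the native process has crashed
            }
        });
        System.out.println(succeeded + " of 1000 requests succeeded");
    }
}
```

A sudden run of failed requests partway through would suggest that the native process died, which can then be checked against its console output.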

Anyway, your substitution fixes the problem. With your version I can no longer reproduce it (I ran tens of thousands of requests).

Good to know you will fix it directly in the sources. I'm also going to include the substitution in my Quarkus dd-trace-java extension until the fix is released.

Thank you very much for your help!

VishGov commented 1 year ago

Hi team, is there any progress on running Datadog agents with native images?