DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0

Java Agent Causing Jax-RS to error out #380

Closed sblumenthalAtBraze closed 4 years ago

sblumenthalAtBraze commented 6 years ago

Using the currently released Java agent version (0.9.0), WITHOUT enabling the (still in beta) Jetty or Spark instrumentation, causes my Jetty application to error out when trying to make outgoing GET requests.

The stack trace I get:

java.lang.NullPointerException: null
    at org.glassfish.jersey.model.internal.CommonConfig.configureFeatures(CommonConfig.java:709)
    at org.glassfish.jersey.model.internal.CommonConfig.configureMetaProviders(CommonConfig.java:648)
    at org.glassfish.jersey.client.ClientConfig$State.configureMetaProviders(ClientConfig.java:372)
    at org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:405)
    at org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:90)
    at org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:122)
    at org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:119)
    at org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
    at org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:733)
    at org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:286)
    at org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:135)
    at org.glassfish.jersey.client.JerseyInvocation.<init>(JerseyInvocation.java:105)
    at org.glassfish.jersey.client.JerseyInvocation.<init>(JerseyInvocation.java:101)
    at org.glassfish.jersey.client.JerseyInvocation.<init>(JerseyInvocation.java:92)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:411)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:311)
    at <application code, call to client.target(uri).request().get()>
    at spark.RouteImpl$1.handle(RouteImpl.java:72)
    at spark.http.matching.Routes.execute(Routes.java:61)
    at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:130)
    at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)

This error DOES NOT occur when the Datadog Java agent is not present on the box. The javaagent string I am using is:

-javaagent:/opt/dd-java-agent.jar -Ddd.service.name=$DD_SERVICE_NAME -Ddd.agent.host=$DD_HOST_IP

The agent running on the box is set up using the chef-datadog recipe. I have this code deployed to a handful of servers, and I deploy several times a day. It is usually only one of the servers that errors out on all outgoing requests, and which server it is changes with each new deploy.

Edit: Some more details: the application is running in a Docker container and is sending the trace data to a Datadog agent running on the host machine. The application uses spark-core 2.6.0.

realark commented 6 years ago

Hi @sblumenthalAtBraze.

Can you provide the following info:

That will tell us where to look to investigate.

sblumenthalAtBraze commented 6 years ago

@realark can do!

Like I said, I am using spark-core 2.6.0. The corresponding Jetty and Jersey client versions that it pulls in are:

Gradle: org.eclipse.jetty.websocket:websocket-api:9.4.4.v20170414
Gradle: org.eclipse.jetty.websocket:websocket-client:9.4.4.v20170414
Gradle: org.eclipse.jetty.websocket:websocket-common:9.4.4.v20170414
Gradle: org.eclipse.jetty.websocket:websocket-server:9.4.4.v20170414
Gradle: org.eclipse.jetty.websocket:websocket-servlet:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-client:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-continuation:9.2.22.v20170606
Gradle: org.eclipse.jetty:jetty-http:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-io:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-jaas:9.2.22.v20170606
Gradle: org.eclipse.jetty:jetty-jmx:9.2.22.v20170606
Gradle: org.eclipse.jetty:jetty-security:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-server:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-servlet:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-servlets:9.2.22.v20170606
Gradle: org.eclipse.jetty:jetty-util:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-webapp:9.4.4.v20170414
Gradle: org.eclipse.jetty:jetty-xml:9.4.4.v20170414
Gradle: org.glassfish.hk2.external:aopalliance-repackaged:2.5.0-b30
Gradle: org.glassfish.hk2.external:javax.inject:2.5.0-b30
Gradle: org.glassfish.hk2:hk2-api:2.5.0-b30
Gradle: org.glassfish.hk2:hk2-locator:2.5.0-b30
Gradle: org.glassfish.hk2:hk2-utils:2.5.0-b30
Gradle: org.glassfish.hk2:osgi-resource-locator:1.0.1
Gradle: org.glassfish.jersey.bundles.repackaged:jersey-guava:2.25
Gradle: org.glassfish.jersey.containers:jersey-container-servlet:2.25
Gradle: org.glassfish.jersey.containers:jersey-container-servlet-core:2.25
Gradle: org.glassfish.jersey.core:jersey-client:2.25
Gradle: org.glassfish.jersey.core:jersey-common:2.25
Gradle: org.glassfish.jersey.core:jersey-server:2.25
Gradle: org.glassfish.jersey.ext:jersey-bean-validation:2.25
Gradle: org.glassfish.jersey.media:jersey-media-jaxb:2.25
Gradle: org.glassfish.web:javax.el:2.2.4

Aside from that, the other dependencies for the app are:

  compile "org.apache.kafka:kafka-clients:1.0.0-cp1"
  compile "org.apache.kafka:kafka-streams:1.0.0-cp1"
  compile "io.confluent:kafka-avro-serializer:4.0.0"
  compile "io.confluent:kafka-json-serializer:4.0.0"
  compile "io.confluent:kafka-schema-registry-client:4.0.0"
  compile "org.apache.avro:avro:1.7.7"
  compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.8.4'
  compile 'com.google.guava:guava:20.0'
  compile 'com.google.code.gson:gson:2.8.0'
  compile 'com.datadoghq:java-dogstatsd-client:2.3'
  compile 'javax.ws.rs:javax.ws.rs-api:2.0.1'
  compile 'com.sparkjava:spark-core:2.6.0'
  compile 'org.glassfish.jersey.core:jersey-common:2.25'
  compile 'org.glassfish.jersey.core:jersey-client:2.25'
  compile 'org.jetbrains:annotations-java5:15.0'
  compile 'io.sentry:sentry-log4j:1.7.5'
  compile 'com.datadoghq:dd-trace-api:0.9.0'

mar-kolya commented 6 years ago

@sblumenthalAtBraze Thanks for your response!

Could you please also try adding -Ddd.integration.jax-rs.enabled=false to the JVM arguments and see if that helps with the problem at all?
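
Combined with the arguments from your original report, the full set of JVM arguments would then look something like:

-javaagent:/opt/dd-java-agent.jar -Ddd.service.name=$DD_SERVICE_NAME -Ddd.agent.host=$DD_HOST_IP -Ddd.integration.jax-rs.enabled=false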

sblumenthalAtBraze commented 6 years ago

@mar-kolya That actually appears to have fixed the issue! Not quite sure why the out-of-the-box integration would cause the problem, but it doesn't appear that any additional data is being lost, so no harm done!

I will keep an eye on it for now to make sure nothing goes wrong, but it looks like I might be good.

mar-kolya commented 6 years ago

@sblumenthalAtBraze Thanks!

I think this gives us some information to look into fixing this properly. I hope this is 'good enough' as a workaround for now.

Thanks for reporting the problem!

mar-kolya commented 6 years ago

@sblumenthalAtBraze

Could you please provide us with details about how you create the client object (as in the <application code, call to client.target(uri).request().get()> frame)? Do you have any features enabled? Any special configuration? Anything would be helpful.
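
For example, a bare-bones JAX-RS client call that would produce the frame in your stack trace looks roughly like the snippet below (purely illustrative, since we do not know how your application actually builds the client or which features it registers):

    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import javax.ws.rs.core.Response;

    public class ClientExample {
        public static void main(String[] args) {
            // Plain Jersey client with default configuration and no features registered
            Client client = ClientBuilder.newClient();
            // Matches the call site in the stack trace: client.target(uri).request().get()
            Response response = client.target("http://example.com/resource").request().get();
            System.out.println(response.getStatus());
            response.close();
            client.close();
        }
    }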

Thanks!

mar-kolya commented 6 years ago

Could you please also share the version of your JVM and any Datadog Java agent startup logs that you can (when -Ddd.integration.jax-rs.enabled is set to true)?

Unfortunately, so far we cannot reproduce the problem, even with the exact same library versions that you are using.

Thanks!

tylerbenson commented 6 years ago

@sblumenthalAtBraze We are still unable to figure out what is going on here. Reviewing the code in the framework, it seems impossible to get into this state. Are you able to reproduce this problem locally? If so, and you have some time, perhaps one of us could jump on a call with you to go through it in a debugger.

Thanks!

tylerbenson commented 6 years ago

@sblumenthalAtBraze We've made quite a few changes since 0.9.0. Would you be willing to try again with the latest release and see if the issue has been resolved? You could also try the build from #479 that I did to fix the problem reported in #437, which might help you as well.

sblumenthalAtBraze commented 5 years ago

So, I figured I would post an update here: I tried enabling the jax-rs integration this week using agent version 0.25 and ran into the same problem.

What happens is that sometimes, though not always, a redeploy of the code results in 100% of requests throwing the same stack trace as the one posted at the top.

Since I first opened this ticket, I noticed that we were not packaging all of the runtime dependencies our application needed into the Docker container we were building. Instead, we were cherry-picking a small handful of them and wget'ing them into the container. I fixed that and thought it might have been the cause behind this issue. However, the NPE is still thrown sometimes.

Then I noticed that the stack trace showed line numbers consistent with jersey-common version 2.27. My code lists 2.25 as a dependency, yet both versions were being loaded onto the classpath. I thought it might be a version mismatch and upgraded the dependency to 2.27 to be consistent, yet the issue would still occasionally occur.
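
For anyone hitting something similar, one quick way to confirm which jersey-common jar a class is actually loaded from is a check along these lines (illustrative snippet, not our application code):

    import org.glassfish.jersey.model.internal.CommonConfig;

    public class WhichJar {
        public static void main(String[] args) {
            // Prints the jar (or directory) that CommonConfig was loaded from,
            // which shows which jersey-common version actually won on the classpath
            System.out.println(
                    CommonConfig.class.getProtectionDomain().getCodeSource().getLocation());
        }
    }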

I was unable to figure out what during deployment and startup was causing the app to get into this broken state, so I cannot reliably reproduce it, but it was happening daily in prod, so today I disabled the integration again.

With all of that said, looking at the traces that were produced while the integration was enabled, the information contained in the jax-rs spans was identical to that in the http.request spans, and both were reported with the exact same timestamp in each trace. Personally, I do not need both, so I am going to leave this integration disabled anyway, but I wanted to give a status update that the issue still persists for me.