DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
572 stars 284 forks source link

Unable to disable tracing via env var, and connection times out after 30 seconds rather than default #7599

Open lukesturgis opened 2 weeks ago

lukesturgis commented 2 weeks ago

Hello, I'm having a couple of issues with the Java trace API.

  1. When setting the environment variable DD_TRACE_ENABLED to false, the tracing library still attempts to initiate a connection to an agent endpoint. In the event it cannot connect, this will take up to the full timeout to exit, whereas I'd expect it to disable all tracing.
    
    user@host:~$ export DD_TRACE_ENABLED=false
    user@host:~$ time <long java start command> -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=localhost java/lib/hellojava.jar
    OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
    [dd.trace 2024-09-10 20:18:39:649 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
    rumentation.class=datadog.trace.instrumentation.akka.concurrent.AkkaForkJoinTaskInstrumentation
    [dd.trace 2024-09-10 20:18:39:652 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
    rumentation.class=datadog.trace.instrumentation.akka.concurrent.AkkaMailboxInstrumentation
    [dd.trace 2024-09-10 20:18:39:686 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
    rumentation.class=datadog.trace.instrumentation.googlepubsub.PublisherInstrumentation
    [dd.trace 2024-09-10 20:18:39:716 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
    rumentation.class=datadog.trace.instrumentation.java.concurrent.JavaForkJoinTaskInstrumentation

...

[dd.trace 2024-09-10 20:19:23:055 +0000] [main] DEBUG datadog.trace.agent.tooling.TracerInstaller - Tracing is disabled, not installing GlobalTracer.

...

[dd.trace 2024-09-10 20:19:23:389 +0000] [main] EXCLUDE_TELEMETRY datadog.communication.ddagent.DDAgentFeaturesDiscovery - Error querying info at http://localhost:8126/

real 0m1.994s user 0m2.931s sys 0m0.271s


2. After disabling the agent running on localhost, when I set the `agent.host` to localhost the connection fails within less than 3 seconds and exits:

user@host:~$ time -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=localhost java/lib/hellojava.jar OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended [dd.trace 2024-09-10 20:14:32:022 +0000] [dd-telemetry] WARN datadog.telemetry.TelemetryRouter - Got FAILURE sending telemetry request to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

real 0m2.638s user 0m4.009s sys 0m0.334s


However, if I set it to a host that does not exist over the ethernet device, it will attempt to connect and take over 30 seconds to fail:

user@host:~$ time -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=this-host-doesnt-exist.com java/lib/hellojava.jar OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended [dd.trace 2024-09-06 19:33:57:476 +0000] [dd-task-scheduler] INFO datadog.trace.agent.core.StatusLogger - DATADOG TRACER CONFIGURATION {"version":"1.35.0~97065ed022","os_name":"Linux","os_version":"6.1.85-ts1-amd64","architecture":"amd64","lang":"jvm","lang_version":"21.0.2-tsjgss","jvm_vendor":"N/A","jvm_version":"21.0.2-tsjgss","java_class_version":"65.0","http_nonProxyHosts":"null","http_proxyHost":"null","enabled":true,"service":"hellojava","agent_url":"http://this-host-doesnt-exist.com:8126","agent_error":true,"debug":false,"trace_propagation_style_extract":["datadog","tracecontext"],"trace_propagation_style_inject":["datadog","tracecontext"],"analytics_enabled":false,"priority_sampling_enabled":true,"logs_correlation_enabled":true,"profiling_enabled":false,"remote_config_enabled":true,"debugger_enabled":false,"debugger_exception_enabled":false,"appsec_enabled":"ENABLED_INACTIVE","telemetry_enabled":true,"telemetry_dependency_collection_enabled":true,"telemetry_log_collection_enabled":false,"dd_version":"","health_checks_enabled":true,"configuration_file":"no config file present","runtime_id":"1f517376-f108-4012-a0ef-ea1940f110f3","logging_settings":{"levelInBrackets":false,"dateTimeFormat":"'[dd.trace 'yyyy-MM-dd HH:mm:ss:SSS Z']'","logFile":"System.err","configurationFile":"simplelogger.properties","showShortLogName":false,"showDateTime":true,"showLogName":true,"showThreadName":true,"defaultLogLevel":"INFO","warnLevelString":"WARN","embedException":false},"cws_enabled":false,"cws_tls_refresh":5000,"datadog_profiler_enabled":false,"datadog_profiler_safe":true,"datadog_profiler_enabled_overridden":false,"data_streams_enabled":false} [dd.trace 2024-09-06 19:34:01:688 +0000] [main] WARN datadog.telemetry.TelemetrySystem - Telemetry thread join was not completed

real 0m38.012s user 0m5.508s sys 0m0.353s


I did a network capture that shows the localhost sent back an RST immediately:

21 1.761860 127.0.0.1 → 127.0.0.1 TCP 74 40270 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=65495 SACK_PERM=1 TSval=2013823740 TSecr=0 WS=2048 22 1.761878 127.0.0.1 → 127.0.0.1 TCP 54 8126 → 40270 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0


But to the agent over the network, it went into a TCP retransmission loop (the host I am attempting to connect to lives behind several layers of switches and firewalls, and does not send back an RST):

251 16.514437 x.x.x.x → y.y.y.y TCP 74 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238250723 TSecr=0 WS=2048 266 17.550506 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238251760 TSecr=0 WS=2048 396 19.598486 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238253808 TSecr=0 WS=2048 428 23.630480 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238257840 TSecr=0 WS=2048 634 26.525267 x.x.x.x → y.y.y.y TCP 74 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238260734 TSecr=0 WS=2048 645 27.534484 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238261744 TSecr=0 WS=2048 707 29.582488 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238263792 TSecr=0 WS=2048 742 33.614484 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238267824 TSecr=0 WS=2048 766 36.527430 x.x.x.x → y.y.y.y TCP 74 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238270736 TSecr=0 WS=2048 786 37.582481 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238271792 TSecr=0 WS=2048 815 39.630481 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238273840 TSecr=0 WS=2048 855 43.662508 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238277872 TSecr=0 WS=2048 890 46.899895 x.x.x.x → y.y.y.y TCP 74 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238281109 TSecr=0 WS=2048 893 47.086992 x.x.x.x → y.y.y.y TCP 74 41120 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238281296 TSecr=0 WS=2048 894 47.359935 x.x.x.x → y.y.y.y UDP 305 59395 → 8125 Len=263 904 47.950505 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238282160 TSecr=0 WS=2048 926 48.766598 x.x.x.x → y.y.y.y UDP 247 59395 → 8125 Len=205 927 48.766633 x.x.x.x → y.y.y.y UDP 249 59395 → 8125 Len=207 928 48.767289 x.x.x.x → y.y.y.y UDP 268 59395 → 8125 Len=226 929 48.767567 x.x.x.x → y.y.y.y UDP 249 59395 → 8125 Len=207 930 48.768114 x.x.x.x → y.y.y.y UDP 259 59395 → 8125 Len=217 931 48.768306 x.x.x.x → y.y.y.y UDP 268 59395 → 8125 Len=226 932 48.768883 x.x.x.x → y.y.y.y UDP 267 59395 → 8125 Len=225 933 48.769061 x.x.x.x → y.y.y.y UDP 266 59395 → 8125 Len=224 934 48.769228 x.x.x.x → y.y.y.y UDP 264 59395 → 8125 Len=222 935 48.769381 x.x.x.x → y.y.y.y UDP 263 59395 → 8125 Len=221 936 48.769545 x.x.x.x → y.y.y.y UDP 243 59395 → 8125 Len=201 937 48.769713 x.x.x.x → y.y.y.y UDP 267 59395 → 8125 Len=225 938 48.769914 x.x.x.x → y.y.y.y UDP 245 59395 → 8125 Len=203 939 48.770073 x.x.x.x → y.y.y.y UDP 253 59395 → 8125 Len=211 940 48.770220 x.x.x.x → y.y.y.y UDP 241 59395 → 8125 Len=199 941 48.770366 x.x.x.x → y.y.y.y UDP 249 59395 → 8125 Len=207 942 48.770537 x.x.x.x → y.y.y.y UDP 257 59395 → 8125 Len=215 943 48.770692 x.x.x.x → y.y.y.y UDP 251 59395 → 8125 Len=209 944 48.770858 x.x.x.x → y.y.y.y UDP 248 59395 → 8125 Len=206 945 48.771013 x.x.x.x → y.y.y.y UDP 267 59395 → 8125 Len=225 946 48.771365 x.x.x.x → y.y.y.y UDP 257 59395 → 8125 Len=215 947 48.771559 x.x.x.x → y.y.y.y UDP 274 59395 → 8125 Len=232 948 48.771735 x.x.x.x → y.y.y.y UDP 272 59395 → 8125 Len=230 958 49.998482 x.x.x.x → y.y.y.y TCP 74 [TCP Retransmission] 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238284208 TSecr=0 WS=2048



The [DEFAULT_AGENT_TIMEOUT ](https://github.com/DataDog/dd-trace-java/blob/master/dd-trace-api/src/main/java/datadog/trace/api/ConfigDefaults.java#L41) looks to be 10 seconds, so I'm not sure why this is taking 30 seconds to complete, other than the fact that the TCP retransmissions are adding to the delay. 

Can you please advise:
1. Why the tracer still attempts to initiate a connection to an agent if the environment variable is set to disable it
2. If we are setting our agent timeout incorrectly

Please let me know if you need any more information. Thank you!