ECS Fargate Deployment
2GB, 1vCPU task
Running datadog-agent as a sidecar, version LATEST
Java service details
dd-trace-java version: 1.39.0
JRE: openjdk17
The service runs Quartz scheduled jobs; the Quartz DB runs on an RDS MySQL node.
Problem:
With otel enabled (-Ddd.trace.otel.enabled=True), the service started reporting malformed responses from the Quartz DB. To debug, I used ECS Exec to get into the container and ran a packet capture. The capture revealed that Datadog trace headers were being sent over port 3306 every microsecond.
Besides the malformed responses, this also resulted in a 10-fold increase in network traffic, with the service frequently pinning the CPU. There is also some evidence that this was saturating the NIC and causing problems with other TCP traffic: the capture noted TCP Window Full and ZeroWindow conditions on senders. The Window Full/ZeroWindow events and the malformed packets were all coincident, leading to the conclusion that the increased network load caused the fatal malformed response from RDS (the service was unable to get Quartz jobs).
We turned off otel, since we were not relying on many of its metrics, but I am interested in whether we misconfigured it or whether there is, in fact, an issue. I would also like to understand more about the activity over port 3306, a port reserved for MySQL traffic. In general, is this context propagation or metrics retrieval?
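To help tell context propagation apart from something else, here is a minimal sketch for checking which propagation header names show up in a payload exported from the port 3306 capture. The class name TraceHeaderScan, the marker list, and the sample payload are all hypothetical. I am assuming the headers appear as plain text using the standard names (x-datadog-trace-id, x-datadog-parent-id, x-datadog-sampling-priority, or the W3C traceparent), which may not match what the capture actually contains.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TraceHeaderScan {
    // Standard propagation header names. That these are the strings seen in
    // the capture is an assumption, not something confirmed in this report.
    static final List<String> MARKERS = List.of(
        "x-datadog-trace-id",
        "x-datadog-parent-id",
        "x-datadog-sampling-priority",
        "traceparent");

    // Returns the markers that occur anywhere in the raw payload bytes.
    static List<String> findMarkers(byte[] payload) {
        // ISO-8859-1 maps every byte to exactly one char, so the binary
        // MySQL framing around the text cannot break decoding.
        String text = new String(payload, StandardCharsets.ISO_8859_1)
            .toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String marker : MARKERS) {
            if (text.contains(marker)) {
                found.add(marker);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical payload: a query carrying a trace comment.
        byte[] sample = "\u0003SELECT 1 /* traceparent='00-abc-def-01' */"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findMarkers(sample));
    }
}
```

If the markers only ever appear inside SQL statement text, that would point toward context being attached to queries rather than a separate metrics channel, but that reading is a guess until someone confirms it against the actual capture.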