When sending traces to the extension fails, retry up to 2 times.
Motivation
In a very small percentage of cases, high-throughput apps fail to send traces to the extension. We're seeing errors like:
2022/12/19 21:09:41 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": read tcp 127.0.0.1:44108->127.0.0.1:8126: read: connection reset by peer ([send duration: 0.327196ms]) (occurred: 19 Dec 22 21:07 UTC)
2022/12/19 21:15:44 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": write tcp 127.0.0.1:45932->127.0.0.1:8126: write: broken pipe ([send duration: 0.225527ms]) (occurred: 19 Dec 22 21:14 UTC)
2022/12/19 21:17:54 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ([send duration: 19.249s]) (occurred: 19 Dec 22 21:14 UTC)
While increasing the timeout helps, the logs above show that some failures happen well before the timeout is hit. This is because the Datadog Lambda extension was paused in the middle of the request. When that happens, the connection is closed abruptly.
Therefore, this pull request lets the tracer retry sending the trace at the next opportunity.
What does this PR do?
When sending traces to the extension fails, retry up to 2 times.
Testing Guidelines
Additional Notes
See https://github.com/DataDog/dd-trace-go/pull/1636 for the corresponding change in the Go tracer.
Types of changes
Checklist