DataDog / datadog-lambda-go

The Datadog AWS Lambda package for Go
Apache License 2.0

Retry sending trace payloads on failure. #128

Closed purple4reina closed 1 year ago

purple4reina commented 1 year ago

What does this PR do?

When sending traces to the extension fails, retry up to 2 times.

Motivation

In a very small percentage of cases, high-throughput apps fail to send traces to the extension. We're seeing errors like:

2022/12/19 21:09:41 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": read tcp 127.0.0.1:44108->127.0.0.1:8126: read: connection reset by peer ([send duration: 0.327196ms]) (occurred: 19 Dec 22 21:07 UTC)

2022/12/19 21:15:44 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": write tcp 127.0.0.1:45932->127.0.0.1:8126: write: broken pipe ([send duration: 0.225527ms]) (occurred: 19 Dec 22 21:14 UTC)

2022/12/19 21:17:54 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ([send duration: 19.249s]) (occurred: 19 Dec 22 21:14 UTC)

While increasing the timeout helps, some failures happen well before the timeout is hit, as the sub-millisecond send durations in the first two errors show. This is because the Datadog Lambda extension has been paused in the middle of the request. When that happens, the connection is abruptly closed.

Therefore, this pull request allows the tracer to retry sending the trace at the next opportunity.
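
For illustration only, here is a minimal sketch of the "retry up to 2 times" behavior. It is not the code from this PR or from dd-trace-go: `sendWithRetry`, `flakySend`, and `maxRetries` are hypothetical names, and the loop retries immediately, whereas the tracer decides for itself when the next attempt happens.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRetries mirrors the "retry up to 2 times" described above.
// The constant name is an assumption made for this sketch.
const maxRetries = 2

// sendWithRetry calls send once and, if it fails, up to `retries` more times.
// It returns the number of attempts made and the last error, if any.
func sendWithRetry(send func() error, retries int) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = send(); err == nil {
			return attempts, nil
		}
		if attempts > retries {
			return attempts, fmt.Errorf("giving up after %d attempts: %w", attempts, err)
		}
	}
}

// flakySend returns a stand-in for the HTTP POST to the extension. It fails
// the first `failures` calls, simulating a connection closed mid-request,
// and succeeds afterwards.
func flakySend(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("read: connection reset by peer")
		}
		return nil
	}
}
```

With these definitions, `sendWithRetry(flakySend(2), maxRetries)` succeeds on the third attempt, while `sendWithRetry(flakySend(3), maxRetries)` stops after three attempts and returns an error, which corresponds to the trace being lost.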

Additional Notes

See https://github.com/DataDog/dd-trace-go/pull/1636 for corresponding change in the go tracer.
