DataDog / datadog-lambda-go

The Datadog AWS Lambda package for Go
Apache License 2.0
59 stars 40 forks source link

Time out to DD API after 10 seconds #39

Closed cfunkhouser closed 4 years ago

cfunkhouser commented 4 years ago

What does this PR do?

Set a ten second timeout on the http.Client used to communicate with the DataDog API.

Motivation

The Go HTTP Client specifies no timeout by default, which means - theoretically - a request could hang indefinitely under unfortunate server conditions. This is especially unfortunate given this library is designed to run in Lambda, where execution time is money.

I chose 10 seconds somewhat arbitrarily. I'm open to other values, or even to making it tunable. My hope is that this PR will start a conversation, more than just being accepted. I have a poor understanding of what the Datadog API latency guarantees are around the globe.

Testing Guidelines

I go tested it. I also used it on a test lambda internally. It works exactly as expected under non-pathological cases.

Additional Notes

Types of changes

Checklist

codecov-commenter commented 4 years ago

Codecov Report

Merging #39 into master will decrease coverage by 0.33%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
- Coverage   89.16%   88.83%   -0.34%     
==========================================
  Files          12       12              
  Lines         572      573       +1     
==========================================
- Hits          510      509       -1     
- Misses         46       48       +2     
  Partials       16       16              
Impacted Files Coverage Δ
internal/metrics/api.go 90.00% <100.00%> (+0.16%) :arrow_up:
internal/metrics/processor.go 94.44% <0.00%> (-2.78%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2dd03ad...6e22a7e. Read the comment docs.

agocs commented 4 years ago

Oh hey Funk! Good catch. I'll start figuring out how we want to handle API timeouts internally.

This is partially why we're pushing usage of the DD_FLUSH_TO_LOG option, wherein the metrics are written to CloudWatch logs ~instantly, and are picked up by the AWS Forwarder lambda: https://github.com/DataDog/datadog-serverless-functions/tree/master/aws/logs_monitoring

cfunkhouser commented 4 years ago

For sure! However, in some cases we want to avoid CloudWatch because of the added latency primarily, and additional costs secondarily.

agocs commented 4 years ago

Potentially concerning: in internal/metrics/processor.go we implement some retry logic on sending metrics. If the Connection timed out and the right flags were set, we might spend up to 15 seconds trying to send metrics. Better than forever, but not ideal. Fortunately shouldRetryOnFail is set to false by default. I'll add a few tasks: