fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Datadog query returns unexpected nil values: Enable Datadog API call and response in debug logs #1217

Open jonnylangefeld opened 2 years ago

jonnylangefeld commented 2 years ago

Describe the bug

We are running Flagger as a central service used by multiple other services for progressive delivery. The services use Datadog queries via MetricTemplate resources. The other day we had an outage in which only one of the services could not progress, because it received only nil values in the Datadog response:

{"level":"info","ts":"2022-06-06T17:14:07.153Z","caller":"controller/events.go:45","msg":"Halt service.namespace advancement slo 0.00 < 90","canary":"service.namespace"}

This same canary has worked flawlessly in the past. During this incident, several retries of the rollout did not succeed. Restarting the Flagger pod and retrying the rollout once more resolved the issue for now.

We ran the same query that Flagger receives via the MetricTemplate, both through the Datadog UI and as a curl command (to more closely replicate what Flagger does behind the scenes), using the approximate timestamps of the progressive delivery. In both cases the metrics in the Datadog response had legitimate values above 90, so the delivery should have progressed.
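For reference, such a manual replication against Datadog's v1 query endpoint might look like the sketch below; the metric query, time window, and credentials are placeholders, not the actual query from our MetricTemplate:

```sh
# Hypothetical replication of Flagger's Datadog query; the query string and
# five-minute window are placeholders. DD_API_KEY / DD_APP_KEY must hold
# valid Datadog credentials. (GNU date shown; use `date -v-5M +%s` on macOS.)
curl -G "https://api.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "from=$(date -d '5 minutes ago' +%s)" \
  --data-urlencode "to=$(date +%s)" \
  --data-urlencode "query=avg:service.slo{service:my-service}"
```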

Since at this point it's unclear where the issue happened (FWIW, it could be a faulty Datadog query that returns unexpected responses for the timestamps Flagger uses), we suggest enabling debug output in Flagger that prints the Datadog query, including all headers (with the token masked), and the response. Only this would let us reproduce the exact query Flagger sends under the hood, with the exact timestamps, and debug further why Flagger did not receive any data from Datadog. Since this would be debug-level output only, it would not affect any other operations.

To Reproduce

So far we have not been able to reproduce the issue, hence the suggestion to add debug logs for the Datadog query and the Datadog response.

Expected behavior

The delivery should have progressed because the values we observed via the Datadog UI and API were all within good ranges (above 90).

Additional context

stefanprodan commented 2 years ago

@jonnylangefeld I'm OK with adding debug logs as long as we mask the token. Would you like to contribute this?
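As a sketch of what that masking could look like (illustrative Go only, not Flagger's actual provider code; the helper names are made up here):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// maskSecret hides all but the last four characters of a credential so it
// can be printed in debug logs without leaking the token.
func maskSecret(s string) string {
	if len(s) <= 4 {
		return "****"
	}
	return strings.Repeat("*", len(s)-4) + s[len(s)-4:]
}

// debugHeaders renders request headers with known secret headers masked.
// Note: net/http canonicalizes header names (DD-API-KEY -> Dd-Api-Key).
func debugHeaders(h http.Header) string {
	secret := map[string]bool{"Dd-Api-Key": true, "Dd-Application-Key": true}
	var b strings.Builder
	for name, values := range h {
		for _, v := range values {
			if secret[name] {
				v = maskSecret(v)
			}
			fmt.Fprintf(&b, "%s: %s\n", name, v)
		}
	}
	return b.String()
}

func main() {
	req, _ := http.NewRequest(http.MethodGet, "https://api.datadoghq.com/api/v1/query", nil)
	req.Header.Set("DD-API-KEY", "0123456789abcdef")
	req.Header.Set("DD-APPLICATION-KEY", "fedcba9876543210")
	// Debug-level output only: the full URL plus masked headers.
	fmt.Printf("datadog query: %s\n%s", req.URL.String(), debugHeaders(req.Header))
}
```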

ccystephenclinton commented 4 months ago

@jonnylangefeld did you ever get to the bottom of this? I'm still seeing the issue now on queries that use default_zero in Datadog. The problem is that if you don't set it, you get 'No Data'. A rough example of what I mean is below.
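For context, a query using default_zero would sit in a MetricTemplate roughly like this hedged sketch (the metric name, labels, namespace, and secret name are placeholders):

```yaml
# Hypothetical MetricTemplate wrapping the Datadog query in default_zero,
# so gaps in the series return 0 instead of 'No Data'.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: slo
  namespace: test
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: default_zero(avg:service.slo{service:my-service})
```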