Open jonnylangefeld opened 2 years ago
@jonnylangefeld I’m Ok with adding debug logs as long as we mask the token. Would you like to contribute this?
@jonnylangefeld did you ever get to the bottom of this issue? I'm still seeing it now on queries that use default_zero in DD. The problem is, if you don't set that, you get 'No Data'.
Describe the bug
We are running flagger as a central service used by multiple other services for progressive deliveries. The services use Datadog queries via the MetricsTemplate. The other day we had an outage where only one of the services could not progress because it received only nil values in the Datadog response.
This same canary has worked flawlessly in the past. During this incident several retries of the rollout did not succeed. A restart of the flagger pod and another retry of the rollout resolved the issue for now.
We ran the same query that flagger receives via the MetricsTemplate through the Datadog UI and also as a curl command (to more closely replicate what flagger should be doing behind the scenes) with the approximate timestamps of the progressive delivery. In both cases the metrics in the Datadog response had legitimate values above 90, so the delivery should have progressed.
Since at this point it's unclear where the issue happened (FWIW it could be a faulty Datadog query that returns unexpected responses based on the timestamps flagger uses), we suggest enabling debug output in flagger that prints the Datadog query including all headers (with the token hidden) and the response. Only this would let us reproduce the exact query flagger sends under the hood, with the exact timestamps, to further debug why flagger did not receive any data from Datadog. This would be debug output only, so it won't affect any other operations.
To Reproduce
So far we have not been able to reproduce the issue, hence the suggestion to add debug logs for the Datadog query and the Datadog response.
Expected behavior
The delivery should have progressed because the values we observed via the Datadog UI and API were all within good ranges (above 90).
Additional context
1.15.0
1.21.11