DataDog / extendeddaemonset

Kubernetes Extended Daemonset controller
Apache License 2.0
98 stars 13 forks source link

Fail check-eds if canary is marked as failed #170

Closed sblumenthal closed 11 months ago

sblumenthal commented 11 months ago

What does this PR do?

This PR checks the conditions within the EDS status field for the presence of the "Canary-Failed" condition, and compares its timestamp to the timestamp of the active replicaset to determine if the EDS canary failed or not

Motivation

An agent version was deployed to staging that caused the canary to automatically fail due to a high number of restarts, but the check did not look at the canary status, so when desired replicas matched running replicas, the check succeeded, and we were none the wiser.

Additional Notes

While testing the behavior, I noticed that the EDS object does not denote it has failed in any other way other than the conditions field. Here is the status field of an EDS object which had just failed:

status:
  activeReplicaSet: datadog-agent-tt9z6
  available: 30
  conditions:
    - lastTransitionTime: '2023-11-22T19:29:15Z'
      lastUpdateTime: '2023-11-22T19:29:15Z'
      message: 'canary failed with ers: datadog-agent-8n26h'
      reason: CanaryFailed
      status: 'False'
      type: Canary-Failed
    - lastTransitionTime: '2023-11-15T20:14:07Z'
      lastUpdateTime: '2023-11-15T20:14:07Z'
      message: 'canary paused with ers: datadog-agent-425pw'
      reason: CreateContainerConfigError
      status: 'False'
      type: Canary-Paused
  current: 33
  desired: 33
  ignoredUnresponsiveNodes: 0
  ready: 30
  state: Running
  upToDate: 23

Based on this example, I built the check to look at the conditions slice for the Canary-Failed type

Describe your test plan

I have already deployed this change to a staging cluster, and have manually failed a deployment and was able to verify that the check failed with: Error: active canary has a creation timestamp before the last CanaryFailed condition, meaning the deployment failed

Before change

After change

The deployment values can be seen here

codecov-commenter commented 11 months ago

Codecov Report

Merging #170 (28f728b) into main (8a74725) will increase coverage by 0.12%. The diff coverage is n/a.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170/graphs/tree.svg?width=650&height=150&src=pr&token=i8XjxcCWUi&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog)](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) ```diff @@ Coverage Diff @@ ## main #170 +/- ## ========================================== + Coverage 63.05% 63.18% +0.12% ========================================== Files 41 41 Lines 3094 3094 ========================================== + Hits 1951 1955 +4 + Misses 1023 1020 -3 + Partials 120 119 -1 ``` | [Flag](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) | `63.18% <ø> (+0.12%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog#carryforward-flags-in-the-pull-request-comment) to find out more. [see 1 file with indirect coverage changes](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). Last update [8a74725...28f728b](https://app.codecov.io/gh/DataDog/extendeddaemonset/pull/170?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog).