capactio / capact

Simple way to manage applications and infrastructure.
https://capact.io
Apache License 2.0

Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+ #679

Closed pkosiec closed 2 years ago

pkosiec commented 2 years ago

Description

Changes proposed in this pull request:

Notes

Currently, our E2E tests ignore Pods with the reason "Shutdown" because of the graceful node shutdown feature. This makes sense: in the node shutdown manager from Kubernetes 1.21, that is exactly the reason kubelet sets: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42

Since 15.03.2022, we have observed Pod failures on our long-running cluster:

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6-gke.1503", GitCommit:"2c7bbda09a9b7ca78db230e099cf90fe901d3df8", GitTreeState:"clean", BuildDate:"2022-02-18T03:17:45Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}

The Pods have the following statuses:

Status:         Failed
Reason:         NodeShutdown
Message:        Pod was rejected as the node is shutting down.

or

Status:         Failed
Reason:         Terminated
Message:        Pod was terminated in response to imminent node shutdown.

These failures are also related to graceful node shutdown, and they are completely normal behavior:

(Screenshot from 2022-03-18.)

However, what's really odd is that these reasons and messages are the "newer" ones, changed in 1.22: https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 (introduced by https://github.com/kubernetes/kubernetes/pull/102840). Our cluster runs 1.21, so it shouldn't be producing them yet.

Again, see the code and docs from 1.21:

I couldn't find anything relevant in the GCP GKE release notes or the GKE issue tracker.

I stopped the short investigation at this point, as:

Testing

I tested the PR against our GCP cluster:

  1. Authenticate to the cluster (gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION})
  2. Add the `_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"` import in `test/e2e/cluster_test.go`
  3. (Optional) Comment out line 41 (`fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),`) to run this test also against kube-system, where we have such Pods.
    • UPDATE: I disabled periodic tests for a while, as such terminated Pods also appear in the capact-system namespace. You don't need to comment out this line.
  4. Run the Cluster check test and see that it passes

You can run this test again to see what happens with the previous check: comment out lines 110-115:

    if strings.EqualFold(status.Reason, nodeShutdownReason) && strings.EqualFold(status.Message, nodeShutdownMessage) {
        return true
    }
    if strings.EqualFold(status.Reason, nodeShutdownNotAdmittedReason) {
        return true
    }

and run it again. You'll see that it fails with the same messages as in https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true