capactio / capact

Simple way to manage applications and infrastructure.
https://capact.io
Apache License 2.0

Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+ #679

Closed pkosiec closed 2 years ago

pkosiec commented 2 years ago

Description

Changes proposed in this pull request:

Notes

Currently, our E2E tests ignore Pods with the reason "Shutdown" because of the graceful node shutdown feature. This makes sense: in the node shutdown manager from Kubernetes 1.21, that is exactly the reason kubelet sets: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42

Since 15.03.2022, we have observed Pod failures on our long-running cluster:

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6-gke.1503", GitCommit:"2c7bbda09a9b7ca78db230e099cf90fe901d3df8", GitTreeState:"clean", BuildDate:"2022-02-18T03:17:45Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}

The Pods have the following statuses:

Status:         Failed
Reason:         NodeShutdown
Message:        Pod was rejected as the node is shutting down.

or

Status:         Failed
Reason:         Terminated
Message:        Pod was terminated in response to imminent node shutdown.

These failures are also related to graceful node shutdown, and they are completely normal behavior:

(Screenshot from 2022-03-18.)

However, what's really odd is that these reasons and messages are the "newer" ones, changed in 1.22: https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 (introduced by https://github.com/kubernetes/kubernetes/pull/102840). Our cluster runs 1.21, so it shouldn't be producing them yet.

Again, see the code and docs from 1.21:

I couldn't find anything relevant in the GCP GKE release notes or the GKE issue tracker.

I stopped the short investigation at this point, as:

Testing

I tested the PR against our GCP cluster:

  1. Authenticate to the cluster (gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION})
  2. Add the `_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"` import in `test/e2e/cluster_test.go`
  3. (Optional) Comment out line 41 (`fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),`) to run this test also against kube-system, where we have such Pods.
    • UPDATE: I disabled periodic tests for a while, as such terminated Pods also appear in the capact-system namespace. You don't need to comment out this line.
  4. Run the Cluster check test and see that it passes

You can run this test again to see what happens with the previous check: comment out lines 110-115:

    if strings.EqualFold(status.Reason, nodeShutdownReason) && strings.EqualFold(status.Message, nodeShutdownMessage) {
        return true
    }
    if strings.EqualFold(status.Reason, nodeShutdownNotAdmittedReason) {
        return true
    }

and run it again. You'll see that it fails with the same messages as in https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true