Description
Changes proposed in this pull request:
- Ignore Pods terminated because of the graceful node shutdown feature also by the statuses used since Kubernetes 1.22 (reason `Terminated` or `NodeShutdown`) in the E2E Cluster check test
Notes
Currently, our E2E tests ignore the Pods with reason `Shutdown`, because of the graceful node shutdown feature. That totally makes sense, because the node shutdown manager in K8s 1.21 uses exactly that reason: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42
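For context, the current check boils down to something like the sketch below (a simplified illustration, not the actual code from `test/e2e/cluster_test.go`; the `Shutdown` reason is taken from the 1.21 link above):

```go
package e2e

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isGracefulShutdownPod reports whether a failed Pod was evicted by the
// graceful node shutdown manager, using the single reason that the
// K8s 1.21 kubelet sets ("Shutdown").
func isGracefulShutdownPod(status corev1.PodStatus) bool {
	return status.Phase == corev1.PodFailed &&
		strings.EqualFold(status.Reason, "Shutdown")
}
```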
Since 15.03.2022 we started to observe Pod failures on our long-running cluster. The Pods have one of the following statuses: reason `Terminated` with message `Pod was terminated in response to imminent node shutdown.`, or reason `NodeShutdown` with message `Pod was rejected as the node is shutting down.`. These failures are also related to the graceful node shutdown, and it is totally normal behavior.
However, what's really weird is that these reasons and messages are the "newer" ones - they were changed in 1.22: https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 - see the PR which introduced that: https://github.com/kubernetes/kubernetes/pull/102840
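For reference, the constants in the linked 1.22 file read roughly as follows (see the link for the exact source) - these are also the identifiers referenced by the new check shown in the Testing section below:

```go
// pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go (v1.22.0)
const (
	nodeShutdownReason             = "Terminated"
	nodeShutdownMessage            = "Pod was terminated in response to imminent node shutdown."
	nodeShutdownNotAdmittedReason  = "NodeShutdown"
	nodeShutdownNotAdmittedMessage = "Pod was rejected as the node is shutting down."
)
```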
Again, compare it with the code and docs from 1.21 linked above.
I couldn't find anything useful in the GKE release notes or in the GKE issue tracker.
I stopped the short investigation at this point, as:
- This change will be useful anyway once we migrate to K8s 1.22+: #611
- I don't think continuing it on our own has high priority, apart from creating the issue. While it would be worth knowing "why", at least we know "what".

If we want to continue it, I can create a dedicated task for that. Idea: we could reconfigure our node pool, enable SSH, connect to a node and check the kubelet configuration - but I'm not sure if that's possible. Even if it is, I'm not sure we would find anything useful.
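As an alternative to SSH (just an idea, not verified against GKE): we might be able to read the running kubelet configuration through the API server's node proxy, assuming our credentials allow the `nodes/proxy` subresource. A minimal sketch (`NODE_NAME` is a placeholder):

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp" // GKE credential support
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig created by `gcloud container clusters get-credentials`.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Equivalent of `kubectl get --raw /api/v1/nodes/<node>/proxy/configz`.
	raw, err := cs.CoreV1().RESTClient().Get().
		Resource("nodes").
		Name("NODE_NAME").
		SubResource("proxy").
		Suffix("configz").
		DoRaw(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // JSON dump of the node's KubeletConfiguration
}
```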
Testing
I tested the PR against our GCP cluster:
1. Authenticate to the cluster: `gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION}`.
2. Add the `_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"` import in `test/e2e/cluster_test.go` (see the sketch after this list).
3. (Optional) Comment out line 41 (`fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),`) to run this test also against `kube-system`, where we have such Pods.
   UPDATE: I disabled periodic tests for a while, as we can notice such terminated Pods in the `capact-system` namespace. You don't need to comment out this line.
4. Run the Cluster check test and see that it passes.
5. To see what happens with the previous check, comment out lines 110-115:

   ```go
   if strings.EqualFold(status.Reason, nodeShutdownReason) && strings.EqualFold(status.Message, nodeShutdownMessage) {
   	return true
   }
   if strings.EqualFold(status.Reason, nodeShutdownNotAdmittedReason) {
   	return true
   }
   ```

   and run it again. You'll see that it fails with the same messages as in https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true
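For completeness, the import from step 2 is a blank import whose only effect is to register the GCP auth provider in client-go; a sketch of the relevant fragment (surrounding imports elided):

```go
import (
	// Blank import: registers the "gcp" auth provider plugin, so the
	// kubeconfig entries generated by gcloud work with the test's client.
	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
)
```

After that, the check can be run with the standard Go tooling, e.g. something like `go test ./test/e2e/... -run Cluster -v -count=1` (the exact package path, build tags, and test name depend on our E2E setup).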