FoundationDB / fdb-kubernetes-operator

A kubernetes operator for FoundationDB
Apache License 2.0

nodeTaints not detected while using taintReplacementOptions to rotate FDB cluster pods #2091

Closed kky-fury closed 2 months ago

kky-fury commented 3 months ago

What happened?

We were experimenting with taintReplacementOptions to rotate the pods of our FDB cluster onto new nodes while upgrading our Kubernetes version. However, after we applied the taints to the nodes, the operator did not detect them.

The operator logs showed the following error:

level":"error","msg":"pkg/mod/k8s.io/client-go@v0.26.10/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User \"system:serviceaccount:infra:fdb-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope\n"

What did you expect to happen?

We expected the operator to detect the taints on the nodes and to automatically delete and reschedule the coordinator, log, stateless, and storage pods onto new nodes.

How can we reproduce it (as minimally and precisely as possible)?

Taint the nodes running the FDB cluster pods with something similar to the following:

from kubernetes import client

# Taint applied to each node that hosts FDB cluster pods.
taint = client.V1Taint(
    key="foo/bar",
    value="fdbrotation",
    effect="PreferNoSchedule",
)
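The V1Taint object above only describes the taint; it still has to be written to the node. As a rough sketch of one way to do that, the helper below builds a patch body for a Node using plain dictionaries. The function name node_taint_patch is hypothetical, and the sketch assumes the body is sent as a JSON merge patch, which replaces the whole taints list, so any existing taints must be carried over.

```python
import json


def node_taint_patch(existing_taints, key, value, effect):
    """Build a Node patch body that adds or replaces one taint.

    Hypothetical helper: Kubernetes identifies a taint by key and effect,
    so an existing entry with the same pair is replaced, not duplicated.
    """
    kept = [
        t for t in existing_taints
        if not (t.get("key") == key and t.get("effect") == effect)
    ]
    kept.append({"key": key, "value": value, "effect": effect})
    return {"spec": {"taints": kept}}


# Values taken from the reproduction steps in this issue.
body = node_taint_patch([], "foo/bar", "fdbrotation", "PreferNoSchedule")
print(json.dumps(body, indent=2))
```

With the kubernetes Python client, a body like this could be passed to CoreV1Api.patch_node for each node name; that call is not shown here because it needs a live cluster.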

Patch the FDB cluster spec with something like the following:

# Patch body for the FoundationDBCluster resource (Python dict).
patch = {
    "spec": {
        "automationOptions": {
            "replacements": {
                "taintReplacementOptions": [
                    {
                        "key": "foo/bar",
                        "durationInSeconds": 300,
                    }
                ],
                "taintReplacementTimeSeconds": 60,
                "enabled": True,
            }
        }
    }
}

Anything else we need to know?

Adding the required permissions for the nodes resource to the operator's RBAC role fixed the issue.

Changes

We would like to merge to main if these changes are acceptable.

FDB Kubernetes operator

FDB-operator version: 1.33.0

Kubernetes version

K8s version: 1.27.12

Cloud provider

AWS, EKS

johscheuer commented 3 months ago

Hello 👋

Could you please verify if you have a ClusterRole similar to this one: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/deployment.yaml#L6-L19 for your operator deployment? The error that you copied says that the operator is not allowed to list nodes (and therefore cannot check the taints). If the ClusterRole exists, you have to make sure that there is a ClusterRoleBinding for your service account, similar to: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/deployment.yaml#L141-L152
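For readers without the linked file at hand, the following is a minimal sketch of the two objects described above, written as plain Python dictionaries. The object names (fdb-operator-node-reader and its binding) are hypothetical, the verb list get/list/watch is an assumption rather than a copy of the sample deployment, and the service account name and namespace are taken from the error message in this issue. The small allows helper is only an illustration of how a rule is matched, not the real Kubernetes authorizer.

```python
def node_reader_cluster_role(name):
    """ClusterRole granting read access to nodes (core API group "")."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": [""],
                "resources": ["nodes"],
                # Assumed verbs; the reflector in the log needs list and watch.
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def cluster_role_binding(name, role_name, sa_name, sa_namespace):
    """ClusterRoleBinding attaching the ClusterRole to a service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": sa_name, "namespace": sa_namespace}
        ],
    }


def allows(role, verb, resource, api_group=""):
    """Simplified check: does any rule in the role permit verb on resource?"""
    for rule in role.get("rules", []):
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        if ((api_group in groups or "*" in groups)
                and (resource in resources or "*" in resources)
                and (verb in verbs or "*" in verbs)):
            return True
    return False


# Hypothetical object names; service account taken from the error log.
role = node_reader_cluster_role("fdb-operator-node-reader")
binding = cluster_role_binding(
    "fdb-operator-node-reader-binding",
    "fdb-operator-node-reader",
    "fdb-operator",
    "infra",
)
print(allows(role, "list", "nodes"))
```

Because nodes are cluster-scoped, a namespaced Role and RoleBinding cannot grant this access, which is why both objects have to be the cluster-wide kinds.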

kky-fury commented 3 months ago

Hello,

Thank you for your reply. Yes, we did not have that before but added it to make it work.

Is there any plan to add it to the official helm chart?

johscheuer commented 3 months ago

We don't maintain the helm-charts actively as they were contributed by the community (see: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/README.md#using-helm). If you have the time to add it to the helm-charts and open a PR, that would be appreciated :)

kky-fury commented 3 months ago

I created one; please take a look at #2093.