FoundationDB / fdb-kubernetes-operator

A kubernetes operator for FoundationDB
Apache License 2.0

Coordinators are all over the place when changing redundancy from double to triple #1167

Closed · TSTTang closed this issue 2 years ago

TSTTang commented 2 years ago

What happened?

Hi, I have this cluster setup:

```yaml
services:
  headless: true
coordinatorSelectionSetting:
  processClass: storage
  priority: 1
databaseConfiguration:
  redundancy_mode: double
  storage_engine: ssd
  storage: 5
processCounts:
  stateless: 4
  cluster_controller: 1
  logs: 4
  proxy: 3
```

The cluster is healthy, and the coordinators are among the storage pods:

```
Coordination servers:
  172.17.104.202:4500:tls  (reachable)
  172.17.111.119:4500:tls  (reachable)
  172.17.127.249:4500:tls  (reachable)
```
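A minimal sketch of how such a status can be pulled from a running pod, assuming the operator's default `foundationdb` container name and that the pod environment already points `fdbcli` at the cluster file:

```console
$ kubectl exec -it sample-cluster-storage-1 -c foundationdb -- fdbcli --exec "status details"
```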

```
fdb-kubernetes-operator-controller-manager-8889b5df6-cv2zk   1/1   Running   0   35m    172.17.111.115   10.240.129.16
sample-cluster-cluster-controller-1                          2/2   Running   0   9m5s   172.17.119.17    10.240.1.12
sample-cluster-log-1                                         2/2   Running   0   9m6s   172.17.88.101    10.240.1.16
sample-cluster-log-2                                         2/2   Running   0   9m6s   172.17.112.203   10.240.1.14
sample-cluster-log-3                                         2/2   Running   0   9m5s   172.17.122.144   10.240.65.16
sample-cluster-log-4                                         2/2   Running   0   9m5s   172.17.108.205   10.240.129.14
sample-cluster-proxy-1                                       2/2   Running   0   9m6s   172.17.99.148    10.240.65.17
sample-cluster-proxy-2                                       2/2   Running   0   9m6s   172.17.85.13     10.240.129.17
sample-cluster-proxy-3                                       2/2   Running   0   9m6s   172.17.85.81     10.240.1.15
sample-cluster-stateless-1                                   2/2   Running   0   9m6s   172.17.69.207    10.240.65.15
sample-cluster-stateless-2                                   2/2   Running   0   9m6s   172.17.107.203   10.240.129.13
sample-cluster-stateless-3                                   2/2   Running   0   9m6s   172.17.88.13     10.240.1.13
sample-cluster-stateless-4                                   2/2   Running   0   9m6s   172.17.127.250   10.240.65.14
sample-cluster-storage-1                                     2/2   Running   0   9m7s   172.17.127.249   10.240.65.14
sample-cluster-storage-2                                     2/2   Running   0   9m7s   172.17.111.119   10.240.129.16
sample-cluster-storage-3                                     2/2   Running   0   9m6s   172.17.104.202   10.240.65.13
sample-cluster-storage-4                                     2/2   Running   0   9m6s   172.17.70.75     10.240.129.15
sample-cluster-storage-5                                     2/2   Running   0   9m6s   172.17.112.204   10.240.1.14
```

The coordinators are on storage-1, storage-2, and storage-3.

Now I modify the cluster spec file, change the redundancy mode to triple, and apply it. The new coordinators look like this:

```
Coordination servers:
  172.17.88.101:4500:tls  (reachable)
  172.17.108.205:4500:tls  (reachable)
  172.17.112.203:4500:tls  (reachable)
  172.17.122.144:4500:tls  (reachable)
  172.17.127.249:4500:tls  (reachable)
```

Now the coordinators are on log-1, log-2, log-3, log-4, and storage-1. Why did the coordinators all get scrambled onto the log pods instead of staying on the storage pods? No new pods were added...

What did you expect to happen?

Coordinators should be assigned from the pool of storage pods

How can we reproduce it (as minimally and precisely as possible)?

Create a cluster with 5 storage pods plus some other pods, set it to double redundancy, and change that to triple once the cluster is up; a rough sketch of the steps follows.
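A rough sketch, assuming the spec above is saved as a hypothetical `sample-cluster.yaml`:

```console
$ kubectl apply -f sample-cluster.yaml    # spec contains redundancy_mode: double
# wait until the cluster is up and healthy, then edit the file
# to set redundancy_mode: triple and apply again
$ kubectl apply -f sample-cluster.yaml
```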

Anything else we need to know?

No response

FDB Kubernetes operator

```console
$ kubectl fdb version
foundationdb-operator: 0.48.0
kubectl-fdb: v0.48.0
```

Kubernetes version

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"96e95cef877ba04872b88e4e2597eabb0174d182", GitTreeState:"clean", BuildDate:"2021-09-10T10:44:50Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5+5c84e52", GitCommit:"ce18cbe56f6e88a8fc0e06366afe113b415ad39b", GitTreeState:"clean", BuildDate:"2022-03-01T18:44:38Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.22) exceeds the supported minor version skew of +/-1
```

Cloud provider

IBM Cloud OpenShift cluster
johscheuer commented 2 years ago

I see your setting is actually wrong:

```yaml
coordinatorSelectionSetting:
  processClass: storage
  priority: 1
```

must be:

```yaml
coordinatorSelection:
- processClass: storage
  priority: 1
```

So from the operator's point of view you didn't set a coordinatorSelection, and all stateful processes are eligible. Could you repeat your test with the corrected setting? I actually would have expected the Kubernetes API to complain about this.
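For context, a minimal sketch of where the corrected field sits in the full resource, assuming the `apps.foundationdb.org/v1beta1` API served by operator 0.48.0 (values taken from the spec quoted in the report):

```yaml
apiVersion: apps.foundationdb.org/v1beta1
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  # coordinatorSelection is a list at the top level of the spec,
  # not a single coordinatorSelectionSetting mapping
  coordinatorSelection:
  - processClass: storage
    priority: 1
  databaseConfiguration:
    redundancy_mode: triple
    storage_engine: ssd
    storage: 5
```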

johscheuer commented 2 years ago

Could you confirm if this solved your issue?

TSTTang commented 2 years ago

Yes, that works. Now, after changing the redundancy from double to triple, the new coordinator list is still within the storage pod list. So, this issue can be closed.