three_data_hall inconsistent connection string problem

gm42 commented 1 week ago

What happened?

I have found a quirky issue when using three_data_hall: the Kubernetes FoundationDB cluster status contains an outdated connection string that does not match the one the cluster is currently using:

A:  connectionString:     mydb:dZiqyRWzkuM8tZ6hHx4Z3xWsFdu4BV0F@10.10.2.18:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501
A:  seedConnectionString: <none>

B:  seedConnectionString: mydb:mCYmFHPkEfyKO5gl8ae5ocZPq2a0IJW4@10.10.20.248:4501,10.20.22.192:4501,10.20.67.101:4501,10.10.210.33:4501,10.10.241.115:4501,10.10.268.53:4501,10.10.277.35:4501,10.10.285.107:4501,10.20.220.3:4501
B:  connectionString: mydb:9pION4FdMvW53gB4ikdjo61cs7HQKtK3@10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.220.14:4501,10.10.265.72:4501,10.20.200.1:4501,10.20.217.63:4501,10.20.243.153:4501

C:  seedConnectionString: mydb:mCYmFHPkEfyKO5gl8ae5ocZPq2a0IJW4@10.10.20.248:4501,10.20.22.192:4501,10.20.67.101:4501,10.10.210.33:4501,10.10.241.115:4501,10.10.268.53:4501,10.10.277.35:4501,10.10.285.107:4501,10.20.220.3:4501
C:  connectionString: mydb:dZiqyRWzkuM8tZ6hHx4Z3xWsFdu4BV0F@10.10.2.18:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501

When checking directly via get \xff\xff/connection_string, the correct connection string for the cluster appears to be the one with generation ID 9pION4FdMvW53gB4ikdjo61cs7HQKtK3.

What did you expect to happen?

The operator-maintained connection string field should always match get \xff\xff/connection_string.
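The expectation above can be checked mechanically. The following is a minimal sketch, not operator code: it parses the `description:ID@coordinators` format of each reported connection string and lists the halls whose value differs from the live one. The function names are hypothetical, and the live value is assumed to be identical to the string reported by hall B, since that is the only one carrying generation ID 9pION4FdMvW53gB4ikdjo61cs7HQKtK3.

```python
from typing import NamedTuple


class ConnectionString(NamedTuple):
    description: str
    generation_id: str
    coordinators: frozenset


def parse_connection_string(raw: str) -> ConnectionString:
    """Split 'description:ID@addr1,addr2,...' into its three parts."""
    prefix, _, addresses = raw.strip().partition("@")
    description, _, generation_id = prefix.partition(":")
    if not (description and generation_id and addresses):
        raise ValueError(f"malformed connection string: {raw!r}")
    # Coordinator order is ignored by comparing the addresses as a set.
    return ConnectionString(description, generation_id, frozenset(addresses.split(",")))


def stale_halls(live: str, reported: dict) -> list:
    """Return the halls whose reported connection string differs from the live one."""
    live_parsed = parse_connection_string(live)
    return sorted(
        hall for hall, raw in reported.items()
        if parse_connection_string(raw) != live_parsed
    )


# connectionString values copied from the status of each hall in this report.
REPORTED = {
    "A": "mydb:dZiqyRWzkuM8tZ6hHx4Z3xWsFdu4BV0F@10.10.2.18:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501",
    "B": "mydb:9pION4FdMvW53gB4ikdjo61cs7HQKtK3@10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.220.14:4501,10.10.265.72:4501,10.20.200.1:4501,10.20.217.63:4501,10.20.243.153:4501",
    "C": "mydb:dZiqyRWzkuM8tZ6hHx4Z3xWsFdu4BV0F@10.10.2.18:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501",
}

# Assumption: the value read from \xff\xff/connection_string equals hall B's string.
LIVE = REPORTED["B"]

print(stale_halls(LIVE, REPORTED))  # ['A', 'C']
```

With the values from this report, halls A and C are flagged as stale, which matches the observation that only hall B carries the current generation ID.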

How can we reproduce it (as minimally and precisely as possible)?

1. Trigger a rotation of some pods in data hall A, for example the coordinators, by changing something in the podTemplate for coordinators.
2. Repeat for halls B and C until the issue manifests.

Anything else we need to know?

Related: #1958

By my analysis the client issues are symptoms and not the cause: there is always going to be a client with an incorrect connection string if the operator-provided configmap does not contain the same connection string for all halls.

FDB Kubernetes operator

v1.47.0

Kubernetes version

```console
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4-eks-a737599
```

Cloud provider

johscheuer commented 1 week ago

This is a known issue when multiple operator instances are used (as in the multi-region setup or in three-data-hall). I assume you used the split image, as this is the default. In the newer unified image we added a propagation mechanism to reduce the time during which the connection string is outdated. The TL;DR is that the fdb-kubernetes-monitor updates an annotation when the local cluster file changes (which is the case when the connection string is updated). The annotation change then triggers a new reconciliation loop, and the operator updates the FoundationDBCluster resource and the ConfigMap.
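The propagation idea described above can be illustrated with a small sketch. This is not the fdb-kubernetes-monitor implementation: the annotation key, the use of a SHA-256 digest, and the function names are all assumptions made for illustration, and the pod annotations are modelled as a plain dictionary rather than a Kubernetes API object. It only shows the change-detection step, where an annotation is rewritten when the cluster file contents change, which is the event that would trigger a reconciliation.

```python
import hashlib

# Hypothetical annotation key, used only for this illustration.
CLUSTER_FILE_ANNOTATION = "example.invalid/cluster-file-sha256"


def cluster_file_digest(contents: str) -> str:
    """Digest of the cluster file contents, ignoring surrounding whitespace."""
    return hashlib.sha256(contents.strip().encode("utf-8")).hexdigest()


def sync_annotation(annotations: dict, cluster_file_contents: str) -> bool:
    """Update the annotation if the cluster file changed.

    Returns True when the annotation was written, i.e. when a watcher on the
    pod would see a change and could start a new reconciliation.
    """
    digest = cluster_file_digest(cluster_file_contents)
    if annotations.get(CLUSTER_FILE_ANNOTATION) == digest:
        return False
    annotations[CLUSTER_FILE_ANNOTATION] = digest
    return True


# Connection strings taken from this report: the stale one and the current one.
OLD = "mydb:dZiqyRWzkuM8tZ6hHx4Z3xWsFdu4BV0F@10.10.2.18:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501"
NEW = "mydb:9pION4FdMvW53gB4ikdjo61cs7HQKtK3@10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.220.14:4501,10.10.265.72:4501,10.20.200.1:4501,10.20.217.63:4501,10.20.243.153:4501"

pod_annotations = {}
print(sync_annotation(pod_annotations, OLD))  # True: first observation
print(sync_annotation(pod_annotations, OLD))  # False: nothing changed
print(sync_annotation(pod_annotations, NEW))  # True: coordinators changed
```

The point of the design is that the trigger comes from the pod that observed the change, so an operator instance does not have to wait for its periodic reconciliation to learn that another instance changed the coordinators.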

I changed the label to question, as we fixed this in the unified image and we have no plans to fix it for the split image.

gm42 commented 1 week ago

Thanks for the explanation; is the unified image considered stable by now?

> In the newer unified image we added a propagation mechanism to reduce the time during which the connection string is outdated.

I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it persists for days) unless I manually fix it or trigger further rotations.

johscheuer commented 1 week ago

> Thanks for the explanation; is the unified image considered stable by now?

It's considered stable and can be used in production, but please test it first in your dev/test environment. If anything doesn't work as expected please create a GitHub issue for it.

> I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it persists for days) unless I manually fix it or trigger further rotations.

Interesting, I would have thought that after some time (I think the default reconciliation period is 10h) the connection string will be updated by the operator. Has the operator reconciled in that time? I would expect it didn't execute another reconciliation loop in this case.

gm42 commented 1 week ago

> Has the operator reconciled in that time? I would expect it didn't execute another reconciliation loop in this case.

IIRC the operator reports in its logs that nothing to reconcile was found, so it does not take any action; I'll reproduce the issue and paste the relevant logs here, in case they are of interest for a related bug. I'll then keep this test cluster running, in case you want me to check something else while it's affected.

johscheuer commented 1 week ago

The operator should notice that the connection strings do not match (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/cluster_controller.go#L580-L586) and update the connection string in the FoundationDBCluster resource. The updateStatus reconciler will then update the FoundationDBCluster resource status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/update_status.go#L254-L263.

gm42 commented 1 week ago

I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.

gm42 commented 1 week ago

I think the operator might have detected the problem but refrained from making any change because the cluster was unhealthy, which in turn was caused by clients using the incorrect connection string (depending on which of the 3 ConfigMaps they are using); I will report back as soon as I can confirm this.

gm42 commented 1 week ago

While I wait to reproduce this issue anew, I have been looking at the logs I have about the last time this issue happened. Both those 2 theories I mentioned seem unfounded; does this perhaps shed some light?

e.g. is this evidence that operator is updating connection strings in some sort of loop?

johscheuer commented 3 days ago

> I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.

The cached status shouldn't be an issue because the status is only cached for one reconciliation.

> While I wait to reproduce this issue anew, I have been looking at the logs I have about the last time this issue happened. Both those 2 theories I mentioned seem unfounded; does this perhaps shed some light?

Based on those logs it seems like the different coordinator instances have changed the coordinators? It might be worth checking for these log lines during the same timeframe: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/change_coordinators.go#L83
