gm42 opened this issue 1 week ago (status: Open)
This is a known issue when multiple operator instances are used (as in the multi-region or three-data-hall setups). I assume you used the split image, as this is the default. In the newer unified image we added a propagation mechanism to reduce the time during which an out-dated connection string is in place. The TL;DR is that the fdb-kubernetes-monitor updates an annotation when the local cluster file changes (which is the case when the connection string is updated). The annotation change then triggers a new reconciliation loop, and the operator updates the `FoundationDBCluster` resource and the ConfigMap.
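For illustration only, here is a minimal, self-contained sketch of the general mechanism described above: controller-runtime's `AnnotationChangedPredicate` fires on annotation updates, which is how an annotation written by the monitor can enqueue a fresh reconciliation. The annotation key below is hypothetical, not the one fdb-kubernetes-monitor actually writes, and this is not the operator's code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func main() {
	// Hypothetical annotation key, used here only to illustrate the idea; the
	// real key written by fdb-kubernetes-monitor is not shown in this thread.
	const clusterFileAnnotation = "example.org/cluster-file-hash"

	oldPod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:        "storage-1",
		Annotations: map[string]string{clusterFileAnnotation: "hash-of-old-connection-string"},
	}}
	newPod := oldPod.DeepCopy()
	newPod.Annotations[clusterFileAnnotation] = "hash-of-new-connection-string"

	// AnnotationChangedPredicate returns true when the annotations differ, so a
	// controller configured with it re-reconciles as soon as the annotation is
	// updated, instead of waiting for the next periodic reconciliation.
	p := predicate.AnnotationChangedPredicate{}
	changed := p.Update(event.UpdateEvent{ObjectOld: oldPod, ObjectNew: newPod})
	fmt.Println("would trigger a new reconciliation:", changed) // true
}
```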
I changed the label to question, as we fixed it in the unified image and we have no plans to fix it for the split image.
Thanks for the explanation; is the unified image considered stable by now?

> In the newer unified image we added a propagation mechanism to reduce the time during which an out-dated connection string is in place.

I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it stays like this for days) unless I manually fix it or trigger further rotations.
> Thanks for the explanation; is the unified image considered stable by now?

It's considered stable and can be used in production, but please test it first in your dev/test environment. If anything doesn't work as expected, please create a GitHub issue for it.
> I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it stays like this for days) unless I manually fix it or trigger further rotations.

Interesting, I would have thought that after some time (I think the default reconciliation period is 10h) the connection string would be updated by the operator. Has the operator reconciled in that time? I would expect that it didn't execute another reconciliation loop in this case.
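To make the periodic reconciliation concrete, here is a hedged sketch (not the operator's actual code) of how a controller-runtime reconciler typically requeues itself after a fixed interval; the reconciler type and the interval constant are illustrative, and the operator's real default and flag names may differ.

```go
package main

import (
	"context"
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is a stand-in for the real cluster reconciler in
// controllers/cluster_controller.go; it only demonstrates the requeue pattern.
type exampleReconciler struct{}

// reconciliationPeriod mirrors the ~10h default mentioned above (assumption:
// the exact value used by the operator may differ).
const reconciliationPeriod = 10 * time.Hour

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... normal reconciliation work would happen here ...

	// Even when nothing appears to have changed, requeue after a fixed period so
	// drift (such as an out-of-date connection string) is eventually picked up.
	return ctrl.Result{RequeueAfter: reconciliationPeriod}, nil
}

func main() {
	r := &exampleReconciler{}
	res, err := r.Reconcile(context.Background(), ctrl.Request{})
	fmt.Println("requeue after:", res.RequeueAfter, "err:", err)
}
```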
> Has the operator reconciled in that time? I would expect that it didn't execute another reconciliation loop in this case.

IIRC the operator reports in the logs that nothing to reconcile was found, so it does not take any action; I'll reproduce the issue and copy/paste the relevant logs here, in case they are of interest for a related bug. I'll then keep this test cluster running, in case you want me to check something else while it's affected.
The operator should notice that the connection strings are not matching: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/cluster_controller.go#L580-L586 and update the connection string in the `FoundationDBCluster` resource, and then the `updateStatus` reconciler will update the `FoundationDBCluster` resource status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/update_status.go#L254-L263.
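As a rough illustration of the check the linked code performs (the types and field names below are simplified stand-ins, not the operator's actual API), the idea is: compare the connection string reported by the running cluster with the one recorded in the resource, and adopt the live one when they differ so the ConfigMap is regenerated from current data.

```go
package main

import "fmt"

// Simplified stand-ins for the operator's types; the real FoundationDBCluster
// CRD has more fields and the update goes through the Kubernetes API.
type clusterStatus struct {
	ConnectionString string
}

type foundationDBCluster struct {
	Name   string
	Status clusterStatus
}

// syncConnectionString sketches the idea behind the linked check in
// cluster_controller.go: if the live connection string differs from the one in
// the resource status, adopt the live one and signal that a status update (and
// hence a ConfigMap refresh) is needed.
func syncConnectionString(cluster *foundationDBCluster, live string) bool {
	if cluster.Status.ConnectionString == live {
		return false // nothing to do
	}
	cluster.Status.ConnectionString = live
	return true // caller would persist the status and requeue
}

func main() {
	cluster := &foundationDBCluster{
		Name:   "sample-cluster",
		Status: clusterStatus{ConnectionString: "sample:oldgen@10.1.0.1:4501"},
	}
	updated := syncConnectionString(cluster, "sample:newgen@10.1.0.2:4501")
	fmt.Println("status updated:", updated, "->", cluster.Status.ConnectionString)
}
```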
I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.
I think the operator might have detected the problem but refrained from making any change because the cluster was unhealthy, which in turn was caused by clients using the incorrect connection string (depending on which of the 3 ConfigMaps they are using); I will report back as soon as I can confirm this.
While I wait to reproduce this issue anew, I have been looking at the logs I have from the last time this issue happened. Both of the theories I mentioned seem unfounded; does this perhaps shed some light? E.g., is this evidence that the operator is updating connection strings in some sort of loop?
> I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.

The cached status shouldn't be an issue because the status is only cached for one reconciliation.
> While I wait to reproduce this issue anew, I have been looking at the logs I have from the last time this issue happened. Both of the theories I mentioned seem unfounded; does this perhaps shed some light?

Based on those logs it seems like the different operator instances have changed the coordinators? It might be worth checking for these logs: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/change_coordinators.go#L83 during the same timeframe.
What happened?
I have found a quirky issue when using `three_data_hall`; the issue is that the `FoundationDBCluster` status in Kubernetes contains an outdated connection string which does not correspond to what the cluster is currently using. When checking directly via `get \xff\xff/connection_string`, the correct connection string for the cluster appears to be the one with generation ID 9pION4FdMvW53gB4ikdjo61cs7HQKtK3.
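For completeness, a minimal sketch of reading that special key programmatically with the FoundationDB Go binding (the fdbcli equivalent is simply `get \xff\xff/connection_string`); the API version and default cluster-file discovery are assumptions about your environment, and depending on the client version special-key-space access options may be needed.

```go
package main

import (
	"fmt"
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	// API version and default cluster-file discovery are assumptions; adjust to
	// the FDB version and cluster file you are running with.
	fdb.MustAPIVersion(710)
	db := fdb.MustOpenDefault()

	// Read the special key that reflects the connection string the cluster is
	// currently using (same as `get \xff\xff/connection_string` in fdbcli).
	val, err := db.ReadTransact(func(tr fdb.ReadTransaction) (interface{}, error) {
		return tr.Get(fdb.Key("\xff\xff/connection_string")).MustGet(), nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cluster is currently using: %s\n", val.([]byte))
}
```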
What did you expect to happen?
The operator-maintained connection string field should always match `get \xff\xff/connection_string`.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
Related: #1958
By my analysis the client issues are symptoms and not the cause: there is always going to be a client with an incorrect connection string if the operator-provided ConfigMaps do not contain the same connection string for all halls.
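One way to confirm this from outside the cluster is to compare the connection string the operator wrote into each hall's ConfigMap. A hedged sketch with client-go follows; the namespace, label selector, and the `cluster-file` data key are assumptions about the operator's conventions, so adjust them to your deployment.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative setup).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace, label selector, and data key are assumptions; adjust them to
	// however your three-data-hall clusters are deployed.
	cms, err := client.CoreV1().ConfigMaps("default").List(context.Background(), metav1.ListOptions{
		LabelSelector: "foundationdb.org/fdb-cluster-name",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Group ConfigMaps by the connection string they expose; more than one
	// group means some clients are necessarily mounting a stale value.
	byConnectionString := map[string][]string{}
	for _, cm := range cms.Items {
		cs := cm.Data["cluster-file"]
		byConnectionString[cs] = append(byConnectionString[cs], cm.Name)
	}
	for cs, names := range byConnectionString {
		fmt.Printf("%q is used by %v\n", cs, names)
	}
	if len(byConnectionString) > 1 {
		fmt.Println("the halls disagree on the connection string")
	}
}
```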
FDB Kubernetes operator
Kubernetes version
Cloud provider