funkypenguin closed this issue 2 years ago
The operator currently tries to select 5 coordinators (see: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/master/api/v1beta1/foundationdbcluster_types.go#L1365-L1371), which differs slightly from the documentation, which only states "Ideal number of coordination servers", so we might want to document this in a better place (if there is a gap). That means you need at least 5 stateful Pods (log, storage). What fault domain did you configure? The default is host based (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/master/docs/manual/fault_domains.md#option-1-single-kubernetes-replication), which means that when multiple Pods run on the same host, only one of them can be elected as a coordinator.
I think we should update the "error" message to state how many Pods we are expecting and how many we got.
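The coordinator-count rule described above can be sketched as a small Go function. This is a simplified illustration, not the operator's actual code (the linked `foundationdbcluster_types.go` is authoritative); the function name and the handling of modes other than `single`/`double`/`triple` are assumptions for this sketch:

```go
package main

import "fmt"

// desiredCoordinators is a simplified sketch of the rule discussed above:
// FoundationDB wants 1 coordinator for single redundancy, 3 for double,
// and 5 for triple, so that a majority quorum survives the number of
// failures each mode is meant to tolerate.
func desiredCoordinators(redundancyMode string) int {
	switch redundancyMode {
	case "single":
		return 1
	case "double":
		return 3
	default: // treat "triple" (and stronger modes) as needing 5
		return 5
	}
}

func main() {
	for _, mode := range []string{"single", "double", "triple"} {
		fmt.Printf("%s => %d coordinators\n", mode, desiredCoordinators(mode))
	}
}
```

Because the default fault domain is host based, each of those 5 coordinators must land on a distinct host, which is why 5 stateful Pods on fewer than 5 hosts is not enough.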
Thanks @johscheuer - my takeaway here is that if I want triple redundancyMode (I do), I need 5 stateful pods, spread across 5 separate hosts. I've got this working now, and I've noted that if I take a host down (i.e., for upgrades/maintenance), my existing FDB clusters continue to operate (presumably degraded), but I'm unable to deploy any new clusters.
Cheers! D
By "unable to deploy any new clusters" do you mean creating a different FoundationDB cluster in the same namespace? One broken cluster shouldn't block the other one, since the controller queue should ensure that both clusters are reconciled (and they are independent). It can take longer for the second cluster to reconcile, since the first cluster will consume resources from the operator (the operator tries to elect a new coordinator until the quorum is reached again). I would recommend running at least 6 stateful Pods for a cluster with triple redundancy. I'll add the documentation label and we should add this information to our user manual.
What I mean is that it appears that if I have:
- a Kubernetes cluster with 5 nodes, with
- a reconciled FDB cluster in triple redundancy mode with
- 5 stateful pods, each on a separate host,
.. and I lose one of my nodes (leaving 4 remaining), my reconciled FDB cluster can still be used.
However, until a 5th node is available again, I won't be able to deploy a new FDB cluster, since the operator won't start a triple-redundancy cluster on only 4 nodes.
D
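The asymmetry described here follows from majority quorum arithmetic. A minimal Go sketch (the function names are illustrative, not operator APIs) shows why an existing cluster with 5 coordinators, one per host, tolerates a host loss, while a new triple-redundancy cluster still needs all 5 hosts available up front:

```go
package main

import "fmt"

// quorum returns the majority needed among n coordinators.
func quorum(n int) int { return n/2 + 1 }

// survivesHostLoss reports whether a running cluster with one coordinator
// per host still has a coordinator quorum after `lost` hosts fail.
func survivesHostLoss(coordinators, lost int) bool {
	return coordinators-lost >= quorum(coordinators)
}

func main() {
	// 5 coordinators, 1 host down: 4 remaining >= quorum of 3, so the
	// existing cluster keeps operating (degraded).
	fmt.Println(survivesHostLoss(5, 1)) // true
	// Even 2 hosts down leaves quorum; 3 down does not.
	fmt.Println(survivesHostLoss(5, 2)) // true
	fmt.Println(survivesHostLoss(5, 3)) // false
}
```

Creating a new cluster is the stricter case: the operator wants to place all 5 coordinators on distinct hosts before reconciling, so 4 available hosts block the initial deployment even though 4 would sustain an already-running cluster.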
Yes, that makes sense since we are waiting for 5 stateful Pods. I'll add some additional information for the coordinators in our fault domain documentation.
Reminder to document this setting.
Hey guys!
On v0.38.0 (and 0.37.0) I've not been able to get my cluster to reconcile when I use triple redundancy mode.
Here's my CR:
The 4 pods are created within about 30s, but the operator continually logs:
As soon as I change `redundancyMode` to `double`, the operator reconciles the cluster. My reading of https://apple.github.io/foundationdb/configuration.html#configuration-choosing-redundancy-mode indicates that 4 machines should be a supported configuration for triple redundancy mode. To be sure, I also tried `storage: 5`, but the same behaviour resulted. I've previously had triple redundancy mode working with this cluster configuration on 0.36.0.
Thanks! D