Closed larshagencognite closed 1 hour ago
Sorry for the delay. The issue is/was how the operator runs the different reconcilers:
subReconcilers := []clusterSubReconciler{
	updateStatus{},
	updateLockConfiguration{},
	updateConfigMap{},
	checkClientCompatibility{},
	deletePodsForBuggification{},
	replaceMisconfiguredProcessGroups{},
	replaceFailedProcessGroups{},
	addProcessGroups{},
	addServices{},
	addPVCs{},
	addPods{}, // --> Creates the Pod
	generateInitialClusterFile{},
	removeIncompatibleProcesses{},
	updateSidecarVersions{},
	updatePodConfig{},
	updateMetadata{},
	updateDatabaseConfiguration{},
	chooseRemovals{},
	excludeProcesses{},
	changeCoordinators{},
	bounceProcesses{},
	maintenanceModeChecker{},
	updatePods{},
	removeProcessGroups{}, // --> Checks the exclusion state
	removeServices{},
	updateStatus{},
}
Because the check for the exclusion state happens in a later step, the operator creates the Pods first. The operator does check whether all addresses are excluded: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/remove_process_groups.go#L355-L361. We can try to recreate this scenario in an e2e test case to see if we still see the issue.
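To make the ordering effect concrete, here is a minimal sketch of a sequential sub-reconciler loop. It is not the operator's code: the `processGroup` struct, the `subReconciler` interface, and both step types are hypothetical simplifications that only model "create the Pod first, check the exclusion state later".

```go
package main

import "fmt"

// processGroup is a hypothetical, simplified stand-in for a process group's state.
type processGroup struct {
	ID               string
	MarkedForRemoval bool
	Excluded         bool
	HasPod           bool
}

// subReconciler is a hypothetical interface; the real operator's signature differs.
type subReconciler interface {
	name() string
	reconcile(groups []*processGroup) []string
}

// addPodsStep creates a Pod for every process group that lacks one.
// It does not look at the removal or exclusion state.
type addPodsStep struct{}

func (addPodsStep) name() string { return "addPods" }

func (addPodsStep) reconcile(groups []*processGroup) []string {
	var events []string
	for _, pg := range groups {
		if !pg.HasPod {
			pg.HasPod = true
			events = append(events, "created pod for "+pg.ID)
		}
	}
	return events
}

// removeProcessGroupsStep is the only step in this sketch that checks exclusion.
type removeProcessGroupsStep struct{}

func (removeProcessGroupsStep) name() string { return "removeProcessGroups" }

func (removeProcessGroupsStep) reconcile(groups []*processGroup) []string {
	var events []string
	for _, pg := range groups {
		if !pg.MarkedForRemoval {
			continue
		}
		if !pg.Excluded {
			events = append(events, "waiting for exclusion of "+pg.ID)
			continue
		}
		pg.HasPod = false
		events = append(events, "removed resources of "+pg.ID)
	}
	return events
}

// runAll executes the steps strictly in slice order and collects their events.
func runAll(steps []subReconciler, groups []*processGroup) []string {
	var log []string
	for _, step := range steps {
		for _, e := range step.reconcile(groups) {
			log = append(log, step.name()+": "+e)
		}
	}
	return log
}

func main() {
	// A process group that was excluded and whose Pod was deleted by hand.
	groups := []*processGroup{{ID: "storage-1", MarkedForRemoval: true, Excluded: true}}
	steps := []subReconciler{addPodsStep{}, removeProcessGroupsStep{}}
	for _, line := range runAll(steps, groups) {
		fmt.Println(line)
	}
}
```

In this model the Pod is recreated before the removal step ever looks at the exclusion state, which matches the behaviour described in the issue. In the real cluster the recreated Pod may also come up with a new address, which this sketch does not model.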
Do you think you have the time/capacity to work on this?
@larshagencognite have you seen this issue again or are you able to write an e2e test for it?
Sorry for the late reply. If you manually excluded the processes and deleted the PVC + pods, a better choice is to use processGroupsToRemoveWithoutExclusion for removing the process groups: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/cluster_spec.md#foundationdbclusterspec. That way the IsExcluded() will return true in the AddPods reconciler and the pods will not be recreated (same for the pvc).
I'm going to close this issue, feel free to reopen it if you think anything should be changed (e.g. documentation). Thanks again for reporting this issue.
What happened?
Operator 1.28.1, FDB 7.1.43. Using a service for the public IP, not using DNS.
In order to recover a failing FDB cluster, we had
What did you expect to happen?
I might have expected the operator to not recreate process groups if they are already fully excluded. This could have been checked before recreation.
It is also a bit surprising that the operator will try to recreate the PVC of a process group that is marked for deletion, as I don't know if that will ever help recover any data.
When the nodes have been recreated with a new IP, I would expect the operator to verify that exclusion was completed for all IP addresses.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
This was also reported in this forum post: https://forums.foundationdb.org/t/incomplete-exclusion-in-fdb-operator/4301
FDB Kubernetes operator
Kubernetes version
Cloud provider