FoundationDB / fdb-kubernetes-operator

A kubernetes operator for FoundationDB
Apache License 2.0

Incomplete exclusion in FDB operator #1912

Closed larshagencognite closed 1 hour ago

larshagencognite commented 11 months ago

What happened?

Operator 1.28.1, FDB 7.1.43. Using a service for the public IP, not using DNS.

In order to recover a failing FDB cluster, we took steps 1 to 4 below; steps 5 to 10 describe what happened afterwards:

  1. Turned off the operator.
  2. Excluded the storage nodes using fdbcli.
  3. Deleted the PVCs, Pods and services once the exclusion was complete.
  4. Turned the operator back on.
  5. The operator recreated the process groups that were marked for exclusion but, as far as the operator knew, not completely excluded (because the exclusion was done manually).
  6. The newly created services got the same instance ID as before, but a new public IP.
  7. The process groups now had two IPs, the old one and the new one.
  8. Before the operator started excluding the storage nodes, some data was rebalanced to the new nodes.
  9. The operator excluded the storage nodes and then immediately deleted them, before the exclusion was complete.
  10. Logs show that the excludes were done against the wrong IP. FDB responded that the nodes were not part of the cluster, so the operator concluded it could move forward immediately.

What did you expect to happen?

I might have expected the operator to not recreate process groups if they are already fully excluded. This could have been checked before recreation.

It is also a bit surprising that the operator will try to recreate the PVC of a process group that is marked for deletion, as I don't know if that will ever help recover any data.

When the nodes have been recreated with a new IP, I would expect the operator to verify that the exclusion has completed for all IP addresses.
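To make that expectation concrete, here is a minimal sketch in Go of a removal check that only passes once every address recorded for a process group is fully excluded. The `processGroup` type, the `fullyExcluded` function and the address map are hypothetical names for illustration and are not the operator's actual API. The sketch also assumes that an address the database does not report as excluded (including one it does not know about) is treated as not yet safe to remove.

```go
package main

import "fmt"

// processGroup is a hypothetical, simplified stand-in for the operator's
// process group status; it only tracks the addresses seen for the group.
type processGroup struct {
	ID        string
	Addresses []string
}

// fullyExcluded reports whether every known address of the group is fully
// excluded. An address missing from the map counts as not excluded, and a
// group without any recorded address is never considered safe to remove.
func fullyExcluded(pg processGroup, fullyExcludedAddrs map[string]bool) bool {
	if len(pg.Addresses) == 0 {
		return false
	}
	for _, addr := range pg.Addresses {
		if !fullyExcludedAddrs[addr] {
			return false
		}
	}
	return true
}

func main() {
	// The group carries both the old and the new service IP.
	pg := processGroup{ID: "storage-1", Addresses: []string{"10.0.0.5", "10.0.0.9"}}

	// Only the old address has been excluded so far.
	status := map[string]bool{"10.0.0.5": true}
	fmt.Println(fullyExcluded(pg, status)) // false

	// Once the new address is excluded as well, removal may proceed.
	status["10.0.0.9"] = true
	fmt.Println(fullyExcluded(pg, status)) // true
}
```

Treating unknown addresses as unsafe is a deliberately conservative choice in this sketch; it is the opposite of the behaviour described in step 10 above.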

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster that uses a service IP and does not use locality-based exclusion.
  2. Fill it with enough data to make an exclusion take some time.
  3. Start an exclusion.
  4. Disable the operator.
  5. Finish the exclusion manually and delete the PVCs, Pods and services for the excluded groups.
  6. Re-enable the operator.
  7. The bug will cause data loss, as some data is rebalanced back to the recreated groups before those groups are swiftly deleted.

Anything else we need to know?

This was also reported in this forum post: https://forums.foundationdb.org/t/incomplete-exclusion-in-fdb-operator/4301

FDB Kubernetes operator

```console
$ kubectl fdb version
1.28.1
```

Kubernetes version

```console
$ kubectl version
1.26.7-gke.500
```

Cloud provider

GCP

johscheuer commented 8 months ago

Sorry for the delay. The issue is (or was) the order in which the operator runs the different sub-reconcilers:

```go
subReconcilers := []clusterSubReconciler{
    updateStatus{},
    updateLockConfiguration{},
    updateConfigMap{},
    checkClientCompatibility{},
    deletePodsForBuggification{},
    replaceMisconfiguredProcessGroups{},
    replaceFailedProcessGroups{},
    addProcessGroups{},
    addServices{},
    addPVCs{},
    addPods{}, // --> Creates the Pod
    generateInitialClusterFile{},
    removeIncompatibleProcesses{},
    updateSidecarVersions{},
    updatePodConfig{},
    updateMetadata{},
    updateDatabaseConfiguration{},
    chooseRemovals{},
    excludeProcesses{},
    changeCoordinators{},
    bounceProcesses{},
    maintenanceModeChecker{},
    updatePods{},
    removeProcessGroups{}, // --> Checks the exclusion state
    removeServices{},
    updateStatus{},
}
```

Because the exclusion state is only checked at a later step (removeProcessGroups), the operator has already created the Pods (addPods) by the time that check runs. The operator does check whether all addresses are excluded: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/remove_process_groups.go#L355-L361. We can try to recreate this scenario in an e2e test case to see if the issue still occurs.
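The effect of that ordering can be illustrated with a toy model. The sketch below is not the operator's code: the `state` and `subReconciler` types, their fields and the `orderedSteps` helper are assumptions made for illustration. It only shows that when the Pod-creating step runs before the step that inspects the exclusion state, a manually deleted Pod is recreated first and removed afterwards.

```go
package main

import "fmt"

// state is a hypothetical, heavily simplified model of one process group.
type state struct {
	markedForRemoval bool
	excluded         bool
	podExists        bool
	log              []string
}

// subReconciler mirrors the idea of the operator's sub-reconcilers; the
// shape used here is an assumption made for this sketch.
type subReconciler struct {
	name string
	run  func(s *state)
}

// orderedSteps returns two steps in the same relative order as in the
// list above: addPods first, removeProcessGroups later.
func orderedSteps() []subReconciler {
	return []subReconciler{
		{"addPods", func(s *state) {
			// This step does not look at the exclusion state.
			if !s.podExists {
				s.podExists = true
				s.log = append(s.log, "addPods: created Pod")
			}
		}},
		{"removeProcessGroups", func(s *state) {
			// The exclusion state is only inspected here.
			if s.markedForRemoval && s.excluded && s.podExists {
				s.podExists = false
				s.log = append(s.log, "removeProcessGroups: removed Pod")
			}
		}},
	}
}

// reconcile runs all steps in order against the given state.
func reconcile(s *state, steps []subReconciler) {
	for _, step := range steps {
		step.run(s)
	}
}

func main() {
	// The Pod was deleted manually; the group is marked for removal and excluded.
	s := &state{markedForRemoval: true, excluded: true}
	reconcile(s, orderedSteps())
	for _, line := range s.log {
		fmt.Println(line)
	}
}
```

In this model the Pod exists between the two steps, which corresponds to the window in which data could be moved back onto a recreated storage process.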

Do you think you have the time/capacity to work on this?

johscheuer commented 5 months ago

@larshagencognite have you seen this issue again or are you able to write an e2e test for it?

johscheuer commented 2 weeks ago

Sorry for the late reply. If you manually excluded the processes and deleted the PVCs and Pods, a better choice is to use processGroupsToRemoveWithoutExclusion to remove the process groups: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/cluster_spec.md#foundationdbclusterspec. That way IsExcluded() returns true in the AddPods reconciler and the Pods are not recreated (the same applies to the PVCs).
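A rough sketch of why this avoids the recreation, written as a self-contained Go model rather than the operator's real types: the `processGroupStatus` struct, its field names and the `shouldCreatePod` helper are assumptions for illustration. The sketch assumes that listing a group under processGroupsToRemoveWithoutExclusion ends up as a "skip exclusion" flag on the group, and that a group counts as excluded when either that flag is set or an exclusion timestamp is recorded.

```go
package main

import (
	"fmt"
	"time"
)

// processGroupStatus is a simplified, hypothetical stand-in for the
// operator's per-process-group status.
type processGroupStatus struct {
	ID                 string
	ExclusionTimestamp *time.Time // set once an exclusion has completed
	ExclusionSkipped   bool       // assumed to be set for removal without exclusion
}

// isExcluded treats a group as excluded if the exclusion finished or if
// the exclusion was explicitly skipped.
func (s processGroupStatus) isExcluded() bool {
	return s.ExclusionSkipped || s.ExclusionTimestamp != nil
}

// shouldCreatePod models a Pod-creating step that skips excluded groups.
func shouldCreatePod(s processGroupStatus, podExists bool) bool {
	return !podExists && !s.isExcluded()
}

func main() {
	// Removed manually and listed for removal without exclusion.
	manual := processGroupStatus{ID: "storage-1", ExclusionSkipped: true}
	// Marked for removal, but the operator has not seen the exclusion finish.
	pending := processGroupStatus{ID: "storage-2"}

	fmt.Println(shouldCreatePod(manual, false))  // false
	fmt.Println(shouldCreatePod(pending, false)) // true
}
```

Under these assumptions the second case corresponds to the situation in the original report: the operator had no record of the manual exclusion, so the missing Pod was recreated.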

johscheuer commented 1 hour ago

I'm going to close this issue, feel free to reopen it if you think anything should be changed (e.g. documentation). Thanks again for reporting this issue.