RamenDR / ramen

vrg status does not go to primary when there is no PV in the namespace #922

Open njjry opened 1 year ago

njjry commented 1 year ago

I tested failover/failback of CPD in the ibm-common-services namespace; there are no PVs in that namespace. After creating the VRG, it shows the status below:

$ kubectl get vrg co -n ibm-common-services -o yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"VolumeReplicationGroup","metadata":{"annotations":{},"name":"co","namespace":"ibm-common-services"},"spec":{"kubeObjectProtection":{"recipeRef":{"name":"ibmcpd-operators"}},"pvcSelector":{},"replicationState":"primary","s3Profiles":["s3profile-cluster6-ocs-external-storagecluster","s3profile-cluster8-ocs-external-storagecluster"],"sync":{},"volSync":{"disabled":true}}}
  creationTimestamp: "2023-06-09T21:47:27Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 2
  name: co
  namespace: ibm-common-services
  resourceVersion: "184963306"
  uid: e51c4abf-4b10-412d-9a4d-27a22421fc96
spec:
  kubeObjectProtection:
    recipeRef:
      name: ibmcpd-operators
  pvcSelector: {}
  replicationState: primary
  s3Profiles:
  - s3profile-cluster6-ocs-external-storagecluster
  - s3profile-cluster8-ocs-external-storagecluster
  sync: {}
  volSync:
    disabled: true
status:
  conditions:
  - lastTransitionTime: "2023-06-09T21:47:27Z"
    message: Failed to get list of pvcs
    observedGeneration: 1
    reason: Error
    status: "False"
    type: DataReady
  - lastTransitionTime: "2023-06-09T21:47:27Z"
    message: Initializing VolumeReplicationGroup
    observedGeneration: 1
    reason: Initializing
    status: Unknown
    type: DataProtected
  - lastTransitionTime: "2023-06-09T22:13:07Z"
    message: Restored cluster data
    observedGeneration: 1
    reason: Restored
    status: "True"
    type: ClusterDataReady
  - lastTransitionTime: "2023-06-09T22:13:08Z"
    message: Kube objects protected
    observedGeneration: 1
    reason: Uploaded
    status: "True"
    type: ClusterDataProtected
  kubeObjectProtection:
    captureToRecoverFrom:
      number: 1
      startGeneration: 1
      startTime: "2023-06-09T21:23:34Z"
  lastUpdateTime: "2023-06-09T22:20:34Z"
  observedGeneration: 2
  state: Unknown

Although all the cluster resources have been protected, the state does not transition to Primary.

tjanssen3 commented 1 year ago

Adding some details to this, as I encountered it recently.

It appears that if no PVCs are found by either the PVC label selector or the recipe's volume group, Ramen won't populate the VRG's status.state with Primary or Secondary. That's the main issue: it looks as though the VRG isn't reconciling, and status.state is left as Unknown or empty.

However, a missing status.state does not prevent kubeObjectProtection from taking backups, and if the user later fixes the issue (so the VRG finds PVCs to protect), Ramen will treat those existing backups as candidates to restore once Primary status is achieved. That second part is perhaps a "bug of a bug", but those are the circumstances under which I experienced this.
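
To illustrate the first point, here is a minimal, hypothetical Go sketch (not the actual Ramen reconciler; the types and the deriveState helper are stand-ins) of how status.state could still be derived from spec.replicationState when zero PVCs match:

// Hypothetical sketch only, not the actual Ramen reconciler: derive a
// status.state even when the PVC selector matches zero PVCs, instead of
// returning early and leaving the state as Unknown.
package sketch

// Local stand-ins for the Ramen API enums so the sketch is self-contained;
// the real definitions live in Ramen's api/v1alpha1 package.
type ReplicationState string

type State string

const (
    Primary   ReplicationState = "primary"
    Secondary ReplicationState = "secondary"

    PrimaryState   State = "Primary"
    SecondaryState State = "Secondary"
    UnknownState   State = "Unknown"
)

// deriveState mirrors the requested replication state when there are no PVCs
// to track; with PVCs present, the state would be derived from per-PVC
// replication progress (elided in this sketch).
func deriveState(requested ReplicationState, pvcCount int) State {
    if pvcCount == 0 {
        if requested == Primary {
            return PrimaryState
        }
        return SecondaryState
    }
    return UnknownState
}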

hatfieldbrian commented 1 year ago

I just encountered this during a failover, on the "from" side, which was cluster6:

$ oc get vrg --context kafka/api-cluster6-local:6443/kube:admin -oyaml bb
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"VolumeReplicationGroup","metadata":{"annotations":{},"name":"bb","namespace":"kafka"},"spec":{"kubeObjectProtection":{"captureInterval":"1m","recipeRef":{"name":"recipe-kafka"}},"pvcSelector":{"matchLabels":{}},"replicationState":"primary","s3Profiles":["s3profile-cluster6-ocs-external-storagecluster","s3profile-cluster8-ocs-external-storagecluster"],"sync":{}}}
  creationTimestamp: "2023-07-25T23:36:42Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-07-26T00:17:36Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 5
  name: bb
  namespace: kafka
  resourceVersion: "308674758"
  uid: c33a49c3-87b0-4b49-98fe-a78771285a96
spec:
  action: Failover
  kubeObjectProtection:
    captureInterval: 1m0s
    recipeRef:
      name: recipe-kafka
  pvcSelector: {}
  replicationState: secondary
  s3Profiles:
  - s3profile-cluster6-ocs-external-storagecluster
  - s3profile-cluster8-ocs-external-storagecluster
  sync: {}
  volSync: {}
status:
  conditions:
  - lastTransitionTime: "2023-07-26T00:17:47Z"
    message: Failed to get list of pvcs
    observedGeneration: 5
    reason: Error
    status: "False"
    type: DataReady
  - lastTransitionTime: "2023-07-26T00:17:30Z"
    message: VolumeReplicationGroup is replicating
    observedGeneration: 4
    reason: Replicating
    status: "False"
    type: DataProtected
  - lastTransitionTime: "2023-07-25T23:36:43Z"
    message: Restored cluster data
    observedGeneration: 2
    reason: Restored
    status: "True"
    type: ClusterDataReady
  - lastTransitionTime: "2023-07-26T00:17:30Z"
    message: Cluster data of all PVs are protected
    observedGeneration: 4
    reason: Uploaded
    status: "True"
    type: ClusterDataProtected
  kubeObjectProtection:
    captureToRecoverFrom:
      number: 1
      startGeneration: 2
      startTime: "2023-07-26T00:04:21Z"
  lastUpdateTime: "2023-07-26T00:17:47Z"
  observedGeneration: 5
  protectedPVCs:
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:43Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:45Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-2
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:45Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:47Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-0
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:47Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:49Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-1
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:49Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:51Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-0
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:51Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:53Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-2
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:53Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:54Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-1
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  state: Unknown

Problem

The DataReady error message "Failed to get list of pvcs" was changed to "Failed to process list of PVCs to protect" by https://github.com/RamenDR/ramen/commit/977c263fa7e00f1d27476c4a3f0379ac77f9dc0e on May 9, but the behavior remains the same.

The problem is that the Recipe the VRG references is deleted before the VRG itself. With the recipe gone, the VRG controller cannot determine whether it contained a PVC label selector.

2023-07-26T10:51:48.240Z        ERROR   controllers.VolumeReplicationGroup      controllers/volumereplicationgroup_controller.go:301    GetRecipeWithName error: %s-%s  {"VolumeReplicationGroup": "kafka/bb", "rid": "77422bac-e47c-41ef-846c-79c759c406ab", "bb": "kafka", "error": "Recipe.ramendr.openshift.io \"recipe-kafka\" not found"}
github.com/ramendr/ramen/controllers.GetPVCLabelSelector
        /workspace/controllers/volumereplicationgroup_controller.go:301
github.com/ramendr/ramen/controllers.(*VRGInstance).listPVCsByPVCSelector
        /workspace/controllers/volumereplicationgroup_controller.go:636
github.com/ramendr/ramen/controllers.(*VRGInstance).updatePVCList
        /workspace/controllers/volumereplicationgroup_controller.go:647
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
        /workspace/controllers/volumereplicationgroup_controller.go:499
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
        /workspace/controllers/volumereplicationgroup_controller.go:404
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
2023-07-26T10:51:48.240Z        ERROR   controllers.VolumeReplicationGroup      controllers/volumereplicationgroup_controller.go:500    Failed to update PersistentVolumeClaims for resource    {"VolumeReplicationGroup": "kafka/bb", "rid": "77422bac-e47c-41ef-846c-79c759c406ab", "error": "Recipe.ramendr.openshift.io \"recipe-kafka\" not found"}
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
        /workspace/controllers/volumereplicationgroup_controller.go:500
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
        /workspace/controllers/volumereplicationgroup_controller.go:404
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235

Approaches

Approach 1 - Change the DataReady message to indicate the recipe is missing so the user can address the issue

The user could then either delete the recipe reference or restore the recipe.
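
A minimal sketch of what approach 1 could look like, assuming the VRG's conditions are standard metav1.Condition values as the status output above suggests; the "RecipeNotFound" reason and the message wording are illustrative, not necessarily the strings Ramen would use:

// Sketch of approach 1: set DataReady with a reason and message that name the
// missing recipe. The condition type matches the status output above, but the
// "RecipeNotFound" reason and the message text are illustrative only.
package sketch

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setDataReadyRecipeMissing replaces the generic "Failed to process list of
// PVCs to protect" message with one that tells the user what to fix.
func setDataReadyRecipeMissing(conditions *[]metav1.Condition, generation int64, recipeName string) {
    meta.SetStatusCondition(conditions, metav1.Condition{
        Type:               "DataReady",
        Status:             metav1.ConditionFalse,
        Reason:             "RecipeNotFound",
        ObservedGeneration: generation,
        Message: fmt.Sprintf(
            "referenced recipe %q not found; restore the recipe or remove spec.kubeObjectProtection.recipeRef",
            recipeName),
    })
}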

Approach 2 - Use the VRG's PVC label selector if the referenced recipe is missing

Note: the recipe may not contain a PVC label selector.
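
A rough sketch of approach 2, using hypothetical helper names: tolerate a NotFound error from the recipe lookup and fall back to the VRG's own spec.pvcSelector, subject to the note above:

// Sketch of approach 2 with hypothetical helper names: if the recipe lookup
// reports NotFound, fall back to the VRG's own spec.pvcSelector instead of
// failing the reconcile.
package sketch

import (
    "context"

    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recipeSelectorFunc stands in for Ramen's recipe lookup; it is expected to
// return a NotFound error when the recipe object has been deleted.
type recipeSelectorFunc func(ctx context.Context, recipeName string) (metav1.LabelSelector, error)

// pvcSelectorWithFallback prefers the recipe's selector but tolerates a
// missing recipe by using the selector from the VRG spec. The deleted recipe
// may have carried a different selector (or none at all), so the fallback can
// select a different set of PVCs than was protected.
func pvcSelectorWithFallback(ctx context.Context, lookup recipeSelectorFunc, recipeName string,
    vrgSelector metav1.LabelSelector) (metav1.LabelSelector, error) {
    sel, err := lookup(ctx, recipeName)
    if err == nil {
        return sel, nil
    }
    if errors.IsNotFound(err) {
        return vrgSelector, nil
    }
    return metav1.LabelSelector{}, err
}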

Approach 3 - Use the latest captured recipe, if one exists

Issues #860 and #861 suggest approaches for capturing the recipe at Kube resource backup time.

Recommended approach

I recommend approach 1 for the Ramen code. For the failover and failback scripts, I recommend adding a finalizer to the recipe and removing it only once the VRG has finally been deleted.
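
To illustrate the finalizer idea, here is a rough sketch assuming the failover/failback orchestration has a Go controller-runtime client available; the finalizer name and the Recipe API version are assumptions for illustration (the group and kind come from the error above):

// Sketch of the finalizer idea, assuming a Go controller-runtime client is
// available to the failover/failback orchestration. The finalizer name and
// the Recipe API version are assumptions; the group and kind come from the
// "Recipe.ramendr.openshift.io not found" error above.
package sketch

import (
    "context"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; any unique, well-formed key would do.
const recipeProtectionFinalizer = "ramendr.openshift.io/recipe-protection"

var recipeGVK = schema.GroupVersionKind{
    Group:   "ramendr.openshift.io",
    Version: "v1alpha1", // assumed; adjust to the installed Recipe CRD version
    Kind:    "Recipe",
}

// pinRecipe adds the finalizer so the recipe cannot be fully deleted while a
// VRG still references it.
func pinRecipe(ctx context.Context, c client.Client, namespace, name string) error {
    recipe := &unstructured.Unstructured{}
    recipe.SetGroupVersionKind(recipeGVK)
    if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, recipe); err != nil {
        return err
    }
    controllerutil.AddFinalizer(recipe, recipeProtectionFinalizer)
    return c.Update(ctx, recipe)
}

// unpinRecipe removes the finalizer once the VRG is gone, letting any pending
// recipe deletion complete.
func unpinRecipe(ctx context.Context, c client.Client, namespace, name string) error {
    recipe := &unstructured.Unstructured{}
    recipe.SetGroupVersionKind(recipeGVK)
    if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, recipe); err != nil {
        return client.IgnoreNotFound(err)
    }
    controllerutil.RemoveFinalizer(recipe, recipeProtectionFinalizer)
    return c.Update(ctx, recipe)
}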

hatfieldbrian commented 1 year ago

Labeled low priority because of the workaround of creating a dummy volume.