RamenDR / ramen

Disable DR when a cluster is not responsive #1139

Closed by nirs 11 months ago

nirs commented 11 months ago

So far we have tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up that cluster, or even get its status using ManagedClusterView.

Simulating a non-responsive cluster is easy with virsh:

virsh -c qemu:///system suspend dr1

To recover the cluster:

virsh -c qemu:///system resume dr1

Tested during failover: suspend the cluster before the failover, and resume it after the application is running on the failover cluster.

Fix

Support marking a drcluster as unavailable (a possible shape for this is sketched after the recommended flow below). When a cluster is unavailable:

Recommended flow

  1. Mark the cluster as unavailable
  2. Failover the application to the good cluster
  3. Fix the drpolicy predicates if needed
  4. Delete the drpc
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
  7. Replace the unavailable cluster
  8. Enable DR again for the applications
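
One possible shape for marking a cluster as unavailable, shown only as an illustration since this API does not exist yet (the annotation name is hypothetical):

kubectl annotate drcluster dr1 \
    drcluster.ramendr.openshift.io/unavailable=true \
    --context hub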

Alternative flow

If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck (see the example delete command after the list below):

Marking the cluster as unavailable should fix the issue but may require more manual work.

  1. Failover the application to the good cluster
  2. Fix the drpolicy predicates if needed
  3. Delete the drpc - stuck because the cluster is unavailable
  4. Mark the drcluster as unavailable so the drpc deletion can finish
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
  7. Replace the unavailable cluster
  8. Enable DR again for the applications
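
For reference, the drpc deletion in step 3 is a plain delete on the hub; using the resource names from the test below, it looks like this and hangs until the drcluster is marked as unavailable:

kubectl delete drpc busybox-regional-rbd-deploy-drpc \
    --namespace busybox-regional-rbd-deploy \
    --context hub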

Issues:

Tasks

Similar k8s flows:

nirs commented 11 months ago

Testing non-responsive cluster flow using https://github.com/RamenDR/ramen/pull/1133

Steps:

  1. Configure test

    $ git diff test/basic-test/config.yaml
    ...
    ---
    -repo: https://github.com/ramendr/ocm-ramen-samples.git
    -path: subscription
    -branch: main
    -name: busybox-sample
    -namespace: busybox-sample
    +repo: https://github.com/nirs/ocm-ramen-samples.git
    +path: k8s/busybox-regional-rbd-deploy/sub
    +branch: test
    +name: busybox-regional-rbd-deploy
    +namespace: busybox-regional-rbd-deploy
     dr_policy: ramen-basic-test
     pvc_label: busybox
  2. Deploy and enable dr with regional-dr env

    env=$PWD/test/envs/regional-dr.yaml
    test/basic-test/setup $env
    test/basic-test/deploy $env
    test/basic-test/enable-dr $env
  3. Simulate disaster in current cluster (dr1)

    virsh -c qemu:///system suspend dr1
  4. Failover application to secondary cluster (dr2)

    kubectl patch drpc busybox-regional-rbd-deploy-drpc \
       --patch '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}' \
       --type merge \
       --namespace busybox-regional-rbd-deploy \
       --context hub
  5. Wait until application is running on dr2
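
     One way to check, assuming the context names from the regional-dr env and the drpc name used above (the drpc status phase is expected to report FailedOver once done):

    kubectl get pods --namespace busybox-regional-rbd-deploy --context dr2
    kubectl get drpc busybox-regional-rbd-deploy-drpc \
       --namespace busybox-regional-rbd-deploy \
       --context hub \
       --output jsonpath='{.status.phase}'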

  6. Disable dr

    test/basic-test/disable-dr $env

    (stuck)

Actual result

Deleting the drpc is stuck

In the ramen hub logs we see:

2023-11-20T13:31:20.864Z    INFO    controllers.DRPlacementControl  controllers/drplacementcontrol_controller.go:628    Error in deleting DRPC: (waiting for VRGs count to go to zero)  {"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "25743882-77dc-4572-bdca-60d18d26c97d"}
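
The hub is waiting for the VRGs on both clusters to go away; a quick way to see what the hub still has for dr1 (assuming the hub context name from the test env):

kubectl get manifestwork --namespace dr1 --context hub
kubectl get managedclusterview --namespace dr1 --context hub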

ManagedClusterViews

We don't have any visibility into the cluster status on dr1; we simply see the last reported status.

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr1 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr1
  resourceVersion: "9665"
  uid: d039bf36-90ac-47eb-abed-fb044d3f5e03
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:10:03Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:09:34Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 1
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"d59b5f73-9b77-4643-a95f-cbeeb9439ac3"}: {}
          f:spec:
            .: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastGroupSyncBytes: {}
            f:lastGroupSyncDuration: {}
            f:lastGroupSyncTime: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:11:38Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: d59b5f73-9b77-4643-a95f-cbeeb9439ac3
      resourceVersion: "17240"
      uid: 0c9bf489-bde7-4c74-a644-ea94f39701b1
    spec:
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:09:37Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:09:34Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: Kube objects protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastGroupSyncBytes: 81920
      lastGroupSyncDuration: 0s
      lastGroupSyncTime: "2023-11-20T13:11:01Z"
      lastUpdateTime: "2023-11-20T13:11:38Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:09:36Z"
          message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [minio-on-dr1
            minio-on-dr2]'
          observedGeneration: 1
          reason: Uploaded
          status: "True"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        lastSyncBytes: 81920
        lastSyncDuration: 0s
        lastSyncTime: "2023-11-20T13:11:01Z"
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary

On dr2 we see an error condition when trying to upload data to the s3 store on dr1:

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr2 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr2
  resourceVersion: "10795"
  uid: c17d2721-c321-4328-90d9-dcda09eb2608
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:12:33Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:12:27Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2023-11-20T13:18:10Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 2
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"35837d74-d9b2-49d9-be70-1b2cb8db754a"}: {}
          f:spec:
            .: {}
            f:action: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:13:44Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: 35837d74-d9b2-49d9-be70-1b2cb8db754a
      resourceVersion: "18956"
      uid: 1527508b-cbaf-4edd-9ec1-83758d7c466a
    spec:
      action: Failover
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:13:44Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:12:55Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: Cluster data of one or more PVs are unprotected
        observedGeneration: 1
        reason: UploadError
        status: "False"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastUpdateTime: "2023-11-20T13:13:44Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:13:07Z"
          message: |-
            error uploading PV to s3Profile minio-on-dr1, failed to protect cluster data for PVC busybox-pvc, failed to upload data of bucket:busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c, RequestError: send request failed
            caused by: Put "http://192.168.122.208:30000/bucket/busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c": dial tcp 192.168.122.208:30000: connect: no route to host
          observedGeneration: 1
          reason: UploadError
          status: "False"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary
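
The no route to host error is consistent with dr1 being suspended; the minio endpoint from the error message can also be probed directly from the host running the test environment, and is expected to fail while dr1 is down:

curl http://192.168.122.208:30000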

DRCluster

There is no visibility into the cluster status in the drcluster resources (see also the note after the listing below):

$ kubectl get drcluster --context hub -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr1"},"spec":{"region":"west","s3ProfileName":"minio-on-dr1"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr1
    resourceVersion: "8064"
    uid: 2e1cfaec-d6b3-4b46-bca6-058281ca285f
  spec:
    region: west
    s3ProfileName: minio-on-dr1
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr2"},"spec":{"region":"east","s3ProfileName":"minio-on-dr2"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr2
    resourceVersion: "8071"
    uid: f6b37742-35b9-4738-96e6-8a920739b9fc
  spec:
    region: east
    s3ProfileName: minio-on-dr2
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
kind: List
metadata:
  resourceVersion: ""
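
Note that the hub already tracks managed cluster availability through the ManagedCluster resource; one way to check it, assuming the hub context name from the test env, is below. Some time after a cluster stops responding, the ManagedClusterConditionAvailable condition is expected to move from "True" to "Unknown":

kubectl get managedcluster dr1 --context hub \
    --output jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'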