RamenDR / ramen

Disable DR when a cluster is not responsive #1139

Closed by nirs 11 months ago

nirs commented 11 months ago

So far we have tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up that cluster, or even get its status using ManagedClusterView.

Simulating a non-responsive cluster is easy with virsh:

virsh -c qemu:///system suspend dr1

To recover the cluster:

virsh -c qemu:///system resume dr1

Tested during failover: suspend the cluster before the failover, and resume it after the application is running on the failover cluster.

Fix

Support marking a drcluster as unavailable (a possible shape for this is sketched after the recommended flow below). When a cluster is unavailable:

Recommended flow

  1. Mark the cluster as unavailable
  2. Failover the application to the good cluster
  3. Fix the drpolicy predicates if needed
  4. Delete the drpc
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
  7. Replace the unavailable cluster
  8. Enable DR again for the applications
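
One possible shape for marking a cluster as unavailable, shown only as an illustration since this API does not exist yet (the annotation name is hypothetical):

kubectl annotate drcluster dr1 \
    drcluster.ramendr.openshift.io/unavailable=true \
    --context hub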

Alternative flow

If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck (see the example delete command after the list below):

Marking the cluster as unavailable should fix the issue but may require more manual work.

  1. Failover the application to the good cluster
  2. Fix the drpolicy predicates if needed
  3. Delete the drpc - stuck because the cluster is unavailable
  4. Mark the drcluster as unavailable so the drpc deletion can finish
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
  7. Replace the unavailable cluster
  8. Enable DR again for the applications
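
For reference, the drpc deletion in step 3 is a plain delete on the hub; using the resource names from the test below, it looks like this and hangs until the drcluster is marked as unavailable:

kubectl delete drpc busybox-regional-rbd-deploy-drpc \
    --namespace busybox-regional-rbd-deploy \
    --context hub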

Issues:

Tasks

Similar k8s flows:

nirs commented 11 months ago

Testing non-responsive cluster flow using https://github.com/RamenDR/ramen/pull/1133

Steps:

  1. Configure test

    $ git diff test/basic-test/config.yaml
    ...
    ---
    -repo: https://github.com/ramendr/ocm-ramen-samples.git
    -path: subscription
    -branch: main
    -name: busybox-sample
    -namespace: busybox-sample
    +repo: https://github.com/nirs/ocm-ramen-samples.git
    +path: k8s/busybox-regional-rbd-deploy/sub
    +branch: test
    +name: busybox-regional-rbd-deploy
    +namespace: busybox-regional-rbd-deploy
     dr_policy: ramen-basic-test
     pvc_label: busybox
  2. Deploy and enable dr with regional-dr env

    env=$PWD/test/envs/regional-dr.yaml
    test/basic-test/setup $env
    test/basic-test/deploy $env
    test/basic-test/enable-dr $env
  3. Simulate disaster in current cluster (dr1)

    virsh -c qemu:///system suspend dr1
  4. Failover application to secondary cluster (dr2)

    kubectl patch drpc busybox-regional-rbd-deploy-drpc \
       --patch '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}' \
       --type merge \
       --namespace busybox-regional-rbd-deploy \
       --context hub
  5. Wait until application is running on dr2
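
     One way to check, assuming the context names from the regional-dr env and the drpc name used above (the drpc status phase is expected to report FailedOver once done):

    kubectl get pods --namespace busybox-regional-rbd-deploy --context dr2
    kubectl get drpc busybox-regional-rbd-deploy-drpc \
       --namespace busybox-regional-rbd-deploy \
       --context hub \
       --output jsonpath='{.status.phase}'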

  6. Disable dr

    test/basic-test/disable-dr $env

    (stuck)

Actual result

Deleting the drpc is stuck

In the ramen hub logs we see:

2023-11-20T13:31:20.864Z    INFO    controllers.DRPlacementControl  controllers/drplacementcontrol_controller.go:628    Error in deleting DRPC: (waiting for VRGs count to go to zero)  {"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "25743882-77dc-4572-bdca-60d18d26c97d"}
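
The hub is waiting for the VRGs on both clusters to go away; a quick way to see what the hub still has for dr1 (assuming the hub context name from the test env):

kubectl get manifestwork --namespace dr1 --context hub
kubectl get managedclusterview --namespace dr1 --context hub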

ManagedClusterViews

We don't have any visibility into the cluster status on dr1; we simply see the last reported status.

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr1 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr1
  resourceVersion: "9665"
  uid: d039bf36-90ac-47eb-abed-fb044d3f5e03
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:10:03Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:09:34Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 1
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"d59b5f73-9b77-4643-a95f-cbeeb9439ac3"}: {}
          f:spec:
            .: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastGroupSyncBytes: {}
            f:lastGroupSyncDuration: {}
            f:lastGroupSyncTime: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:11:38Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: d59b5f73-9b77-4643-a95f-cbeeb9439ac3
      resourceVersion: "17240"
      uid: 0c9bf489-bde7-4c74-a644-ea94f39701b1
    spec:
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:09:37Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:09:34Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: Kube objects protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastGroupSyncBytes: 81920
      lastGroupSyncDuration: 0s
      lastGroupSyncTime: "2023-11-20T13:11:01Z"
      lastUpdateTime: "2023-11-20T13:11:38Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:09:36Z"
          message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [minio-on-dr1
            minio-on-dr2]'
          observedGeneration: 1
          reason: Uploaded
          status: "True"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        lastSyncBytes: 81920
        lastSyncDuration: 0s
        lastSyncTime: "2023-11-20T13:11:01Z"
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary

On dr2 we see an error condition when trying to upload data to the s3 store on dr1:

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr2 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr2
  resourceVersion: "10795"
  uid: c17d2721-c321-4328-90d9-dcda09eb2608
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:12:33Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:12:27Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2023-11-20T13:18:10Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 2
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"35837d74-d9b2-49d9-be70-1b2cb8db754a"}: {}
          f:spec:
            .: {}
            f:action: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:13:44Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: 35837d74-d9b2-49d9-be70-1b2cb8db754a
      resourceVersion: "18956"
      uid: 1527508b-cbaf-4edd-9ec1-83758d7c466a
    spec:
      action: Failover
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:13:44Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:12:55Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: Cluster data of one or more PVs are unprotected
        observedGeneration: 1
        reason: UploadError
        status: "False"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastUpdateTime: "2023-11-20T13:13:44Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:13:07Z"
          message: |-
            error uploading PV to s3Profile minio-on-dr1, failed to protect cluster data for PVC busybox-pvc, failed to upload data of bucket:busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c, RequestError: send request failed
            caused by: Put "http://192.168.122.208:30000/bucket/busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c": dial tcp 192.168.122.208:30000: connect: no route to host
          observedGeneration: 1
          reason: UploadError
          status: "False"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary
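
The no route to host error is consistent with dr1 being suspended; the minio endpoint from the error message can also be probed directly from the host running the test environment, and is expected to fail while dr1 is down:

curl http://192.168.122.208:30000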

DRCluster

There is no visibility into the cluster status in the drcluster resources (see also the note after the listing below):

$ kubectl get drcluster --context hub -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr1"},"spec":{"region":"west","s3ProfileName":"minio-on-dr1"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr1
    resourceVersion: "8064"
    uid: 2e1cfaec-d6b3-4b46-bca6-058281ca285f
  spec:
    region: west
    s3ProfileName: minio-on-dr1
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr2"},"spec":{"region":"east","s3ProfileName":"minio-on-dr2"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr2
    resourceVersion: "8071"
    uid: f6b37742-35b9-4738-96e6-8a920739b9fc
  spec:
    region: east
    s3ProfileName: minio-on-dr2
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
kind: List
metadata:
  resourceVersion: ""
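
Note that the hub already tracks managed cluster availability through the ManagedCluster resource; one way to check it, assuming the hub context name from the test env, is below. Some time after a cluster stops responding, the ManagedClusterConditionAvailable condition is expected to move from "True" to "Unknown":

kubectl get managedcluster dr1 --context hub \
    --output jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'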