Testing the non-responsive cluster flow using https://github.com/RamenDR/ramen/pull/1133
Steps:
Configure test
$ git diff test/basic-test/config.yaml
...
---
-repo: https://github.com/ramendr/ocm-ramen-samples.git
-path: subscription
-branch: main
-name: busybox-sample
-namespace: busybox-sample
+repo: https://github.com/nirs/ocm-ramen-samples.git
+path: k8s/busybox-regional-rbd-deploy/sub
+branch: test
+name: busybox-regional-rbd-deploy
+namespace: busybox-regional-rbd-deploy
dr_policy: ramen-basic-test
pvc_label: busybox
Deploy and enable DR with the regional-dr env
env=$PWD/test/envs/regional-dr.yaml
test/basic-test/setup $env
test/basic-test/deploy $env
test/basic-test/enable-dr $env
Simulate a disaster in the current cluster (dr1)
virsh -c qemu:///system suspend dr1
Fail over the application to the secondary cluster (dr2)
kubectl patch drpc busybox-regional-rbd-deploy-drpc \
--patch '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}' \
--type merge \
--namespace busybox-regional-rbd-deploy \
--context hub
Wait until the application is running on dr2
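One way to check, assuming the kubectl context for the secondary cluster is named dr2 (the managed cluster context name is an assumption here):
kubectl get pods --namespace busybox-regional-rbd-deploy --context dr2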
Disable DR
test/basic-test/disable-dr $env
(stuck)
In the ramen hub logs we see:
2023-11-20T13:31:20.864Z INFO controllers.DRPlacementControl controllers/drplacementcontrol_controller.go:628 Error in deleting DRPC: (waiting for VRGs count to go to zero) {"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "25743882-77dc-4572-bdca-60d18d26c97d"}
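The delete is presumably stuck because the hub waits for the VRG count to go to zero, and the suspended dr1 cluster can never confirm that its VRG was removed. One way to inspect the stuck DRPC (the name and namespace are taken from the log above):
kubectl get drpc busybox-regional-rbd-deploy-drpc \
    --namespace busybox-regional-rbd-deploy \
    --context hub \
    -o yaml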
We don't have any visibility into the cluster status on dr1; we simply see the last reported status.
$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr1 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
annotations:
drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
creationTimestamp: "2023-11-20T13:09:33Z"
generation: 1
name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
namespace: dr1
resourceVersion: "9665"
uid: d039bf36-90ac-47eb-abed-fb044d3f5e03
spec:
scope:
apiGroup: ramendr.openshift.io
kind: VolumeReplicationGroup
name: busybox-regional-rbd-deploy-drpc
namespace: busybox-regional-rbd-deploy
version: v1alpha1
status:
conditions:
- lastTransitionTime: "2023-11-20T13:10:03Z"
message: Watching resources successfully
reason: GetResourceProcessing
status: "True"
type: Processing
result:
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
creationTimestamp: "2023-11-20T13:09:34Z"
finalizers:
- volumereplicationgroups.ramendr.openshift.io/vrg-protection
generation: 1
managedFields:
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
manager: manager
operation: Update
time: "2023-11-20T13:09:34Z"
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.: {}
k:{"uid":"d59b5f73-9b77-4643-a95f-cbeeb9439ac3"}: {}
f:spec:
.: {}
f:async:
.: {}
f:replicationClassSelector: {}
f:schedulingInterval: {}
f:volumeSnapshotClassSelector: {}
f:pvcSelector: {}
f:replicationState: {}
f:s3Profiles: {}
f:volSync: {}
manager: work
operation: Update
time: "2023-11-20T13:09:34Z"
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:kubeObjectProtection: {}
f:lastGroupSyncBytes: {}
f:lastGroupSyncDuration: {}
f:lastGroupSyncTime: {}
f:lastUpdateTime: {}
f:observedGeneration: {}
f:protectedPVCs: {}
f:state: {}
manager: manager
operation: Update
subresource: status
time: "2023-11-20T13:11:38Z"
name: busybox-regional-rbd-deploy-drpc
namespace: busybox-regional-rbd-deploy
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
uid: d59b5f73-9b77-4643-a95f-cbeeb9439ac3
resourceVersion: "17240"
uid: 0c9bf489-bde7-4c74-a644-ea94f39701b1
spec:
async:
replicationClassSelector: {}
schedulingInterval: 1m
volumeSnapshotClassSelector: {}
pvcSelector:
matchLabels:
appname: busybox
replicationState: primary
s3Profiles:
- minio-on-dr1
- minio-on-dr2
volSync: {}
status:
conditions:
- lastTransitionTime: "2023-11-20T13:09:37Z"
message: PVCs in the VolumeReplicationGroup are ready for use
observedGeneration: 1
reason: Ready
status: "True"
type: DataReady
- lastTransitionTime: "2023-11-20T13:09:36Z"
message: VolumeReplicationGroup is replicating
observedGeneration: 1
reason: Replicating
status: "False"
type: DataProtected
- lastTransitionTime: "2023-11-20T13:09:34Z"
message: Restored cluster data
observedGeneration: 1
reason: Restored
status: "True"
type: ClusterDataReady
- lastTransitionTime: "2023-11-20T13:09:36Z"
message: Kube objects protected
observedGeneration: 1
reason: Uploaded
status: "True"
type: ClusterDataProtected
kubeObjectProtection: {}
lastGroupSyncBytes: 81920
lastGroupSyncDuration: 0s
lastGroupSyncTime: "2023-11-20T13:11:01Z"
lastUpdateTime: "2023-11-20T13:11:38Z"
observedGeneration: 1
protectedPVCs:
- accessModes:
- ReadWriteOnce
conditions:
- lastTransitionTime: "2023-11-20T13:09:37Z"
message: PVC in the VolumeReplicationGroup is ready for use
observedGeneration: 1
reason: Ready
status: "True"
type: DataReady
- lastTransitionTime: "2023-11-20T13:09:36Z"
message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [minio-on-dr1
minio-on-dr2]'
observedGeneration: 1
reason: Uploaded
status: "True"
type: ClusterDataProtected
- lastTransitionTime: "2023-11-20T13:09:37Z"
message: PVC in the VolumeReplicationGroup is ready for use
observedGeneration: 1
reason: Replicating
status: "False"
type: DataProtected
csiProvisioner: rook-ceph.rbd.csi.ceph.com
labels:
app: busybox-regional-rbd-deploy
app.kubernetes.io/part-of: busybox-regional-rbd-deploy
appname: busybox
ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
lastSyncBytes: 81920
lastSyncDuration: 0s
lastSyncTime: "2023-11-20T13:11:01Z"
name: busybox-pvc
namespace: busybox-regional-rbd-deploy
replicationID:
id: ""
resources:
requests:
storage: 1Gi
storageClassName: rook-ceph-block
storageID:
id: ""
state: Primary
On dr2 we see an error condition when trying to upload data to the s3 store on dr1:
$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr2 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
annotations:
drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
creationTimestamp: "2023-11-20T13:09:33Z"
generation: 1
name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
namespace: dr2
resourceVersion: "10795"
uid: c17d2721-c321-4328-90d9-dcda09eb2608
spec:
scope:
apiGroup: ramendr.openshift.io
kind: VolumeReplicationGroup
name: busybox-regional-rbd-deploy-drpc
namespace: busybox-regional-rbd-deploy
version: v1alpha1
status:
conditions:
- lastTransitionTime: "2023-11-20T13:12:33Z"
message: Watching resources successfully
reason: GetResourceProcessing
status: "True"
type: Processing
result:
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
creationTimestamp: "2023-11-20T13:12:27Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2023-11-20T13:18:10Z"
finalizers:
- volumereplicationgroups.ramendr.openshift.io/vrg-protection
generation: 2
managedFields:
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
manager: manager
operation: Update
time: "2023-11-20T13:12:27Z"
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.: {}
k:{"uid":"35837d74-d9b2-49d9-be70-1b2cb8db754a"}: {}
f:spec:
.: {}
f:action: {}
f:async:
.: {}
f:replicationClassSelector: {}
f:schedulingInterval: {}
f:volumeSnapshotClassSelector: {}
f:pvcSelector: {}
f:replicationState: {}
f:s3Profiles: {}
f:volSync: {}
manager: work
operation: Update
time: "2023-11-20T13:12:27Z"
- apiVersion: ramendr.openshift.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:kubeObjectProtection: {}
f:lastUpdateTime: {}
f:observedGeneration: {}
f:protectedPVCs: {}
f:state: {}
manager: manager
operation: Update
subresource: status
time: "2023-11-20T13:13:44Z"
name: busybox-regional-rbd-deploy-drpc
namespace: busybox-regional-rbd-deploy
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
uid: 35837d74-d9b2-49d9-be70-1b2cb8db754a
resourceVersion: "18956"
uid: 1527508b-cbaf-4edd-9ec1-83758d7c466a
spec:
action: Failover
async:
replicationClassSelector: {}
schedulingInterval: 1m
volumeSnapshotClassSelector: {}
pvcSelector:
matchLabels:
appname: busybox
replicationState: primary
s3Profiles:
- minio-on-dr1
- minio-on-dr2
volSync: {}
status:
conditions:
- lastTransitionTime: "2023-11-20T13:13:44Z"
message: PVCs in the VolumeReplicationGroup are ready for use
observedGeneration: 1
reason: Ready
status: "True"
type: DataReady
- lastTransitionTime: "2023-11-20T13:13:20Z"
message: VolumeReplicationGroup is replicating
observedGeneration: 1
reason: Replicating
status: "False"
type: DataProtected
- lastTransitionTime: "2023-11-20T13:12:55Z"
message: Restored cluster data
observedGeneration: 1
reason: Restored
status: "True"
type: ClusterDataReady
- lastTransitionTime: "2023-11-20T13:13:20Z"
message: Cluster data of one or more PVs are unprotected
observedGeneration: 1
reason: UploadError
status: "False"
type: ClusterDataProtected
kubeObjectProtection: {}
lastUpdateTime: "2023-11-20T13:13:44Z"
observedGeneration: 1
protectedPVCs:
- accessModes:
- ReadWriteOnce
conditions:
- lastTransitionTime: "2023-11-20T13:13:20Z"
message: PVC in the VolumeReplicationGroup is ready for use
observedGeneration: 1
reason: Ready
status: "True"
type: DataReady
- lastTransitionTime: "2023-11-20T13:13:07Z"
message: |-
error uploading PV to s3Profile minio-on-dr1, failed to protect cluster data for PVC busybox-pvc, failed to upload data of bucket:busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c, RequestError: send request failed
caused by: Put "http://192.168.122.208:30000/bucket/busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c": dial tcp 192.168.122.208:30000: connect: no route to host
observedGeneration: 1
reason: UploadError
status: "False"
type: ClusterDataProtected
- lastTransitionTime: "2023-11-20T13:13:20Z"
message: PVC in the VolumeReplicationGroup is ready for use
observedGeneration: 1
reason: Replicating
status: "False"
type: DataProtected
csiProvisioner: rook-ceph.rbd.csi.ceph.com
labels:
app: busybox-regional-rbd-deploy
app.kubernetes.io/part-of: busybox-regional-rbd-deploy
appname: busybox
ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
name: busybox-pvc
namespace: busybox-regional-rbd-deploy
replicationID:
id: ""
resources:
requests:
storage: 1Gi
storageClassName: rook-ceph-block
storageID:
id: ""
state: Primary
There is no visibility into the cluster status in the drclusters:
$ kubectl get drcluster --context hub -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr1"},"spec":{"region":"west","s3ProfileName":"minio-on-dr1"}}
creationTimestamp: "2023-11-20T13:03:21Z"
finalizers:
- drclusters.ramendr.openshift.io/ramen
generation: 1
labels:
cluster.open-cluster-management.io/backup: resource
name: dr1
resourceVersion: "8064"
uid: 2e1cfaec-d6b3-4b46-bca6-058281ca285f
spec:
region: west
s3ProfileName: minio-on-dr1
status:
conditions:
- lastTransitionTime: "2023-11-20T13:03:21Z"
message: Cluster Clean
observedGeneration: 1
reason: Clean
status: "False"
type: Fenced
- lastTransitionTime: "2023-11-20T13:03:21Z"
message: Cluster Clean
observedGeneration: 1
reason: Clean
status: "True"
type: Clean
- lastTransitionTime: "2023-11-20T13:03:22Z"
message: Validated the cluster
observedGeneration: 1
reason: Succeeded
status: "True"
type: Validated
phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr2"},"spec":{"region":"east","s3ProfileName":"minio-on-dr2"}}
creationTimestamp: "2023-11-20T13:03:21Z"
finalizers:
- drclusters.ramendr.openshift.io/ramen
generation: 1
labels:
cluster.open-cluster-management.io/backup: resource
name: dr2
resourceVersion: "8071"
uid: f6b37742-35b9-4738-96e6-8a920739b9fc
spec:
region: east
s3ProfileName: minio-on-dr2
status:
conditions:
- lastTransitionTime: "2023-11-20T13:03:21Z"
message: Cluster Clean
observedGeneration: 1
reason: Clean
status: "False"
type: Fenced
- lastTransitionTime: "2023-11-20T13:03:21Z"
message: Cluster Clean
observedGeneration: 1
reason: Clean
status: "True"
type: Clean
- lastTransitionTime: "2023-11-20T13:03:22Z"
message: Validated the cluster
observedGeneration: 1
reason: Succeeded
status: "True"
type: Validated
phase: Available
kind: List
metadata:
resourceVersion: ""
So far we have tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up the cluster, or even get its status using ManagedClusterView.
Simulating a non-responsive cluster is easy with virsh:
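For example, using the same command as in the steps above:
virsh -c qemu:///system suspend dr1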
Recover a cluster:
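Presumably by resuming the suspended VM (assuming the cluster name matches the libvirt domain name, as above):
virsh -c qemu:///system resume dr1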
Tested during failover: suspend the cluster before the failover, resume it after the application is running on the failover cluster.
Fix
Support marking a drcluster as unavailable. When a cluster is unavailable:
Recommended flow
Alternative flow
If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck:
Marking the cluster as unavailable should fix the issue but may require more manual work.
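One hypothetical shape for such a mark, shown only to illustrate the proposal (the annotation key below is made up and is not an existing Ramen API):
# hypothetical annotation key, not an existing Ramen API
kubectl annotate drcluster dr1 \
    drcluster.ramendr.openshift.io/unavailable=true \
    --context hub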
Issues:
Tasks
Similar k8s flows: