RamenDR / ramen


Simplify disable DR to single step #1469

Closed. nirs closed this 6 days ago

nirs commented 2 weeks ago

Instead of requiring manual steps, ramen modifies the Placement[Rule] so that it remains safe after disabling DR.
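
For illustration, a minimal sketch of what the modified Placement looks like after disabling DR, based on the gathered Placement shown in the test output below: the controller adds predicates pinning the application to its current cluster (dr1 in this test environment). Fields such as prioritizerPolicy and spreadPolicy are omitted here.

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement
  namespace: deployment-rbd
spec:
  clusterSets:
  - default
  numberOfClusters: 1
  # added by ramen when DR is disabled, pinning the app to its current cluster
  predicates:
  - requiredClusterSelector:
      claimSelector: {}
      labelSelector:
        matchExpressions:
        - key: name
          operator: In
          values:
          - dr1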

Changes:

Tested:

simplify-disable-dr.tar.gz

Not tested:

Related ocm-ramen-samples changes:

Fixes #1441

nirs commented 2 weeks ago

Testing disable DR after enabling DR

$ basic-test/deploy -c configs/deployment-k8s-regional-rbd.yaml envs/regional-dr.yaml 
2024-06-24 00:11:14,358 INFO    [deploy] Deploying application
2024-06-24 00:11:14,358 INFO    [deploy] Deploying application 'deployment-rbd'
2024-06-24 00:11:15,627 INFO    [deploy] Waiting for 'placement.cluster.open-cluster-management.io/placement' decisions
2024-06-24 00:11:15,919 INFO    [deploy] Application running on cluster 'dr1'

$ basic-test/enable-dr -c configs/deployment-k8s-regional-rbd.yaml envs/regional-dr.yaml 
2024-06-24 00:11:26,165 INFO    [enable-dr] Enable DR
2024-06-24 00:11:26,240 INFO    [enable-dr] Disabling OCM scheduling for 'placement.cluster.open-cluster-management.io/placement'
2024-06-24 00:11:26,434 INFO    [enable-dr] Waiting for 'placement.cluster.open-cluster-management.io/placement' decisions
2024-06-24 00:11:26,853 INFO    [enable-dr] waiting for namespace deployment-rbd
2024-06-24 00:11:27,028 INFO    [enable-dr] Waiting until 'drplacementcontrol.ramendr.openshift.io/deployment-rbd-drpc' reports status
2024-06-24 00:11:27,759 INFO    [enable-dr] Waiting for 'drplacementcontrol.ramendr.openshift.io/deployment-rbd-drpc' Available condition
2024-06-24 00:11:27,997 INFO    [enable-dr] Waiting for 'drplacementcontrol.ramendr.openshift.io/deployment-rbd-drpc' PeerReady condition
2024-06-24 00:11:28,229 INFO    [enable-dr] Waiting for 'drplacementcontrol.ramendr.openshift.io/deployment-rbd-drpc' first replication
2024-06-24 00:12:57,207 INFO    [enable-dr] DR enabled

$ kubectl gather --contexts hub,dr1,dr2 -n deployment-rbd -d gather.after-enable-dr
2024-06-24T00:17:29.663+0300    INFO    gather  Using kubeconfig "/home/nsoffer/.kube/config"
2024-06-24T00:17:29.666+0300    INFO    gather  Gathering from namespaces ["deployment-rbd"]
2024-06-24T00:17:29.667+0300    INFO    gather  Using all addons
2024-06-24T00:17:29.667+0300    INFO    gather  Gathering from cluster "hub"
2024-06-24T00:17:29.667+0300    INFO    gather  Gathering from cluster "dr1"
2024-06-24T00:17:29.667+0300    INFO    gather  Gathering from cluster "dr2"
2024-06-24T00:17:29.682+0300    INFO    gather  Gathered 0 resources from cluster "dr2" in 0.015 seconds
2024-06-24T00:17:29.836+0300    INFO    gather  Gathered 18 resources from cluster "hub" in 0.169 seconds
2024-06-24T00:17:29.838+0300    INFO    gather  Gathered 23 resources from cluster "dr1" in 0.171 seconds
2024-06-24T00:17:29.838+0300    INFO    gather  Gathered 41 resources from 3 clusters in 0.171 seconds

$ basic-test/disable-dr -c configs/deployment-k8s-regional-rbd.yaml envs/regional-dr.yaml 
2024-06-24 00:17:49,307 INFO    [disable-dr] Disable DR
2024-06-24 00:17:49,385 INFO    [disable-dr] Deleting 'drplacementcontrol.ramendr.openshift.io/deployment-rbd-drpc'
2024-06-24 00:17:57,299 INFO    [disable-dr] DR was disabled

$ kubectl gather --contexts hub,dr1,dr2 -n deployment-rbd -d gather.after-disable-dr

Comparing placement before/after:

$ diff -u gather.after-enable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml gather.after-disable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml 
--- gather.after-enable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml   2024-06-24 00:17:29.804953284 +0300
+++ gather.after-disable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml  2024-06-24 00:18:11.481164101 +0300
@@ -8,9 +8,7 @@
     kubectl.kubernetes.io/last-applied-configuration: |
       {"apiVersion":"cluster.open-cluster-management.io/v1beta1","kind":"Placement","metadata":{"annotations":{},"labels":{"app":"deployment-rbd"},"name":"placement","namespace":"deployment-rbd"},"spec":{"clusterSets":["default"],"numberOfClusters":1}}
   creationTimestamp: "2024-06-23T21:11:15Z"
-  finalizers:
-  - drpc.ramendr.openshift.io/finalizer
-  generation: 2
+  generation: 3
   labels:
     app: deployment-rbd
   managedFields:
@@ -59,25 +57,32 @@
         f:annotations:
           f:drplacementcontrol.ramendr.openshift.io/drpc-name: {}
           f:drplacementcontrol.ramendr.openshift.io/drpc-namespace: {}
-        f:finalizers:
-          .: {}
-          v:"drpc.ramendr.openshift.io/finalizer": {}
       f:spec:
+        f:predicates: {}
         f:prioritizerPolicy:
           .: {}
           f:mode: {}
         f:spreadPolicy: {}
     manager: manager
     operation: Update
-    time: "2024-06-23T21:11:26Z"
+    time: "2024-06-23T21:17:57Z"
   name: placement
   namespace: deployment-rbd
-  resourceVersion: "3604"
+  resourceVersion: "4719"
   uid: 07f54cce-1bcf-4a40-adfc-36d84d894b86
 spec:
   clusterSets:
   - default
   numberOfClusters: 1
+  predicates:
+  - requiredClusterSelector:
+      claimSelector: {}
+      labelSelector:
+        matchExpressions:
+        - key: name
+          operator: In
+          values:
+          - dr1
   prioritizerPolicy:
     mode: Additive
   spreadPolicy: {}

New note when modifying placement predicates:

2024-06-23T21:17:57.237Z        INFO    controllers.DRPlacementControl  util/placement.go:47    NOTE: modifying placement predicates to select current cluster  {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "e75aec3d-963a-4996-8717-bb804cb1c433", "namespace": "deployment-rbd", "placement": "placement", "cluster": "dr1"}

ShyamsundarR commented 1 week ago

@nirs please update envtests for the changes as appropriate.

nirs commented 1 week ago

Testing relocate when predicates differ:

diff -ur 07-relocate/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml 08-disable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml
--- 07-relocate/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml  2024-06-24 23:42:37.108256298 +0300
+++ 08-disable-dr/hub/namespaces/deployment-rbd/cluster.open-cluster-management.io/placements/placement.yaml    2024-06-24 23:43:15.955457463 +0300
@@ -8,9 +8,7 @@
     kubectl.kubernetes.io/last-applied-configuration: |
       {"apiVersion":"cluster.open-cluster-management.io/v1beta1","kind":"Placement","metadata":{"annotations":{},"labels":{"app":"deployment-rbd"},"name":"placement","namespace":"deployment-rbd"},"spec":{"clusterSets":["default"],"numberOfClusters":1}}
   creationTimestamp: "2024-06-24T20:24:35Z"
-  finalizers:
-  - drpc.ramendr.openshift.io/finalizer
-  generation: 3
+  generation: 4
   labels:
     app: deployment-rbd
   managedFields:
@@ -59,9 +57,6 @@
         f:annotations:
           f:drplacementcontrol.ramendr.openshift.io/drpc-name: {}
           f:drplacementcontrol.ramendr.openshift.io/drpc-namespace: {}
-        f:finalizers:
-          .: {}
-          v:"drpc.ramendr.openshift.io/finalizer": {}
       f:spec:
         f:predicates: {}
         f:prioritizerPolicy:
@@ -70,10 +65,10 @@
         f:spreadPolicy: {}
     manager: manager
     operation: Update
-    time: "2024-06-24T20:35:35Z"
+    time: "2024-06-24T20:43:05Z"
   name: placement
   namespace: deployment-rbd
-  resourceVersion: "29797"
+  resourceVersion: "31125"
   uid: 30758fc5-384b-4bb9-a335-af0743242d04
 spec:
   clusterSets:
@@ -87,7 +82,7 @@
         - key: name
           operator: In
           values:
-          - dr1
+          - dr2
   prioritizerPolicy:
     mode: Additive
   spreadPolicy: {}
Only in 07-relocate/hub/namespaces/deployment-rbd: ramendr.openshift.io

New logs:

2024-06-24T20:43:05.127Z    INFO    controllers.DRPlacementControl  util/placement.go:51    NOTE: modifying placement predicates to select current cluster  {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "8a75b90c-0d40-4107-bcb2-54778a4ce1f7", "namespace": "deployment-rbd", "placement": "placement", "cluster": "dr2"}

nirs commented 1 week ago

Testing disable DR during relocate

Initial state: the application is running on dr2 and the placement is pointing to dr1.

  1. Start relocate
  2. Watch the placementdecisions until the decisions list becomes empty (see the sketch after this list)
    watch -n 1 -x kubectl get placementdecisions placement-decision-1 -o jsonpath='{.status.decisions}{"\n"}' -n deployment-rbd --context hub
  3. Delete the drpc
    $ kubectl delete drpc deployment-rbd-drpc -n deployment-rbd --context hub
    drplacementcontrol.ramendr.openshift.io "deployment-rbd-drpc" deleted
  4. In the ramen log we see:
    2024-06-25T09:06:34.417Z        INFO    controllers.DRPlacementControl  controllers/drplacementcontrol_controller.go:2078       Found ClusterDecision   {"ClsDedicision": []}
    2024-06-25T09:06:34.417Z        INFO    controllers.DRPlacementControl  controllers/drplacementcontrol_controller.go:1953       Using DRPC preferredCluster, Relocate progression detected as switching to preferred cluster        {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "d3b0facf-aee0-42dc-a635-1835cbce2861"}
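
As an aside, a minimal sketch of roughly what the emptied PlacementDecision from step 2 looks like; the field shapes here are assumed from the OCM v1beta1 API rather than copied from this run.

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: placement-decision-1
  namespace: deployment-rbd
  labels:
    # assumed label linking the decision to its Placement
    cluster.open-cluster-management.io/placement: placement
status:
  # the watch in step 2 waits for this list to become empty
  decisions: []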

The placement was not modified, since the current value already matches the cluster name.

The final result is that the application is not running on any cluster. I am not sure if this is the wanted result, but I don't see how we can avoid it. If we delay deletion of the drpc until relocate is completed, it can get stuck forever without being able to delete the drpc.

To test manual recovery, I removed the scheduling-disable annotation from the placement, to see if the app would be created on cluster dr1 with the right data.
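
For reference, a minimal sketch of the annotated Placement before this recovery step; I am assuming the annotation in question is OCM's experimental scheduling-disable annotation, presumably the one the enable-dr step sets when it logs "Disabling OCM scheduling" above.

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement
  namespace: deployment-rbd
  annotations:
    # assumed key; deleting this annotation lets OCM schedule the placement again
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
spec:
  clusterSets:
  - default
  numberOfClusters: 1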

The app was started on cluster dr1 with a new PVC:

$ kubectl exec pod/busybox-6bbf88b9f8-fkz8j -n deployment-rbd --context dr1 -- cat /var/log/ramen.log
Tue Jun 25 11:01:07 UTC 2024 START
Tue Jun 25 11:01:17 UTC 2024 UPDATE
Tue Jun 25 11:01:27 UTC 2024 UPDATE
Tue Jun 25 11:01:37 UTC 2024 UPDATE

After changing the placement to cluster dr2, the app was started on cluster dr2, but again with a new PVC:

$ kubectl exec pod/busybox-6bbf88b9f8-kcjn9 -n deployment-rbd --context dr2 -- cat /var/log/ramen.log
Tue Jun 25 11:07:22 UTC 2024 START
Tue Jun 25 11:07:32 UTC 2024 UPDATE
Tue Jun 25 11:07:42 UTC 2024 UPDATE
Tue Jun 25 11:07:52 UTC 2024 UPDATE

So when disabling DR in the middle of relocate, we can lose the application data. I opened #1473 to track this issue.

nirs commented 1 week ago

We discussed this in the team meeting, and I think both ways are valid: we can keep disable DR a single step (for integration with the UI) by never removing the annotation (or the schedulerName for PlacementRule).

To disable DR you just delete the DRPC. There is no need to change the Placement[Rule], since OCM scheduling is always disabled.

If a user wants to re-enable OCM scheduling, they presumably do not care about the data, since OCM does not support stateful applications (e.g. moving the PVs to another cluster when changing the placement). So changing the placement is the user's responsibility, and it is not needed in the common case when disabling DR.
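
For the PlacementRule flavor mentioned above, a minimal sketch of what keeping OCM scheduling permanently disabled could look like; the resource name and the schedulerName value are assumptions here, the point being that a non-default scheduler name keeps OCM from rescheduling the rule.

apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  name: busybox-placement   # hypothetical name
  namespace: deployment-rbd
spec:
  clusterReplicas: 1
  # assumed value; leaving a non-default schedulerName in place keeps OCM scheduling disabled
  schedulerName: ramen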

I'll add another PR implementing the simpler approach.

nirs commented 1 week ago

Posted a simpler alternative based on @BenamarMk's suggestion: #1474

nirs commented 6 days ago

Replaced by #1474