RamenDR / ramen

Apache License 2.0

Update placement decision only once Primary VRG is ready (for relocate) #1446

Closed ShyamsundarR closed 1 month ago

ShyamsundarR commented 1 month ago

In relocate, once we mark the VRG as Primary on preferredCluster (in switchToCluster), we exit the current reconcile iteration, because we need to wait for the VRG to be delivered via the ManifestWork and then read back via the ManagedClusterView.

In a subsequent reconcile, we detect in RunRelocate that the VRG is already marked as Primary on the preferredCluster and proceed to finish up the relocation.

Because of this flow, switchToCluster never gets to check if VRG has reached Primary state as desired, before the PlacementDecision is updated.

In RunRelocate, once we detect vrgExistsAndPrimary, we do not check whether the VRG has reached Primary state; we go straight into the ensure routines that update the placement decision.

As a result, it is quite possible that the VRG has not yet restored the PVC/PV when the application is deployed. This would cause new PVs to be provisioned for the workload, resulting in loss of data.

The fix is along the lines of RunFailover: if we detect the VRG as Primary, we check the VRG's readiness as Primary before updating the placement decision.
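
To illustrate the intended ordering, here is a minimal, self-contained sketch of the gate. It is a simplified model and not the code in this PR: the types `vrgView` and `hub`, the `Ready` field, and the lowercase `runRelocate` function are hypothetical stand-ins for the VRG state read back through the ManagedClusterView and for the hub-side relocate logic.

```go
package main

import "fmt"

// vrgView is a hypothetical, simplified stand-in for the VRG state the hub
// reads back via a ManagedClusterView. It is not Ramen's actual type.
type vrgView struct {
	Primary bool // VRG has been marked Primary on this cluster
	Ready   bool // assumed flag: VRG reports it has reached Primary state (PVC/PV restored)
}

// hub is a hypothetical model of the hub-side state used by relocate.
type hub struct {
	views             map[string]*vrgView // cluster name -> last observed VRG
	placementDecision string              // cluster the application is placed on
}

// runRelocate models one reconcile pass. It returns true only once the
// placement decision points at the preferred cluster.
func (h *hub) runRelocate(preferred string) bool {
	vrg, found := h.views[preferred]
	if !found || !vrg.Primary {
		// Switch path: mark the VRG as Primary and leave this reconcile,
		// since the change still has to be delivered and read back.
		h.views[preferred] = &vrgView{Primary: true}
		return false
	}

	// The gate: the VRG exists and is Primary, but the placement decision
	// must not change until the VRG is actually ready as Primary.
	if !vrg.Ready {
		return false
	}

	h.placementDecision = preferred
	return true
}

func main() {
	h := &hub{views: map[string]*vrgView{}, placementDecision: "cluster1"}

	fmt.Println(h.runRelocate("cluster2"), h.placementDecision) // marks Primary, requeues
	fmt.Println(h.runRelocate("cluster2"), h.placementDecision) // Primary but not ready, requeues

	h.views["cluster2"].Ready = true                            // restore completes on the managed cluster
	fmt.Println(h.runRelocate("cluster2"), h.placementDecision) // placement decision is updated
}
```

Without the `Ready` check, the second pass would already move the placement decision, which is the window in which the application could be deployed before the PVC/PV restore.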

An alternative fix was to force entering switchToCluster in a subsequent reconcile, so that the switch is handled in one place. This would cause hub recovery cases to fail: on a hub + managed cluster loss and a subsequent hub recovery, a relocated workload would be stuck in setupRelocation, attempting to ensure that the lost (current) Secondary VRG is in the right state.

Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>

ShyamsundarR commented 1 month ago

TODO, test Volsync based workloads

Done testing volsync, all good.