Resolving a Ramen Validation Issue in Failover Cluster Post Hub Recovery

BenamarMk commented 1 month ago

This PR fixes a day-one issue in the post-hub-recovery process. The issue was uncovered recently when we added a new validation check in Ramen that validates the failover cluster before initiating the failover procedure.

Before this validation was introduced, the sync and failover operations functioned correctly post-hub recovery despite the underlying bug. The ReplicationDestination already existed on the failover cluster, so the missing RDSpec had no effect.

Note that this issue happens only when the primary cluster is inaccessible. In that situation, Ramen cannot retrieve the VRG from the primary cluster. The RDSpecs are created for each ProtectedPVC object in the primary VRG, so when Ramen regenerates the ManifestWork for the VRG without access to the primary VRG, the RDSpecs are excluded.

The solution in this PR involves adding a check for the existence of the ManifestWork before creating it, to prevent accidentally overwriting a valid VRG.spec. If a ManifestWork already exists on the destination cluster, its creation is skipped. Additionally, a utility function has been created that returns a map of VRGs retrieved using MCV. If a primary VRG is not found, the function retrieves it from the S3 store and adds it to the map. This new function replaces all occurrences of d.vrgs throughout the codebase, ensuring that DRPC maintains an accurate view of all VRGs in the managed clusters.
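
The two parts of the fix described above can be sketched roughly as follows. This is a minimal, self-contained Go illustration and not Ramen's actual code: the `VRG` struct, `collectVRGs`, `manifestWorkStore`, and `ensureManifestWork` are hypothetical names, and the MCV and S3 lookups are stubbed with plain functions.

```go
package main

import (
	"errors"
	"fmt"
)

// VRG is a simplified, hypothetical stand-in for Ramen's VolumeReplicationGroup.
type VRG struct {
	Cluster string
	Primary bool
	RDSpecs []string // one entry per protected PVC (simplified)
}

var errNotFound = errors.New("vrg not found")

// vrgFetcher is a hypothetical stand-in for an MCV lookup on one cluster.
type vrgFetcher func(cluster string) (*VRG, error)

// collectVRGs returns a map of cluster name to VRG built from MCV lookups.
// If no primary VRG is found that way, it falls back to the S3 copy.
func collectVRGs(clusters []string, fromMCV vrgFetcher, fromS3 func() (*VRG, error)) (map[string]*VRG, error) {
	vrgs := make(map[string]*VRG)
	havePrimary := false

	for _, cluster := range clusters {
		vrg, err := fromMCV(cluster)
		if err != nil {
			continue // cluster inaccessible or VRG missing: skip it
		}
		vrgs[cluster] = vrg
		if vrg.Primary {
			havePrimary = true
		}
	}

	if havePrimary {
		return vrgs, nil
	}

	// S3 is consulted only when there is no primary.
	vrg, err := fromS3()
	if err != nil {
		return vrgs, fmt.Errorf("no primary VRG via MCV, S3 lookup failed: %w", err)
	}
	vrgs[vrg.Cluster] = vrg

	return vrgs, nil
}

// manifestWorkStore is a hypothetical in-memory stand-in for the
// ManifestWorks that carry a VRG spec, keyed by cluster name.
type manifestWorkStore map[string]*VRG

// ensureManifestWork creates the entry only if none exists, so a valid
// existing VRG spec is never overwritten. It reports whether it created one.
func (s manifestWorkStore) ensureManifestWork(cluster string, vrg *VRG) bool {
	if _, exists := s[cluster]; exists {
		return false
	}
	s[cluster] = vrg
	return true
}

func main() {
	// c1 (primary) is inaccessible; only c2 answers the MCV lookup.
	mcv := func(cluster string) (*VRG, error) {
		if cluster == "c2" {
			return &VRG{Cluster: "c2"}, nil
		}
		return nil, errNotFound
	}
	s3 := func() (*VRG, error) {
		return &VRG{Cluster: "c1", Primary: true, RDSpecs: []string{"pvc-a"}}, nil
	}

	vrgs, err := collectVRGs([]string{"c1", "c2"}, mcv, s3)
	fmt.Println(len(vrgs), err)

	// An existing ManifestWork with RDSpecs must survive a regeneration attempt.
	store := manifestWorkStore{"c2": &VRG{Cluster: "c2", RDSpecs: []string{"pvc-a"}}}
	created := store.ensureManifestWork("c2", &VRG{Cluster: "c2"})
	fmt.Println(created, store["c2"].RDSpecs)
}
```

In this sketch the S3 copy is consulted only when no primary VRG is visible through MCV, and an existing ManifestWork is left untouched, which mirrors the two behaviours the description calls out.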

https://bugzilla.redhat.com/show_bug.cgi?id=2284021

TODO:

[x] Test the fix end-to-end

BenamarMk commented 1 month ago

@ShyamsundarR unit tests and e2e testing are complete. Ready for review, and for merge by Monday if possible.

BenamarMk commented 1 month ago

@BenamarMk let's discuss the changes. Overall we should rely on the S3 version of the VRG when the MCV version is not found, but I am not sure we can make this the default in all/most cases.

One thing to note: we rely on S3 only when there is no primary.

RamenDR / ramen

Resolving a Ramen Validation Issue in Failover Cluster Post Hub Recovery #1431