RamenDR / ramen

Apache License 2.0

Recovery workflow must be present to be executed #861

Open hatfieldbrian opened 1 year ago

hatfieldbrian commented 1 year ago

Problem

In general, a DR recipe is expected to be written by an application developer and included as part of the application. To fail over an application, creating a VRG used to be sufficient. Now that a recipe is external to a VRG, however, the recipe must also be present for its recovery workflow to be executed. If a VRG references a recipe and that recipe is present in the API server, Ramen should use it. Otherwise, the recipe needs to be recovered somehow.
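The lookup order described above can be sketched roughly as follows. This is only an illustration: `Recipe`, `RecipeSource`, `MapSource`, and `ResolveRecipe` are hypothetical names and not part of Ramen's API, and the in-memory maps stand in for the API server and for whatever backup location ends up holding the recipe.

```go
package main

import "errors"

// Recipe is a hypothetical stand-in for the real Recipe custom resource.
type Recipe struct {
	Name      string
	Namespace string
}

// ErrNotFound is returned by a RecipeSource that does not hold the recipe.
var ErrNotFound = errors.New("recipe not found")

// RecipeSource is a hypothetical lookup interface. In practice one
// implementation would query the API server and another a backup store.
type RecipeSource interface {
	Get(namespace, name string) (*Recipe, error)
}

// MapSource is an in-memory RecipeSource keyed by "namespace/name".
type MapSource map[string]*Recipe

// Get returns the stored recipe or ErrNotFound.
func (m MapSource) Get(namespace, name string) (*Recipe, error) {
	r, ok := m[namespace+"/"+name]
	if !ok {
		return nil, ErrNotFound
	}
	return r, nil
}

// ResolveRecipe prefers the recipe present in the API server and falls back
// to the backup only when the API server reports it missing. The boolean
// result reports whether the recipe came from the backup.
func ResolveRecipe(apiServer, backup RecipeSource, namespace, name string) (*Recipe, bool, error) {
	r, err := apiServer.Get(namespace, name)
	if err == nil {
		return r, false, nil
	}
	if !errors.Is(err, ErrNotFound) {
		// Any error other than "missing" is surfaced rather than masked.
		return nil, false, err
	}
	r, err = backup.Get(namespace, name)
	if err != nil {
		return nil, false, err
	}
	return r, true, nil
}
```

Calling `ResolveRecipe` with an empty API-server source and a populated backup source returns the backed-up recipe with the boolean set to `true`, which is the case this issue is about.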

Approaches

Approach 1 - A bootstrap VRG

A bootstrap VRG without a recipe captures and recovers the recipe. Create it first, then recover the rest of the namespace with a second VRG. This "pre-failover" VRG could potentially be expanded to include other "harmless" things, recovered at some frequency with the "merge" option, to keep a warm standby. However, if it is not used for a warm standby, this is less usable than the one-step operation of creating a VRG to do recovery: the bootstrap VRG must be created and its status monitored for completion before the normal VRG can be created.

Approach 2 - Ramen protects and recovers Recipe

If a VRG specifies a recipe ref, then Ramen uploads the referenced recipe to S3 at the beginning of a Kube objects capture, in either the Ramen native format used for PV, PVC, and VRG, or the Velero format. The former is preferable because its upload and download interfaces are blocking, which is simpler to implement and likely a little faster than waiting for an event and re-reconciling. The recipe CRD should also be packaged with Ramen, which must already be present for the VRG CRD and VRG controller. This solution also addresses the issue of tolerating recipe changes mid-workflow.
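A minimal sketch of this approach, assuming a blocking object-store interface. The names `ObjectStore`, `MemStore`, `recipeKey`, `CaptureStart`, and `RecoverRecipe`, the simplified `Recipe` fields, and the key layout are all hypothetical and are not Ramen's actual S3 code. The point illustrated is that the replica is written once at capture start, so later edits to the live recipe do not affect the recovery workflow.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Recipe is a simplified, hypothetical stand-in for the Recipe resource.
type Recipe struct {
	Name     string   `json:"name"`
	Workflow []string `json:"workflow"`
}

// ObjectStore is a hypothetical blocking upload/download interface.
type ObjectStore interface {
	Upload(key string, data []byte) error
	Download(key string) ([]byte, error)
}

// MemStore is an in-memory ObjectStore standing in for an S3 bucket.
type MemStore map[string][]byte

// Upload stores a copy of data under key.
func (s MemStore) Upload(key string, data []byte) error {
	s[key] = append([]byte(nil), data...)
	return nil
}

// Download returns the bytes stored under key, or an error if absent.
func (s MemStore) Download(key string) ([]byte, error) {
	data, ok := s[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	return data, nil
}

// recipeKey builds an assumed key layout; the real layout may differ.
func recipeKey(vrgNamespace, vrgName string) string {
	return vrgNamespace + "/" + vrgName + "/recipe.json"
}

// CaptureStart uploads a replica of the live recipe before any Kube
// objects are captured. The call blocks until the upload completes.
func CaptureStart(store ObjectStore, vrgNamespace, vrgName string, live Recipe) error {
	data, err := json.Marshal(live)
	if err != nil {
		return err
	}
	return store.Upload(recipeKey(vrgNamespace, vrgName), data)
}

// RecoverRecipe downloads the replica written at capture start, so the
// recovery workflow does not depend on the recipe existing in the cluster.
func RecoverRecipe(store ObjectStore, vrgNamespace, vrgName string) (Recipe, error) {
	var r Recipe
	data, err := store.Download(recipeKey(vrgNamespace, vrgName))
	if err != nil {
		return r, err
	}
	err = json.Unmarshal(data, &r)
	return r, err
}
```

Because both calls block, the caller can proceed to the next workflow step as soon as they return, without waiting for an event and re-reconciling.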

ShyamsundarR commented 1 year ago

@hatfieldbrian I think approach 2 was what was discussed; approach 1, for warm standby and similar cases, seems a bit convoluted at present.

The observation that Ramen needs to package the CRD may not hold, though. We only need to read the Recipe, back it up, and restore it from backup. The CRD should already be present on the API server before any of that. In other words, Ramen packaging the Recipe CRD does not seem to be a requirement. What am I missing?

hatfieldbrian commented 1 year ago

@ShyamsundarR I agree approach 1 is convoluted and prefer approach 2.

Regarding the Recipe CRD installation, I presumed it would be included in the Ramen package. If it is not, then that is one more thing that is not automated and could go wrong during recovery. If Ramen were to use Velero to protect Recipe instances, the Recipe CRD would also be protected and restored automatically.

A third approach requires that the "user" supply both the Recipe CRD and the Recipe referenced by the VRG they create.

cc: @pmuench1717

hatfieldbrian commented 1 year ago

Taking approach 2: storing a Recipe replica in S3 at capture start time and using it for both recovery and capture.

In other words, the VRG's Recipe reference is dereferenced for workflows only at capture start time. It may still be dereferenced at other times for the Recipe's volume label selector. VRG dereferences may be optimized to one per reconcile as part of this issue.

hatfieldbrian commented 1 year ago

Labeled as medium priority because this impacts good-path failover and failback. The workaround is for the user to replicate the recipe, but the purpose of Ramen is to orchestrate and automate disaster recovery, so recipe replication should be included in the automation.

hatfieldbrian commented 1 year ago

PR #1090 introduces recipe parameters. The recovery workflow needs to have its parameters expanded. The values should be those at capture time. A ProtectedVolumeReplicationGroupList lists the latest VRG, which may be newer than the latest backup.
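To illustrate expanding a workflow string with capture-time values, here is a rough sketch. The `$(name)` placeholder syntax, the flat `map[string]string` of values, and the `Expand` function are assumptions made for this example; they are not taken from PR #1090, and the actual parameter format may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches an assumed "$(name)" parameter reference.
var placeholder = regexp.MustCompile(`\$\(([A-Za-z0-9_]+)\)`)

// Expand replaces every $(name) in s with the value recorded at capture
// time. It fails if a referenced parameter has no captured value, rather
// than silently falling back to a newer value from the live VRG.
func Expand(s string, captured map[string]string) (string, error) {
	var missing string
	out := placeholder.ReplaceAllStringFunc(s, func(m string) string {
		name := m[2 : len(m)-1] // strip the leading "$(" and trailing ")"
		v, ok := captured[name]
		if !ok {
			if missing == "" {
				missing = name
			}
			return m
		}
		return v
	})
	if missing != "" {
		return "", fmt.Errorf("parameter %q has no captured value", missing)
	}
	return out, nil
}
```

In this sketch the `captured` map would be stored alongside the backup, so that recovery expands parameters with the same values the capture used even if the VRG has since changed.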