Failover fails due to unavailable s3 bucket #533

gauthiersiri commented 2 years ago

Hi Folks,

Following https://red-hat-storage.github.io/ocs-training/training/ocs4/odf410-multisite-ramen.html, I built a platform to test Ramen with RHACM (OCP/ODF on each managed cluster, with the S3 buckets hosted as ODF NooBaa buckets on each cluster).

One of my scenarios is to power off the primary cluster and trigger a failover.

Sometimes it works, sometimes not. Looking at the logs, Ramen appears to be trying to reach the S3 bucket on the primary site, which is powered off, and the failover won't proceed (I waited a few hours just in case, but nothing happened) until I power the primary cluster back on, which defeats the purpose of DR :D I was expecting that if the first bucket is unreachable, Ramen would try the second one, or fall back to the "local" bucket (as defined in the DRPolicy) for the secondary cluster.
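
In case it helps, here is roughly how I was checking where things were stuck while the primary was down (namespaces are from my setup, so treat them as placeholders):

```sh
# On the hub: the DRPlacementControl (drpc) shows the requested action and target cluster
oc get drpc -A -o yaml

# The DRPolicy shows which S3 profile each cluster is expected to use
oc get drpolicy -o yaml

# On the surviving managed cluster: the VolumeReplicationGroup (vrg) conditions show why it is not progressing
oc get vrg -n <app-namespace> -o yaml
```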

Did I miss something?

Note: to bypass the problem, I've tried with two buckets in the cloud (i.e. neither bucket hosted on any of the OCP clusters), and so far it works.

For reference, here is the error I got on the secondary site: [screenshot of the VRG error]

Thanks!

ShyamsundarR commented 2 years ago

Apologies for a late response.

The expectation is correct: the PVs should be restored from the alternate, surviving instance. We did have some issues, depending on the number of PVCs in the workload and the time taken for the S3 connection to fail, which were fixed recently.

The error you pasted is an upload failure, which occurs once a failover is complete: at that point the VRG on the surviving cluster attempts to protect the PVs again to both S3 stores, and keeps retrying until the data is uploaded to both. It reports its status.Condition[].Type ClusterDataReady as false until it can protect the workload again.
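
For reference, a rough way to watch that condition on the failover cluster (the VRG name and namespace below are placeholders):

```sh
# List the VRG conditions; ClusterDataReady stays False until the cluster data (PVs)
# has been uploaded to all configured S3 stores again
oc get vrg <vrg-name> -n <app-namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
```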

Hence this would only succeed once the failed cluster is restarted.

This should not prevent the workload from starting on the surviving cluster, though. Were the pods/PVCs present on the failoverCluster with the VRG reporting errors, or did nothing come up on the failoverCluster at all?
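
Something along these lines on the failoverCluster would answer that (namespace is a placeholder):

```sh
# Did the PVCs get restored/bound and the workload pods scheduled?
oc get pvc,pods -n <app-namespace>

# Is a VRG present there, and what errors is it reporting?
oc get vrg -n <app-namespace> -o yaml
```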

As we do not have releases, I am unsure which version you are running. It would help if we could get the pod logs of the hub and the dr-cluster operators (typically in the ramen-system namespace, or, since you are using ODF, possibly in the openshift-operators and openshift-dr-cluster namespaces). It would also help to see the state of resources such as the PlacementRule on the hub cluster to troubleshoot this further.
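
Roughly the following would cover it (the deployment names are the typical defaults; adjust them and the namespaces to whatever `oc get deploy` shows on your clusters):

```sh
# Hub cluster: logs of the hub operator (ramen-system upstream; with ODF, wherever the operator is installed)
oc logs deploy/ramen-hub-operator -n <ramen-hub-namespace> > hub-operator.log

# Each managed cluster: logs of the dr-cluster operator
oc logs deploy/ramen-dr-cluster-operator -n <ramen-dr-namespace> > dr-cluster-operator.log

# Hub cluster: state of the PlacementRule driving the workload
oc get placementrule -n <app-namespace> -o yaml
```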

gauthiersiri commented 2 years ago

Hi @ShyamsundarR , no worries :)

Version-wise, I'm running whatever is provided by the OpenShift DR Hub Operator (v4.10.5), so I can't really say...

Then, regarding the workload: the problem is that it actually doesn't start until the failed cluster is restarted.

I will try to reproduce the issue and provide logs (in the hurry I didn't collect them, I just kept a trace of the VRG error :) )

ShyamsundarR commented 2 years ago

> Version-wise, I'm running whatever is provided by the OpenShift DR Hub Operator (v4.10.5), so I can't really say...

If you are retesting, I would suggest v4.11.0, as it carries a lot more fixes for the component.
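
If it helps map the installed version to the code, something like this should show the installed CSVs and the exact operator images (the namespace and the grep pattern are the typical defaults and may differ on your clusters):

```sh
# Installed operator versions on the hub
oc get csv -n openshift-operators

# Exact images the DR/ramen operator deployments are running
oc get deploy -n openshift-operators \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
  | grep -i -e ramen -e dr
```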