RamenDR / ramen

Apache License 2.0

Unplanned failover results into some pvc condition in wrong condition #1419

Closed asn1809 closed 3 months ago

asn1809 commented 4 months ago

Steps that were followed:

  1. Create around 200 applications on site1
  2. Enroll/Protect the applications on site1
  3. Simulate unplanned failover from site1 to site2 by bringing down site1 or breaking network connection between the sites.
  4. Restore site1.

Desired Result: The ‘ClusterDataProtected’ condition in the VRGs should be ‘Uploaded’, without any error condition.

Actual Result: For some of the VRGs, the ‘ClusterDataProtected’ condition still shows ‘UploadError’, even after site1 is restored.

Logs from the ramen pod:

2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/volumereplicationgroup_controller.go:905    VRG's ClusterDataReady condition found. PV restore must have already been applied   {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary"}
2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:560   PV cluster data already protected for PVC   {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary", "PVC": "filebrowser-pvc"}
2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:92    Successfully processed VolumeReplication for PersistentVolumeClaim  {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary", "pvc": "ae7-fb-4/filebrowser-pvc"}
2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_kubeobjects.go:656  Kube object protection  {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary", "disabled": true, "VRG": true, "configMap": true, "for": "capture"}
2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_vrgobject.go:21 VRG resource version unchanged, skip S3 upload  {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary", "version": "25021150"}
2024-05-17T08:52:12.956Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:2276  Condition for DataReady {"VolumeReplicationGroup": {"name":"ae7-fb-4","namespace":"ae7-fb-4"}, "rid": "87928aef-fc07-4304-ad9c-d16e71d9b21b", "State": "primary", "cond": "&Condition{Type:DataReady,Status:True,ObservedGeneration:1,LastTransitionTime:2024-05-16 16:29:35 +0000 UTC,Reason:Ready,Message:PVC in the VolumeReplicationGroup is ready for use,}", "protectedPVC": {"namespace":"ae7-fb-4","name":"filebrowser-pvc","storageID":{"id":""},"replicationID":{"id":""},"resources":{},"conditions":[{"type":"DataReady","status":"True","observedGeneration":1,"lastTransitionTime":"2024-05-16T16:29:35Z","reason":"Ready","message":"PVC in the VolumeReplicationGroup is ready for use"},{"type":"DataProtected","status":"False","observedGeneration":1,"lastTransitionTime":"2024-05-16T16:29:35Z","reason":"Replicating","message":"PVC in the VolumeReplicationGroup is ready for use"},{"type":"ClusterDataProtected","status":"False","observedGeneration":1,"lastTransitionTime":"2024-05-16T16:31:38Z","reason":"UploadError","message":"error uploading PV to s3Profile site1, failed to protect cluster data for PVC filebrowser-pvc, failed to upload data of isf-minio-site1:ae7-fb-4/ae7-fb-4/v1.PersistentVolume/pvc-5e0b6858-3fc9-485e-9d20-1ad279c150fe: code: SerializationError, message: failed to unmarshal error message"}]}}
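To spot the affected PVCs across many VRGs, the "protectedPVC" object from the last log line can be decoded and filtered. The following is a minimal sketch, not Ramen code: the struct and function names are hypothetical, only the JSON field names come from the log above, and the sample message is trimmed for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pvcCondition and protectedPVC are hypothetical types that only mirror the
// JSON field names visible in the "protectedPVC" log entry.
type pvcCondition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

type protectedPVC struct {
	Namespace  string         `json:"namespace"`
	Name       string         `json:"name"`
	Conditions []pvcCondition `json:"conditions"`
}

// sampleProtectedPVC is a trimmed copy of the logged object (message shortened).
const sampleProtectedPVC = `{"namespace":"ae7-fb-4","name":"filebrowser-pvc","conditions":[
 {"type":"DataReady","status":"True","reason":"Ready","message":"PVC in the VolumeReplicationGroup is ready for use"},
 {"type":"DataProtected","status":"False","reason":"Replicating","message":"PVC in the VolumeReplicationGroup is ready for use"},
 {"type":"ClusterDataProtected","status":"False","reason":"UploadError","message":"error uploading PV to s3Profile site1"}]}`

// uploadErrors returns the ClusterDataProtected conditions whose reason is UploadError.
func uploadErrors(raw string) ([]pvcCondition, error) {
	var pvc protectedPVC
	if err := json.Unmarshal([]byte(raw), &pvc); err != nil {
		return nil, err
	}
	var out []pvcCondition
	for _, c := range pvc.Conditions {
		if c.Type == "ClusterDataProtected" && c.Reason == "UploadError" {
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	conds, err := uploadErrors(sampleProtectedPVC)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, c := range conds {
		fmt.Printf("%s status=%s reason=%s: %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```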

ShyamsundarR commented 4 months ago

The only reason I can think of is as below:

From a fix POV, calling v.updatePVCClusterDataProtectedCondition(pvc.Namespace, pvc.Name, VRGConditionReasonUploaded, msg) right after isArchivedAlready reports success would help clear out any stale stashed condition.
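
The proposed change can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the actual Ramen code: vrgInstance, condition, pvcKey and protectPVClusterData are hypothetical stand-ins, the archivedAlready argument stands in for the result of isArchivedAlready, and the refreshOnArchived flag exists only to compare behaviour with and without the suggested call. The method name and the reason values are the ones mentioned in this thread.

```go
package main

import "fmt"

// Reason values as they appear in this thread; the constant names are assumptions.
const (
	reasonUploaded    = "Uploaded"
	reasonUploadError = "UploadError"
)

// condition is a simplified stand-in for a protected PVC condition.
type condition struct {
	Status  bool
	Reason  string
	Message string
}

// vrgInstance is a hypothetical, stripped-down model of the VRG instance.
// It keeps the last stashed ClusterDataProtected condition per "namespace/name".
type vrgInstance struct {
	clusterDataProtected map[string]condition
}

func pvcKey(namespace, name string) string { return namespace + "/" + name }

// updatePVCClusterDataProtectedCondition overwrites the stashed condition.
// The name follows the thread; the body is a simplification.
func (v *vrgInstance) updatePVCClusterDataProtectedCondition(namespace, name, reason, msg string) {
	v.clusterDataProtected[pvcKey(namespace, name)] = condition{
		Status:  reason == reasonUploaded,
		Reason:  reason,
		Message: msg,
	}
}

// protectPVClusterData models the early-return path taken when the PV is
// already archived. With refreshOnArchived set, it applies the suggested fix
// and rewrites the condition before returning.
func (v *vrgInstance) protectPVClusterData(namespace, name string, archivedAlready, refreshOnArchived bool) {
	if archivedAlready {
		if refreshOnArchived {
			msg := "PV cluster data already protected for PVC " + name
			v.updatePVCClusterDataProtectedCondition(namespace, name, reasonUploaded, msg)
		}
		return
	}
	// The upload path is omitted from this sketch.
}

func main() {
	for _, refresh := range []bool{false, true} {
		v := &vrgInstance{clusterDataProtected: map[string]condition{}}
		// Stash a stale error, as left behind by a failed upload.
		v.updatePVCClusterDataProtectedCondition("ae7-fb-4", "filebrowser-pvc", reasonUploadError, "error uploading PV")
		v.protectPVClusterData("ae7-fb-4", "filebrowser-pvc", true, refresh)
		c := v.clusterDataProtected[pvcKey("ae7-fb-4", "filebrowser-pvc")]
		fmt.Printf("refresh=%v reason=%s status=%v\n", refresh, c.Reason, c.Status)
	}
}
```

In this model the stale UploadError survives the early return unless the condition is rewritten on the already-archived path, which matches the behaviour described in the report.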