RamenDR / ramen


Error relocating/failing over an application #1104

Open · asn1809 opened this issue 1 year ago

asn1809 commented 1 year ago

With the multi-namespace-1053 branch, issues are seen in the failover operation with the below error message:

2023-10-20T17:31:45.783Z    ERROR   controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1746  Failed to update PersistentVolumeClaim annotation   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary", "pvc": "blr-maj/filebrowser-pvc", "error": "Operation cannot be fulfilled on persistentvolumeclaims \"filebrowser-pvc\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/persistentvolumeclaims/blr-maj/filebrowser-pvc, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7d5ea473-aeb8-4155-848e-9243ff01534c, UID in object meta: bbfd7c8d-7e4d-4004-9e1c-2697c86ce167"}
github.com/ramendr/ramen/controllers.(*VRGInstance).addArchivedAnnotationForPVC
    /workspace/controllers/vrg_volrep.go:1746
github.com/ramendr/ramen/controllers.(*VRGInstance).uploadPVandPVCtoS3Stores
    /workspace/controllers/vrg_volrep.go:570
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVolRepsAsPrimary
    /workspace/controllers/vrg_volrep.go:74
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
    /workspace/controllers/volumereplicationgroup_controller.go:902
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
    /workspace/controllers/volumereplicationgroup_controller.go:879
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
    /workspace/controllers/volumereplicationgroup_controller.go:558
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
    /workspace/controllers/volumereplicationgroup_controller.go:455
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

@hatfieldbrian As discussed, can you please check and help us resolve the issue?

The Ramen docker image was built using the details below:

UPSTREAM_RAMEN_REPO=https://github.com/hatfieldbrian/ramen.git
GIT_TAG=multi-namespace-1053
COMMMIT_ID=ab681c935abbbc09297f7dc2423d85fb2328d635

hatfieldbrian commented 1 year ago

Here is the ramen.log that @asn1809 shared with me on Saturday night.

hatfieldbrian commented 1 year ago

Ramen manager container starts

2023-10-20T11:52:29.058Z    INFO    setup   controllers/ramenconfig.go:62   loading Ramen configuration from    {"file": "/config/ramen_manager_config.yaml"}
2023-10-20T11:52:29.059Z    INFO    setup   controllers/ramenconfig.go:70   s3 profile  {"key": 0, "value": {"s3ProfileName":"site1","s3Bucket":"isf-minio-site1","s3CompatibleEndpoint":"https://isf-minio-ibm-spectrum-fusion-ns.apps.rackae1.mydomain.com","s3Region":"site1","s3SecretRef":{"name":"isf-minio-site2","namespace":"ibm-spectrum-fusion-ns"}}}
2023-10-20T11:52:29.059Z    INFO    setup   controllers/ramenconfig.go:70   s3 profile  {"key": 1, "value": {"s3ProfileName":"site2","s3Bucket":"isf-minio-site2","s3CompatibleEndpoint":"https://isf-minio-ibm-spectrum-fusion-ns.apps.rackae2.mydomain.com","s3Region":"site2","s3SecretRef":{"name":"isf-minio-site2","namespace":"ibm-spectrum-fusion-ns"}}}
I1020 11:52:30.505189       1 request.go:690] Waited for 1.049368431s due to client-side throttling, not priority and fairness, request: GET:https://172.31.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
2023-10-20T11:52:33.561Z    INFO    controller-runtime.metrics  metrics/listener.go:44  Metrics server is starting to listen    {"addr": "127.0.0.1:9289"}
2023-10-20T11:52:33.561Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:62 Adding VolumeReplicationGroup controller
2023-10-20T11:52:33.561Z    INFO    controllers.VolumeReplicationGroup  controllers/ramenconfig.go:86   loading Ramen config file   {"name": "/config/ramen_manager_config.yaml"}
2023-10-20T11:52:33.562Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:101    VolSync disabled; don't own volsync resources
2023-10-20T11:52:33.562Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:110    Kube object protection disabled; don't watch kube objects requests
2023-10-20T11:52:33.562Z    INFO    setup   workspace/main.go:213   starting manager
2023-10-20T11:52:33.562Z    INFO    manager/internal.go:369 Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:9289"}
2023-10-20T11:52:33.562Z    INFO    manager/internal.go:369 Starting server {"kind": "health probe", "addr": "[::]:8081"}
I1020 11:52:34.663115       1 leaderelection.go:248] attempting to acquire leader lease ibm-spectrum-fusion-ns/dr-cluster.ramendr.openshift.io...
I1020 11:52:56.000645       1 leaderelection.go:258] successfully acquired lease ibm-spectrum-fusion-ns/dr-cluster.ramendr.openshift.io
2023-10-20T11:52:56.000Z    DEBUG   events  recorder/recorder.go:103    ramen-dr-cluster-operator-797c68655f-mcb4l_8456dc4e-a7c0-41cc-8b90-857af0398e42 became leader   {"type": "Normal", "object": {"kind":"Lease","namespace":"ibm-spectrum-fusion-ns","name":"dr-cluster.ramendr.openshift.io","uid":"0c698d55-d2ee-4642-965b-4acbda332155","apiVersion":"coordination.k8s.io/v1","resourceVersion":"14033688"}, "reason": "LeaderElection"}
2023-10-20T11:52:56.000Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList", "source": "kind source: *v1alpha1.ProtectedVolumeReplicationGroupList"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:194    Starting Controller {"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2023-10-20T11:52:56.000Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "source": "kind source: *v1alpha1.VolumeReplicationGroup"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "source": "kind source: *v1.PersistentVolumeClaim"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "source": "kind source: *v1.PersistentVolumeClaim"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "source": "kind source: *v1.ConfigMap"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:186    Starting EventSource    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "source": "kind source: *v1alpha1.VolumeReplication"}
2023-10-20T11:52:56.001Z    INFO    controller/controller.go:194    Starting Controller {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup"}

8 PVCs are created

2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T11:52:56.103Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim

More Ramen controllers start and the Ramen configuration map is updated

2023-10-20T11:52:58.854Z    INFO    controller/controller.go:228    Starting workers    {"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList", "worker count": 1}
2023-10-20T11:52:58.855Z    INFO    controller/controller.go:228    Starting workers    {"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "worker count": 1}
2023-10-20T11:52:58.855Z    INFO    configmap.VolumeReplicationGroup    controllers/volumereplicationgroup_controller.go:137    Update in ramen-dr-cluster-operator-config configuration map

VRG blr-maj/blr-maj reconcile starts

2023-10-20T17:31:45.015Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:405    Entering reconcile loop {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83"}
2023-10-20T17:31:45.021Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:537    Recipe  {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "elements": {"PvcSelector":{"LabelSelector":{},"NamespaceNames":["blr-maj"]},"CaptureWorkflow":null,"RecoverWorkflow":null}}
2023-10-20T17:31:45.021Z    INFO    controllers.VolumeReplicationGroup  util/pvcs_util.go:61    Fetching PersistentVolumeClaims {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "pvcSelector": ""}
2023-10-20T17:31:45.021Z    INFO    controllers.VolumeReplicationGroup  util/pvcs_util.go:76    Found 8 PVCs using label selector   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83"}
2023-10-20T17:31:45.021Z    INFO    controllers.VolumeReplicationGroup  controllers/volumereplicationgroup_controller.go:666    Found PersistentVolumeClaims    {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "count": 0}
2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/volumereplicationgroup_controller.go:870    Entering processing VolumeReplicationGroup as Primary   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}

ClusterDataReady is false, so PVs and PVCs are restored from S3. There is one of each. The PVC is named blr-maj/filebrowser-pvc.

2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/volumereplicationgroup_controller.go:610    ClusterDataReady condition  {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary", "status": "Unknown", "reason": "Initializing", "message": "Initializing VolumeReplicationGroup", "observedGeneration": 1, "generation": 1}
2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volsync.go:18   VolSync: Restoring VolSync PVs  {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volsync.go:21   No RDSpec entries. There are no PVCs to restore {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1796  Restoring VolRep PVs and PVCs   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.025Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1806  Restoring PVs and PVCs to this managed cluster. ProfileList: [site1 site2]  {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.164Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1889  Found 1 PVs in s3 store using profile site1 {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.168Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:2006  Restored 1 PV for VolRep    {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.172Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1910  Found 1 PVCs in s3 store using profile site1    {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.179Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:172    Create event for PersistentVolumeClaim
2023-10-20T17:31:45.179Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:297    Found VolumeReplicationGroup with matching labels   {"pvc": "blr-maj/filebrowser-pvc", "vrg": "blr-maj", "labeled": ""}
2023-10-20T17:31:45.179Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:2006  Restored 1 PVC for VolRep   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}
2023-10-20T17:31:45.179Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1867  Restored 1 PVs and 1 PVCs using profile site1   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary"}

KubeObjectProtection is disabled in the config map and the VRG. Some PVC update events enqueue the VRG to be reconciled again.

2023-10-20T17:31:45.179Z    INFO    controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_kubeobjects.go:657  Kube object protection  {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary", "disabled": true, "VRG": true, "configMap": true, "for": "recovery"}
2023-10-20T17:31:45.254Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:191    Update event for PersistentVolumeClaim
2023-10-20T17:31:45.254Z    INFO    RDPredicate.RD  controllers/volumereplicationgroup_controller.go:323    Failed to deep copy older MCV
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/vrg_volrep.go:368   Skipping handling of VR as PersistentVolumeClaim is not bound   {"pvc": "blr-maj/filebrowser-pvc", "pvcPhase": "Pending"}
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:254    Not Requeuing   {"pvc": "blr-maj/filebrowser-pvc", "oldPVC Phase": "Pending", "newPVC phase": "Pending"}
2023-10-20T17:31:45.255Z    INFO    RDPredicate.RD  controllers/volumereplicationgroup_controller.go:323    Failed to deep copy older MCV
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:191    Update event for PersistentVolumeClaim
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:226    Reconciling due to phase change {"pvc": "blr-maj/filebrowser-pvc", "oldPhase": "Pending", "newPhase": "Bound"}
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:297    Found VolumeReplicationGroup with matching labels   {"pvc": "blr-maj/filebrowser-pvc", "vrg": "blr-maj", "labeled": ""}
2023-10-20T17:31:45.255Z    INFO    pvcmap.VolumeReplicationGroup   controllers/volumereplicationgroup_controller.go:297    Found VolumeReplicationGroup with matching labels   {"pvc": "blr-maj/filebrowser-pvc", "vrg": "blr-maj", "labeled": ""}

The VRG controller tries to annotate the just-restored PVC blr-maj/filebrowser-pvc with volumereplicationgroups.ramendr.openshift.io/vr-archived: archiveV1-<PVC Generation Number>, but it fails because the provided PVC UID does not match the expected one.

2023-10-20T17:31:45.783Z    ERROR   controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_volrep.go:1746  Failed to update PersistentVolumeClaim annotation   {"VolumeReplicationGroup": "blr-maj/blr-maj", "rid": "5dfe0f02-5483-43f8-a6d8-6c519f883d83", "State": "primary", "pvc": "blr-maj/filebrowser-pvc", "error": "Operation cannot be fulfilled on persistentvolumeclaims \"filebrowser-pvc\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/persistentvolumeclaims/blr-maj/filebrowser-pvc, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7d5ea473-aeb8-4155-848e-9243ff01534c, UID in object meta: bbfd7c8d-7e4d-4004-9e1c-2697c86ce167"}
github.com/ramendr/ramen/controllers.(*VRGInstance).addArchivedAnnotationForPVC
    /workspace/controllers/vrg_volrep.go:1746
github.com/ramendr/ramen/controllers.(*VRGInstance).uploadPVandPVCtoS3Stores
    /workspace/controllers/vrg_volrep.go:570
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileVolRepsAsPrimary
    /workspace/controllers/vrg_volrep.go:74
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
    /workspace/controllers/volumereplicationgroup_controller.go:902
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
    /workspace/controllers/volumereplicationgroup_controller.go:879
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
    /workspace/controllers/volumereplicationgroup_controller.go:558
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
    /workspace/controllers/volumereplicationgroup_controller.go:455
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
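The mechanism behind the error: a non-empty metadata.uid on an object submitted for update acts as a storage-level precondition, and the object restored from S3 still carries the UID from the backup. A minimal sketch of the failing shape (illustrative only, not Ramen's actual code; the annotation value is assumed):

    package sketch

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // annotateRestoredPVC mimics the failing shape: it annotates the object
    // deserialized from the S3 backup and submits it as-is. metadata.uid still
    // holds the source cluster's UID, so the API server treats it as a
    // precondition and rejects the write with the StorageError (Code: 4) above.
    func annotateRestoredPVC(ctx context.Context, c client.Client, fromS3 *corev1.PersistentVolumeClaim) error {
        pvc := fromS3.DeepCopy() // UID: 7d5ea473-... (from the backup)
        if pvc.Annotations == nil {
            pvc.Annotations = map[string]string{}
        }
        pvc.Annotations["volumereplicationgroups.ramendr.openshift.io/vr-archived"] = "archiveV1-1" // value assumed

        return c.Update(ctx, pvc) // live object's UID is bbfd7c8d-... -> precondition failure
    }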

When a PVC is created, it gets a new UID. The VRG controller gets the PVC from S3 with the UID from the other cluster and stores it in VRGInstance.volRepPVCs:

https://github.com/RamenDR/ramen/blob/f6afc6fed2cfa62c93778f5296a4cf1bb96a1325/controllers/vrg_volrep.go#L1899-L1912

This same PVC is updated with the annotation and then submitted to the API server:

https://github.com/RamenDR/ramen/blob/f6afc6fed2cfa62c93778f5296a4cf1bb96a1325/controllers/vrg_volrep.go#L32-L34
https://github.com/RamenDR/ramen/blob/f6afc6fed2cfa62c93778f5296a4cf1bb96a1325/controllers/vrg_volrep.go#L74
https://github.com/RamenDR/ramen/blob/f6afc6fed2cfa62c93778f5296a4cf1bb96a1325/controllers/vrg_volrep.go#L532
https://github.com/RamenDR/ramen/blob/f6afc6fed2cfa62c93778f5296a4cf1bb96a1325/controllers/vrg_volrep.go#L569

It seems the PVC needs to be read back from the API server to get its new UID so it can be updated, or perhaps the UID could be reset. It is not obvious to me why Fusion is encountering this issue when it has not been discovered previously.
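A minimal sketch of the first remedy (reading the PVC back from the API server before updating); the helper name and annotation value are illustrative, and imports are as in the earlier sketch:

    // annotateWithLiveUID reads the PVC back so the update carries the live
    // UID. The alternative remedy mentioned above would be pvc.SetUID("") on
    // the restored object, which drops the UID precondition entirely.
    func annotateWithLiveUID(ctx context.Context, c client.Client, pvc *corev1.PersistentVolumeClaim) error {
        live := &corev1.PersistentVolumeClaim{}
        key := client.ObjectKey{Namespace: pvc.Namespace, Name: pvc.Name}
        if err := c.Get(ctx, key, live); err != nil {
            return err
        }
        if live.Annotations == nil {
            live.Annotations = map[string]string{}
        }
        live.Annotations["volumereplicationgroups.ramendr.openshift.io/vr-archived"] = "archiveV1-1" // value assumed

        return c.Update(ctx, live) // UID now matches the live object
    }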

@asn1809 presuming you can reproduce this issue, will you please try to do so with the main branch to determine whether this is an issue with PR #1090?

ShyamsundarR commented 1 year ago

This commit seems to be the issue? It appears to change working on a reference to working on a copy, which in turn does not pick up the changes made by cleanupForRestore (which removes the UID, and hence avoids using the stale one from the S3 store).
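A generic illustration of that copy-vs-reference pitfall (not the commit's actual code): a mutation applied to a range-loop copy never reaches the slice element that is later submitted.

    package sketch

    import corev1 "k8s.io/api/core/v1"

    func cleanupPitfall(pvcs []corev1.PersistentVolumeClaim) {
        // Ranging by value hands each iteration a copy, so clearing the stale
        // UID here leaves the slice elements untouched.
        for _, pvc := range pvcs {
            pvc.SetUID("") // lost when the loop copy goes out of scope
        }

        // Working on a reference applies the cleanup to the stored elements.
        for i := range pvcs {
            pvcs[i].SetUID("")
        }
    }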

ShyamsundarR commented 1 year ago

This could also be due to the change where Ramen now restores the PVCs, which was not the case before, hence causing the UID mismatch. Checking further with @raghavendra-talur on how this works with Ceph backends and why this issue does not crop up there.

hatfieldbrian commented 1 year ago

@asn1809 Thank you for discovering and reporting this issue. It should be fixed. However, I wonder what the impact was. From the log, it seems to have recovered on the next reconcile. Did the application recover successfully?

asn1809 commented 1 year ago

In the VRG, for the condition ClusterDataProtected, below is the error seen:

failed to add archived annotation for PVC (blr-trinity/filebrowser-pvc) with error (failed to update PersistentVolumeClaim (blr-trinity/filebrowser-pvc) annotation (volumereplicationgroups.ramendr.openshift.io/vr-archived) belonging toVolumeReplicationGroup (blr-trinity/blr-trinity), Operation cannot be fulfilled on persistentvolumeclaims "filebrowser-pvc": StorageError: invalid object, Code: 4, Key: /kubernetes.io/persistentvolumeclaims/blr-trinity/filebrowser-pvc, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 0276d756-4bac-4743-bd80-725eb8114cc8, UID in object meta: 3e978ee0-3058-46dc-a080-ac25d72a5ec2)

asn1809 commented 1 year ago

@asn1809 Thank you for discovering and reporting this issue. It should be fixed. However, I wonder what the impact was. From the log, it seems to have recovered on the next reconcile. Did the application recover successfully?

The impact to Fusion is that even though there might be recovery as you mentioned, it is not properly reflected in the VRG, and thereby in the Application CR and the UI reading it.

hatfieldbrian commented 1 year ago

From @pdumbre this morning: 4 protected PVCs with the same name, none with a namespace, and generation is 1. One theory is that the unconditional append on restore is doing it. Maybe restore failed 3 times, each time after adding the PVC to the status (see the sketch after the VRG YAML below).

apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  creationTimestamp: '2023-11-01T10:56:17Z'
  finalizers:
    - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 1
  managedFields:
    - apiVersion: ramendr.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:finalizers':
            .: {}
            'v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection"': {}
        'f:spec':
          .: {}
          'f:pvcSelector': {}
          'f:replicationState': {}
          'f:s3Profiles': {}
          'f:sync': {}
          'f:volSync':
            .: {}
            'f:disabled': {}
      manager: Mozilla
      operation: Update
      time: '2023-11-01T10:56:17Z'
    - apiVersion: ramendr.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:conditions': {}
          'f:kubeObjectProtection': {}
          'f:lastUpdateTime': {}
          'f:observedGeneration': {}
          'f:protectedPVCs': {}
          'f:state': {}
      manager: manager
      operation: Update
      subresource: status
      time: '2023-11-02T12:21:43Z'
  name: shio
  namespace: shio
  resourceVersion: '8460539'
  uid: 1e6e2579-80c5-4154-bb6f-5fd4777aecbe
spec:
  pvcSelector: {}
  replicationState: primary
  s3Profiles:
    - site2
    - site1
  sync: {}
  volSync:
    disabled: true
status:
  conditions:
    - lastTransitionTime: '2023-11-01T10:56:18Z'
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: 'True'
      type: DataReady
    - lastTransitionTime: '2023-11-01T10:56:18Z'
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: 'False'
      type: DataProtected
    - lastTransitionTime: '2023-11-01T10:56:17Z'
      message: Restored cluster data
      observedGeneration: 1
      reason: Restored
      status: 'True'
      type: ClusterDataReady
    - lastTransitionTime: '2023-11-01T10:56:18Z'
      message: Cluster data of one or more PVs are in the process of being protected
      observedGeneration: 1
      reason: Uploading
      status: 'False'
      type: ClusterDataProtected
  kubeObjectProtection: {}
  lastUpdateTime: '2023-11-02T12:21:43Z'
  observedGeneration: 1
  protectedPVCs:
    - conditions:
        - lastTransitionTime: '2023-11-01T10:56:17Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: 'True'
          type: DataReady
        - lastTransitionTime: '2023-11-01T10:56:17Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: 'False'
          type: DataProtected
      name: br-pvc
      resources: {}
    - conditions:
        - lastTransitionTime: '2023-11-02T03:05:54Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: 'True'
          type: DataReady
        - lastTransitionTime: '2023-11-02T03:05:54Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: 'False'
          type: DataProtected
      name: br-pvc
      resources: {}
    - conditions:
        - lastTransitionTime: '2023-11-02T05:05:35Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: 'True'
          type: DataReady
        - lastTransitionTime: '2023-11-02T05:05:35Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: 'False'
          type: DataProtected
      name: br-pvc
      resources: {}
    - conditions:
        - lastTransitionTime: '2023-11-02T12:21:43Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: 'True'
          type: DataReady
        - lastTransitionTime: '2023-11-02T12:21:43Z'
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: 'False'
          type: DataProtected
      name: br-pvc
      resources: {}
  state: Primary
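If the unconditional-append theory holds, the usual guard is an upsert keyed by namespace and name instead of a bare append. A hypothetical sketch, not a proposed patch (ramendrv1alpha1 is the assumed API package alias, and a Namespace field on ProtectedPVC is assumed from the multi-namespace work):

    // upsertProtectedPVC replaces a bare append: repeated restore attempts
    // update one status entry instead of accumulating duplicates. The
    // Namespace field is assumed here.
    func upsertProtectedPVC(protectedPVCs []ramendrv1alpha1.ProtectedPVC, pvc ramendrv1alpha1.ProtectedPVC) []ramendrv1alpha1.ProtectedPVC {
        for i := range protectedPVCs {
            if protectedPVCs[i].Name == pvc.Name && protectedPVCs[i].Namespace == pvc.Namespace {
                protectedPVCs[i] = pvc // update the existing entry in place
                return protectedPVCs
            }
        }
        return append(protectedPVCs, pvc) // first sighting of this PVC
    }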

hatfieldbrian commented 1 year ago

@asn1809 Please add these two list-type and map-key marker lines to the volumereplicationgroups_types.go file, directly above the ProtectedPVCs slice definition:

    //+listType=map
    //+listMapKey=name
    // All the protected pvcs
    ProtectedPVCs []ProtectedPVC `json:"protectedPVCs,omitempty"`

Then run make manifests to generate a new CRD yaml file. This leverages API server admission control to effectively treat the slice as a map and prevent more than one entry with the same name. It should get us to the reconcile where the first duplicate entry is attempted to be added. This patch is not meant to be promoted, since the multi-namespace feature allows duplicate PVC names as long as they are in different namespaces; it would require the namespace and name fields to be joined as the key.
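For joining the fields, kubebuilder accepts multiple listMapKey markers, so the composite key could look like the sketch below, assuming ProtectedPVC gains a required namespace JSON field (map keys must be required or defaulted):

    //+listType=map
    //+listMapKey=namespace
    //+listMapKey=name
    // All the protected pvcs
    ProtectedPVCs []ProtectedPVC `json:"protectedPVCs,omitempty"`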