njjry opened this issue 1 year ago
Adding some details to this, as I encountered it recently.
It appears that if no PVCs are found, either by the PVC label selector or by the recipe's volume group, Ramen won't populate the VRG's status.state with a Primary or Secondary value. That's the main issue: the VRG just looks like it isn't reconciling and leaves status.state as Unknown or empty.
However, a missing status.state does not prevent kubeObjectProtection from taking backups. And if the user fixes the issue (the VRG finds PVCs to protect), it will interpret those existing backups as candidates to restore when Primary status is achieved. That second part is perhaps a "bug of a bug", but those are the circumstances under which I experienced this.
I just encountered this during a failover on the "from" side, which was cluster6:
$ oc get vrg --context kafka/api-cluster6-local:6443/kube:admin -oyaml bb
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"VolumeReplicationGroup","metadata":{"annotations":{},"name":"bb","namespace":"kafka"},"spec":{"kubeObjectProtection":{"captureInterval":"1m","recipeRef":{"name":"recipe-kafka"}},"pvcSelector":{"matchLabels":{}},"replicationState":"primary","s3Profiles":["s3profile-cluster6-ocs-external-storagecluster","s3profile-cluster8-ocs-external-storagecluster"],"sync":{}}}
  creationTimestamp: "2023-07-25T23:36:42Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-07-26T00:17:36Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 5
  name: bb
  namespace: kafka
  resourceVersion: "308674758"
  uid: c33a49c3-87b0-4b49-98fe-a78771285a96
spec:
  action: Failover
  kubeObjectProtection:
    captureInterval: 1m0s
    recipeRef:
      name: recipe-kafka
  pvcSelector: {}
  replicationState: secondary
  s3Profiles:
  - s3profile-cluster6-ocs-external-storagecluster
  - s3profile-cluster8-ocs-external-storagecluster
  sync: {}
  volSync: {}
status:
  conditions:
  - lastTransitionTime: "2023-07-26T00:17:47Z"
    message: Failed to get list of pvcs
    observedGeneration: 5
    reason: Error
    status: "False"
    type: DataReady
  - lastTransitionTime: "2023-07-26T00:17:30Z"
    message: VolumeReplicationGroup is replicating
    observedGeneration: 4
    reason: Replicating
    status: "False"
    type: DataProtected
  - lastTransitionTime: "2023-07-25T23:36:43Z"
    message: Restored cluster data
    observedGeneration: 2
    reason: Restored
    status: "True"
    type: ClusterDataReady
  - lastTransitionTime: "2023-07-26T00:17:30Z"
    message: Cluster data of all PVs are protected
    observedGeneration: 4
    reason: Uploaded
    status: "True"
    type: ClusterDataProtected
  kubeObjectProtection:
    captureToRecoverFrom:
      number: 1
      startGeneration: 2
      startTime: "2023-07-26T00:04:21Z"
  lastUpdateTime: "2023-07-26T00:17:47Z"
  observedGeneration: 5
  protectedPVCs:
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:43Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:45Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-2
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:45Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:47Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-0
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:47Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:49Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-1
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:49Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:51Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-0
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PVC is potentially in use by a pod
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:51Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:53Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-kafka-2
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  - conditions:
    - lastTransitionTime: "2023-07-26T00:17:30Z"
      message: Secondary transition failed as PersistentVolume for PVC is still attached
        to node(s)
      observedGeneration: 4
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2023-07-25T23:36:53Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-07-25T23:36:54Z"
      message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-cluster6-ocs-external-storagecluster
        s3profile-cluster8-ocs-external-storagecluster]'
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    name: data-my-cluster-zookeeper-1
    replicationID:
      id: ""
    resources: {}
    storageID:
      id: ""
  state: Unknown
The DataReady error message "Failed to get list of pvcs" changed to "Failed to process list of PVCs to protect" with https://github.com/RamenDR/ramen/commit/977c263fa7e00f1d27476c4a3f0379ac77f9dc0e on May 9, but the behavior remains the same.
The problem is that the Recipe that the VRG references is deleted before the VRG is. The VRG controller doesn't know whether the recipe contains a PVC label selector.
2023-07-26T10:51:48.240Z ERROR controllers.VolumeReplicationGroup controllers/volumereplicationgroup_controller.go:301 GetRecipeWithName error: %s-%s {"VolumeReplicationGroup": "kafka/bb", "rid": "77422bac-e47c-41ef-846c-79c759c406ab", "bb": "kafka", "error": "Recipe.ramendr.openshift.io \"recipe-kafka\" not found"}
github.com/ramendr/ramen/controllers.GetPVCLabelSelector
/workspace/controllers/volumereplicationgroup_controller.go:301
github.com/ramendr/ramen/controllers.(*VRGInstance).listPVCsByPVCSelector
/workspace/controllers/volumereplicationgroup_controller.go:636
github.com/ramendr/ramen/controllers.(*VRGInstance).updatePVCList
/workspace/controllers/volumereplicationgroup_controller.go:647
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
/workspace/controllers/volumereplicationgroup_controller.go:499
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
/workspace/controllers/volumereplicationgroup_controller.go:404
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
2023-07-26T10:51:48.240Z ERROR controllers.VolumeReplicationGroup controllers/volumereplicationgroup_controller.go:500 Failed to update PersistentVolumeClaims for resource {"VolumeReplicationGroup": "kafka/bb", "rid": "77422bac-e47c-41ef-846c-79c759c406ab", "error": "Recipe.ramendr.openshift.io \"recipe-kafka\" not found"}
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
/workspace/controllers/volumereplicationgroup_controller.go:500
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
/workspace/controllers/volumereplicationgroup_controller.go:404
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
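Reading the stack trace bottom-up, the flow is processVRG -> updatePVCList -> listPVCsByPVCSelector -> GetPVCLabelSelector: the missing Recipe makes the PVC listing fail, and the reconcile returns before the code that would set status.state ever runs. A minimal, self-contained sketch of that shape (the function names come from the trace above; everything else here is illustrative, not the actual Ramen source):

package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the real Ramen types; illustrative only.
type vrgStatus struct {
	State string // stays "Unknown" until the reconciler sets Primary/Secondary
}

type vrgInstance struct {
	recipeExists bool
	status       vrgStatus
}

// updatePVCList stands in for the chain updatePVCList ->
// listPVCsByPVCSelector -> GetPVCLabelSelector, which fails when the
// referenced Recipe has already been deleted.
func (v *vrgInstance) updatePVCList() error {
	if !v.recipeExists {
		return errors.New(`Recipe.ramendr.openshift.io "recipe-kafka" not found`)
	}
	return nil
}

// processVRG sketches the early-return shape: on a PVC-list error the
// reconcile records DataReady=False and bails out before the code that
// would set status.State to Primary or Secondary is reached.
func (v *vrgInstance) processVRG() {
	if err := v.updatePVCList(); err != nil {
		fmt.Println("DataReady=False: Failed to get list of pvcs:", err)
		return // status.State is left as "Unknown"
	}
	v.status.State = "Secondary" // only set on the success path
}

func main() {
	v := &vrgInstance{status: vrgStatus{State: "Unknown"}}
	v.processVRG()
	fmt.Println("state:", v.status.State) // prints: state: Unknown
}

Run as-is, it prints state: Unknown, mirroring the stuck VRG status above.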
The user could either delete the recipe reference or restore the recipe. Note that a recipe may not contain a PVC label selector at all. Issues #860 and #861 suggest approaches to capture the recipe at Kube resources backup time.
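For the first of those workarounds (deleting the recipe reference), a hedged sketch using the controller-runtime client, addressing the VRG as an unstructured object so no Ramen Go types are needed; the name/namespace match the VRG above, and whether clearing recipeRef alone is enough to unblock reconciliation is untested here:

package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		panic(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		panic(err)
	}

	// Address the VRG as an unstructured object.
	vrg := &unstructured.Unstructured{}
	vrg.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "ramendr.openshift.io", Version: "v1alpha1",
		Kind: "VolumeReplicationGroup",
	})
	vrg.SetName("bb")
	vrg.SetNamespace("kafka")

	// JSON merge patch: setting a key to null removes it, i.e. this
	// drops spec.kubeObjectProtection.recipeRef from the VRG.
	patch := []byte(`{"spec":{"kubeObjectProtection":{"recipeRef":null}}}`)
	if err := c.Patch(context.TODO(), vrg,
		client.RawPatch(types.MergePatchType, patch)); err != nil {
		panic(err)
	}
}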
I recommend approach 1 for the Ramen code. For the failover and failback scripts, I recommend adding a finalizer to the recipe and removing it once the VRG is finally deleted.
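A sketch of what that finalizer handling could look like if the scripts were written in Go with controller-runtime; the finalizer name is invented for illustration:

package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name owned by the failover/failback scripts.
const recipeFinalizer = "dr-scripts.example.com/keep-until-vrg-deleted"

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		panic(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		panic(err)
	}
	ctx := context.TODO()

	recipe := &unstructured.Unstructured{}
	recipe.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "ramendr.openshift.io", Version: "v1alpha1", Kind: "Recipe",
	})
	if err := c.Get(ctx, client.ObjectKey{Namespace: "kafka", Name: "recipe-kafka"}, recipe); err != nil {
		panic(err)
	}

	// Pin the recipe: with a finalizer present, a delete only marks it
	// for deletion instead of removing it while the VRG still needs it.
	controllerutil.AddFinalizer(recipe, recipeFinalizer)
	if err := c.Update(ctx, recipe); err != nil {
		panic(err)
	}

	// Once the VRG is finally deleted, the script would release it:
	// controllerutil.RemoveFinalizer(recipe, recipeFinalizer) followed
	// by another c.Update(ctx, recipe).
}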
Labeled low priority because of the workaround of creating a dummy volume.
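For reference, a sketch of that dummy-volume workaround with client-go; the PVC name, labels, size, and storage class below are assumptions for illustration (and since spec.pvcSelector is empty in the VRG above, any PVC in the namespace would match):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	sc := "ocs-external-storagecluster-ceph-rbd" // assumed storage class
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "dummy-vrg-placeholder",
			Namespace: "kafka",
			// Labels only matter if the VRG's pvcSelector is non-empty.
			Labels: map[string]string{"app": "dummy"},
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// Note: on k8s.io/api v0.29+ this field's type is
			// corev1.VolumeResourceRequirements instead.
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
			StorageClassName: &sc,
		},
	}

	// Create the placeholder PVC so the VRG has at least one PVC to find.
	if _, err := cs.CoreV1().PersistentVolumeClaims("kafka").
		Create(context.TODO(), pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}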
I tested failover/failback of the CPD ibm-common-services namespace, which contains no PVs. After creating the VRG, although all the cluster resources have been protected, the state does not go to Primary.