RamenDR / ramen

Propagate VR condition error message to protected pvc conditions #1639

Closed nirs closed 1 week ago

nirs commented 2 weeks ago

When a VR condition is not met, we set the protected PVC condition message using the error message returned from isVRConditionMet(). When using csi-addons > 0.10.0, we now use the message from the condition instead of the default message.

Since the Validated condition is not reported by older versions of csi-addons, and we must wait until the Validated condition status is known when the VRG is deleted, isVRConditionMet() now also returns the state of the condition, which can be:

When we check the Validated condition, we have these cases:

Example protected pvc DataReady condition with propagated message when VR validation failed:

conditions:
  - lastTransitionTime: "2024-11-06T15:33:06Z"
    message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
      desc = system is not in a state required for the operation''s execution:
      failed to enable mirroring on image "replicapool/csi-vol-fe2ca7f8-713c-4c51-bf52-0d4b2c11d329":
      parent image "replicapool/csi-snap-e2114105-b451-469b-ad97-eb3cbe2af54e"
      is not enabled for mirroring'
    observedGeneration: 1
    reason: Error
    status: "False"
    type: DataReady
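
A minimal Go sketch of the idea described above: checking one VR condition and returning both its state and the message to propagate. This is not the actual isVRConditionMet() implementation. The `Condition` struct only mirrors the `metav1.Condition` fields used here, and `checkCondition`, `conditionState`, and the four state names are hypothetical, since the states are not listed in this description.

```go
package main

import "fmt"

// Condition mirrors the metav1.Condition fields used in this sketch.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	ObservedGeneration int64
	Message            string
}

// conditionState is a hypothetical classification; the names are assumptions.
type conditionState string

const (
	stateMissing conditionState = "missing" // not reported, e.g. by older csi-addons
	stateStale   conditionState = "stale"   // observedGeneration does not match
	stateUnknown conditionState = "unknown" // status is "Unknown"
	stateKnown   conditionState = "known"   // status is "True" or "False"
)

// checkCondition returns the state of the condition, whether it has the
// wanted status, and a message. When the condition is known but not met,
// the message comes from the condition itself, falling back to a default
// message only if the condition message is empty.
func checkCondition(conds []Condition, generation int64, condType, want string) (conditionState, bool, string) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		switch {
		case c.ObservedGeneration != generation:
			return stateStale, false, fmt.Sprintf("%s condition is stale", condType)
		case c.Status == "Unknown":
			return stateUnknown, false, fmt.Sprintf("%s condition is unknown", condType)
		case c.Status != want:
			if c.Message != "" {
				return stateKnown, false, c.Message
			}
			return stateKnown, false, fmt.Sprintf("%s condition is not met", condType)
		}
		return stateKnown, true, ""
	}
	return stateMissing, false, fmt.Sprintf("%s condition is missing", condType)
}

func main() {
	conds := []Condition{{
		Type: "Validated", Status: "False", ObservedGeneration: 1,
		Message: "failed to meet prerequisite",
	}}
	state, met, msg := checkCondition(conds, 1, "Validated", "True")
	fmt.Println(state, met, msg)
}
```

Returning the state separately from the boolean lets the caller tell "condition failed" apart from "condition not reported yet", which matters when waiting for Validated during VRG deletion.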

[!NOTE] Tested using a development build of csi-addons. We don't depend on a csi-addons release to merge this fix, but it will be effective only when using a csi-addons version that includes this change: https://github.com/csi-addons/kubernetes-csi-addons/pull/691

nirs commented 2 weeks ago

Works for VR conditions other than Validated. We propagate the messages from the VR conditions:

VR:

  status:
    conditions:
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: failed to promote volume
      observedGeneration: 1
      reason: FailedToPromote
      status: "False"
      type: Completed
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: failed to enable volume replication
      observedGeneration: 1
      reason: Error
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: volume is not resyncing
      observedGeneration: 1
      reason: NotResyncing
      status: "False"
      type: Resyncing
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
        desc = system is not in a state required for the operation''s execution: failed
        to enable mirroring on image "replicapool/csi-vol-f4737b6e-eeff-4137-8248-301cf37a3368":
        parent image "replicapool/csi-snap-e7c91292-a272-4278-9ee9-6be7a4c8bfe0" is
        not enabled for mirroring'
      observedGeneration: 1
      reason: PrerequisiteNotMet
      status: "False"
      type: Validated

VRG:

    protectedPVCs:
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2024-11-05T14:32:43Z"
        message: failed to promote volume
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T14:32:44Z"
        message: PV cluster data already protected for PVC restored-pvc
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2024-11-05T14:32:44Z"
        message: failed to promote volume
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected

Missing change: when the Validated condition is False, we want to set the DataReady and DataProtected conditions using the error message from the Validated condition. Currently we use the Validated condition only for checking if the VR is finished and can be removed.

nirs commented 2 weeks ago

Propagation of the message to protected PVCs now works for all VR conditions:

    protectedPVCs:
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2024-11-05T16:36:15Z"
        message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
          desc = system is not in a state required for the operation''s execution:
          failed to enable mirroring on image "replicapool/csi-vol-348f65fd-c658-4764-b7e7-85c45974e97e":
          parent image "replicapool/csi-snap-1ef6bed0-57e3-458f-8a99-413b823dde59"
          is not enabled for mirroring'
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T16:36:16Z"
        message: PV cluster data already protected for PVC restored-pvc
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2024-11-05T16:36:15Z"
        message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
          desc = system is not in a state required for the operation''s execution:
          failed to enable mirroring on image "replicapool/csi-vol-348f65fd-c658-4764-b7e7-85c45974e97e":
          parent image "replicapool/csi-snap-1ef6bed0-57e3-458f-8a99-413b823dde59"
          is not enabled for mirroring'
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected
      csiProvisioner: rook-ceph.rbd.csi.ceph.com
      labels:
        appname: busybox
        ramendr.openshift.io/owner-name: flatten-drpc
        ramendr.openshift.io/owner-namespace-name: ramen-ops
      name: restored-pvc
      namespace: flatten
      replicationID:
        id: ""
      resources:
        requests:
          storage: 1Gi
      storageClassName: rook-ceph-block
      storageID:
        id: rook-ceph-dr1-1

But we have 25 failing unit tests; we need to understand why they fail.

nirs commented 2 weeks ago

We don't propagate the protected PVC conditions to the DRPC, so on the hub this does not help to debug the issue.

Maybe we can add a list of error messages from the protected PVCs to make it easier to debug.

  status:
    actionDuration: 23.105201755s
    actionStartTime: "2024-11-05T17:02:08Z"
    conditions:
    - lastTransitionTime: "2024-11-05T17:02:01Z"
      message: Initial deployment completed
      observedGeneration: 1
      reason: Deployed
      status: "True"
      type: Available
    - lastTransitionTime: "2024-11-05T17:02:01Z"
      message: Ready
      observedGeneration: 1
      reason: Success
      status: "True"
      type: PeerReady
    - lastTransitionTime: "2024-11-05T17:02:02Z"
      message: VolumeReplicationGroup (ramen-ops/flatten-drpc) on cluster dr1 is reporting
        errors (All PVCs of the VolumeReplicationGroup are not ready) readying workload
        data, retrying till DataReady condition is met
      observedGeneration: 1
      reason: Error
      status: "False"
      type: Protected
    lastKubeObjectProtectionTime: "2024-11-05T17:02:04Z"
    lastUpdateTime: "2024-11-05T17:02:31Z"
    observedGeneration: 1
    phase: Deployed
    preferredDecision:
      clusterName: dr1
      clusterNamespace: dr1
    progression: Completed
    resourceConditions:
      conditions:
      - lastTransitionTime: "2024-11-05T17:02:02Z"
        message: All PVCs of the VolumeReplicationGroup are not ready
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T17:02:02Z"
        message: All PVCs of the VolumeReplicationGroup are not ready
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2024-11-05T17:02:01Z"
        message: Nothing to restore
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2024-11-05T17:02:04Z"
        message: Cluster data of all PVs are protected. Kube objects protected. Kube
          objects protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      resourceMeta:
        generation: 1
        kind: VolumeReplicationGroup
        name: flatten-drpc
        namespace: ramen-ops
        protectedpvcs:
        - restored-pvc
        resourceVersion: "15650"
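
The suggestion above, listing error messages from protected PVCs, could look roughly like the following Go sketch. It is only an illustration: `ProtectedPVC`, `Condition`, and `pvcErrorMessages` are hypothetical names, not Ramen types or functions, and treating every condition with status "False" as an error is an assumption based on the examples in this thread.

```go
package main

import "fmt"

// Condition holds the condition fields used in this sketch.
type Condition struct {
	Type    string
	Status  string
	Message string
}

// ProtectedPVC is a simplified, hypothetical stand-in for a protected PVC
// entry in the VRG status.
type ProtectedPVC struct {
	Namespace  string
	Name       string
	Conditions []Condition
}

// pvcErrorMessages returns one "namespace/name: message" line for each
// distinct message found in a PVC's conditions with status "False".
// Identical messages on the same PVC (for example in DataReady and
// DataProtected) are reported once.
func pvcErrorMessages(pvcs []ProtectedPVC) []string {
	var out []string
	for _, pvc := range pvcs {
		seen := map[string]bool{}
		for _, c := range pvc.Conditions {
			if c.Status != "False" || seen[c.Message] {
				continue
			}
			seen[c.Message] = true
			out = append(out, fmt.Sprintf("%s/%s: %s", pvc.Namespace, pvc.Name, c.Message))
		}
	}
	return out
}

func main() {
	pvcs := []ProtectedPVC{{
		Namespace: "flatten",
		Name:      "restored-pvc",
		Conditions: []Condition{
			{Type: "DataReady", Status: "False", Message: "failed to promote volume"},
			{Type: "ClusterDataProtected", Status: "True", Message: "PV cluster data already protected for PVC restored-pvc"},
			{Type: "DataProtected", Status: "False", Message: "failed to promote volume"},
		},
	}}
	for _, line := range pvcErrorMessages(pvcs) {
		fmt.Println(line)
	}
}
```

A list like this could be surfaced on the hub next to the generic "All PVCs of the VolumeReplicationGroup are not ready" message, so the underlying VR error is visible without inspecting the VRG on the managed cluster.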

yati1998 commented 2 weeks ago

@nirs I am not very sure what the final VRG will look like. Can you please point to the relevant comment above?

nirs commented 2 weeks ago

> @nirs I am not very sure what the final VRG will look like. Can you please point to the relevant comment above?

This comment shows the change in the VRG: https://github.com/RamenDR/ramen/pull/1639#issuecomment-2457352330

yati1998 commented 2 weeks ago

LGTM, except that currently we are just setting the error message on DataProtected and DataReady, but they should be populated with messages based on the various VR conditions. @nirs If I am not wrong, you are planning to bring this change later as a bug fix, right?

nirs commented 2 weeks ago

> LGTM, except that currently we are just setting the error message on DataProtected and DataReady, but they should be populated with messages based on the various VR conditions. @nirs If I am not wrong, you are planning to bring this change later as a bug fix, right?

Setting the error message is the purpose of this change. In the normal case we know the exact state, so using the message from the VR is not very useful. We may simplify the code later to just use the message from the VR.

Another issue is that we duplicate the content of the DataReady and DataProtected conditions, which does not seem right, but this is not a new issue, and changing it is not in the scope of this change.

yati1998 commented 2 weeks ago

@nirs can we get one more approval on this and merge it?

nirs commented 1 week ago

> @nirs can we get one more approval on this and merge it?

We don't need more approvals. We kept it open to give more reviewers time to look.