dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: Replication Failover/Reprotect operations has "Error" under State field in the ReplicationGroup #1445

Closed anandhg02 closed 1 month ago

anandhg02 commented 2 months ago

I have implemented CSM Replication for PowerMAX between two OCP clusters and am using the repctl utility for the Replication Failover/Reprotect operations. The replication operations all work as expected, which can be verified using SRDF. However, the ReplicationGroup (RG) shows "Error" in its State field, as seen below.

[user01@csahn01 csm]$ repctl get rg
[2024-08-30 13:44:29]  INFO listing replication groups
[2024-08-30 13:44:29]  INFO Cluster: ocps1
+-----+
| RG  |
+-----+
Name                                    State   rClusterID      Driver                          RemoteRG                                IsSource        LinkState
rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d Error   ocps2           csi-powermax.dellemc.com        rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d false           SUSPENDED
[2024-08-30 13:44:29]  INFO
[2024-08-30 13:44:29]  INFO Cluster: ocps2
+-----+
| RG  |
+-----+
Name                                    State   rClusterID      Driver                          RemoteRG                                IsSource        LinkState
rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d Error   ocps1           csi-powermax.dellemc.com        rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d true            SUSPENDED

The describe output of the RG shows the events below. I couldn't find a globalID parameter in the protection group attributes. What is causing this error?

[user01@csahn01 csm]$ oc describe rg rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d
.....
....
  Warning  Error    5m4s                  dell-csi-replicator          Action [FAILOVER_REMOTE] on DellCSIReplicationGroup [rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d] failed with error [rpc error: code = InvalidArgument desc = missing globalID in protection group attributes]
  Warning  Updated  4m39s (x2 over 5m4s)  dell-replication-controller  failed to process the last action Action FAILOVER_REMOTE failed with error rpc error: code = InvalidArgument desc = missing globalID in protection group attributes
[user01@csahn01 csm]$

Version Details

OCP: v4.14
CSI: v2.9.1
Replication Module: v1.7.1

khareRajshree commented 2 months ago

Hi @anandhg02, can you share more details of the storage class used for replication, and the full output of the command oc describe rg rg-5b85f807-2fb2-46a2-af2f-7e3c4541f81d for the RG? Thanks.

anandhg02 commented 2 months ago

Hello Rajshree, just for testing, I deleted the previous RG and recreated it with the new replication, and I am still getting the same error.

The RG on which I am currently testing the replication operations is rg-47d46288-c551-4e73-b8a9-b41113248b3f.

[corood@csahn01 repctl]$ ./repctl get rg
[2024-09-09 22:25:56]  INFO listing replication groups
[2024-09-09 22:25:56]  INFO Cluster: ocps1
+-----+
| RG  |
+-----+
Name                                    State   rClusterID      Driver                          RemoteRG                                IsSource        LinkState
rg-47d46288-c551-4e73-b8a9-b41113248b3f Error   ocps2           csi-powermax.dellemc.com        rg-47d46288-c551-4e73-b8a9-b41113248b3f true            SYNCHRONIZED
rg-790a9d36-b593-4936-869d-317eb56018b0 Error   ocps2           csi-isilon.dellemc.com          rg-790a9d36-b593-4936-869d-317eb56018b0 false           FAILEDOVER
rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 Ready   ocps2           csi-powermax.dellemc.com        rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 true            SYNCHRONIZED
[2024-09-09 22:25:56]  INFO
[2024-09-09 22:25:56]  INFO Cluster: ocps2
+-----+
| RG  |
+-----+
Name                                    State   rClusterID      Driver                          RemoteRG                                IsSource        LinkState
rg-47d46288-c551-4e73-b8a9-b41113248b3f Error   ocps1           csi-powermax.dellemc.com        rg-47d46288-c551-4e73-b8a9-b41113248b3f false           SYNCHRONIZED
rg-790a9d36-b593-4936-869d-317eb56018b0 Error   ocps1           csi-isilon.dellemc.com          rg-790a9d36-b593-4936-869d-317eb56018b0 true            FAILEDOVER
rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 Ready   ocps1           csi-powermax.dellemc.com        rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 false           SYNCHRONIZED
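
For longer listings, the rows in Error state can be picked out with a few lines of scripting. The following is only a minimal sketch: it assumes the whitespace-separated column layout shown in the repctl output above, and the names `COLUMNS` and `parse_rows` are made up for illustration, not part of repctl.

```python
# Column names as they appear in the header row of `repctl get rg`.
COLUMNS = ["Name", "State", "rClusterID", "Driver", "RemoteRG", "IsSource", "LinkState"]


def parse_rows(listing: str) -> list:
    """Return one dict per data row; log lines, borders and headers are skipped."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # A data row has one field per column and starts with an RG name.
        if len(fields) == len(COLUMNS) and fields[0].startswith("rg-"):
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Sample text copied from the ocps1 listing above.
sample = """\
[2024-09-09 22:25:56]  INFO Cluster: ocps1
Name                                    State   rClusterID      Driver                          RemoteRG                                IsSource        LinkState
rg-47d46288-c551-4e73-b8a9-b41113248b3f Error   ocps2           csi-powermax.dellemc.com        rg-47d46288-c551-4e73-b8a9-b41113248b3f true            SYNCHRONIZED
rg-790a9d36-b593-4936-869d-317eb56018b0 Error   ocps2           csi-isilon.dellemc.com          rg-790a9d36-b593-4936-869d-317eb56018b0 false           FAILEDOVER
rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 Ready   ocps2           csi-powermax.dellemc.com        rg-ca6cc10f-1ac4-43a5-a673-cdeff8a45a17 true            SYNCHRONIZED
"""

errors = [(r["Name"], r["Driver"]) for r in parse_rows(sample) if r["State"] == "Error"]
for name, driver in errors:
    print(name, driver)
```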

Requested output of the oc describe rg command:

[corood@csahn01 repctl]$ oc describe rg rg-47d46288-c551-4e73-b8a9-b41113248b3f
Name:         rg-47d46288-c551-4e73-b8a9-b41113248b3f
Namespace:
Labels:       replication.storage.dell.com/RdfGroup=12
              replication.storage.dell.com/RdfMode=SYNC
              replication.storage.dell.com/RemoteRDFGroup=12
              replication.storage.dell.com/RemoteSYMID=000220002171
              replication.storage.dell.com/SYMID=000220002131
              replication.storage.dell.com/driverName=csi-powermax.dellemc.com
              replication.storage.dell.com/remoteClusterID=ocps2
Annotations:  Action:
                {"name":"REPROTECT_LOCAL","completed":true,"finalError":"rpc error: code = InvalidArgument desc = missing globalID in protection group att...
              replication.storage.dell.com/actionProcessedTime:
              replication.storage.dell.com/contextPrefix: powermax
              replication.storage.dell.com/remoteClusterID: ocps2
              replication.storage.dell.com/remoteRGRetentionPolicy: delete
              replication.storage.dell.com/remoteReplicationGroupName: rg-47d46288-c551-4e73-b8a9-b41113248b3f
              replication.storage.dell.com/rg_sync_complete: yes
API Version:  replication.storage.dell.com/v1
Kind:         DellCSIReplicationGroup
Metadata:
  Creation Timestamp:  2024-09-01T00:38:47Z
  Finalizers:
    replication.storage.dell.com/replicationProtection
    replication.storage.dell.com/replicationSyncProtection
  Generation:  5
  Managed Fields:
    API Version:  replication.storage.dell.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:replication.storage.dell.com/remoteReplicationGroupName:
          f:replication.storage.dell.com/rg_sync_complete:
        f:finalizers:
          v:"replication.storage.dell.com/replicationSyncProtection":
    Manager:      dell-replication-controller
    Operation:    Update
    Time:         2024-09-01T00:38:47Z
    API Version:  replication.storage.dell.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:Action:
          f:replication.storage.dell.com/actionProcessedTime:
          f:replication.storage.dell.com/contextPrefix:
          f:replication.storage.dell.com/remoteClusterID:
          f:replication.storage.dell.com/remoteRGRetentionPolicy:
        f:finalizers:
          .:
          v:"replication.storage.dell.com/replicationProtection":
        f:labels:
          .:
          f:replication.storage.dell.com/RdfGroup:
          f:replication.storage.dell.com/RdfMode:
          f:replication.storage.dell.com/RemoteRDFGroup:
          f:replication.storage.dell.com/RemoteSYMID:
          f:replication.storage.dell.com/SYMID:
          f:replication.storage.dell.com/driverName:
          f:replication.storage.dell.com/remoteClusterID:
      f:spec:
        .:
        f:action:
        f:driverName:
        f:protectionGroupAttributes:
          .:
          f:powermax/RdfGroup:
          f:powermax/RdfMode:
          f:powermax/RemoteRDFGroup:
          f:powermax/RemoteSYMID:
          f:powermax/SYMID:
        f:protectionGroupId:
        f:remoteClusterId:
        f:remoteProtectionGroupAttributes:
          .:
          f:powermax/RdfGroup:
          f:powermax/RdfMode:
          f:powermax/RemoteRDFGroup:
          f:powermax/RemoteSYMID:
          f:powermax/SYMID:
        f:remoteProtectionGroupId:
    Manager:      dell-csi-replicator
    Operation:    Update
    Time:         2024-09-09T14:24:41Z
    API Version:  replication.storage.dell.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:lastAction:
          .:
          f:condition:
          f:errorMessage:
          f:firstFailure:
          f:time:
        f:replicationLinkState:
          .:
          f:isSource:
          f:lastSuccessfulUpdate:
          f:state:
        f:state:
    Manager:         dell-csi-replicator
    Operation:       Update
    Subresource:     status
    Time:            2024-09-09T14:28:02Z
  Resource Version:  23465070
  UID:               09559cd0-5976-4954-947b-946821d25921
Spec:
  Action:
  Driver Name:  csi-powermax.dellemc.com
  Protection Group Attributes:
    powermax/RdfGroup:        12
    powermax/RdfMode:         SYNC
    powermax/RemoteRDFGroup:  12
    powermax/RemoteSYMID:     000220002171
    powermax/SYMID:           000220002131
  Protection Group Id:        csi-rep-sg-postgres-sts-12-SYNC
  Remote Cluster Id:          ocps2
  Remote Protection Group Attributes:
    powermax/RdfGroup:         12
    powermax/RdfMode:          SYNC
    powermax/RemoteRDFGroup:   12
    powermax/RemoteSYMID:      000220002131
    powermax/SYMID:            000220002171
  Remote Protection Group Id:  csi-rep-sg-postgres-sts-12-SYNC
Status:
  Conditions:
    Condition:  Replication Link State:IsSource changed from (false) to (true)
    Time:       2024-09-09T14:25:02Z
    Condition:  Action REPROTECT_LOCAL failed with error rpc error: code = InvalidArgument desc = missing globalID in protection group attributes
    Time:       2024-09-09T14:24:41Z
    Condition:  Replication Link State:IsSource changed from (true) to (false)
    Time:       2024-09-09T14:20:02Z
    Condition:  Action FAILOVER_REMOTE failed with error rpc error: code = InvalidArgument desc = can't find `systemName` parameter in replication group
    Time:       2024-09-09T14:19:33Z
    Condition:  Replication Link State:IsSource changed from (false) to (true)
    Time:       2024-09-01T00:39:01Z
  Last Action:
    Condition:      Action REPROTECT_LOCAL failed with error rpc error: code = InvalidArgument desc = missing globalID in protection group attributes
    Error Message:  rpc error: code = InvalidArgument desc = missing globalID in protection group attributes
    First Failure:  2024-09-09T14:24:41Z
    Time:           2024-09-09T14:24:41Z
  Replication Link State:
    Is Source:               true
    Last Successful Update:  2024-09-09T14:28:02Z
    State:                   SYNCHRONIZED
  State:                     Error
Events:
  Type     Reason   Age                   From                         Message
  ----     ------   ----                  ----                         -------
  Warning  Error    9m1s                  dell-csi-replicator          Action [FAILOVER_REMOTE] on DellCSIReplicationGroup [rg-47d46288-c551-4e73-b8a9-b41113248b3f] failed with error [rpc error: code = InvalidArgument desc = can't find `systemName` parameter in replication group]
  Warning  Error    9m1s                  dell-csi-replicator          Action [FAILOVER_REMOTE] on DellCSIReplicationGroup [rg-47d46288-c551-4e73-b8a9-b41113248b3f] failed with error [rpc error: code = InvalidArgument desc = missing globalID in protection group attributes]
  Warning  Updated  3m53s (x8 over 9m1s)  dell-replication-controller  failed to process the last action Action FAILOVER_REMOTE failed with error rpc error: code = InvalidArgument desc = can't find `systemName` parameter in replication group
  Warning  Error    3m53s                 dell-csi-replicator          Action [REPROTECT_LOCAL] on DellCSIReplicationGroup [rg-47d46288-c551-4e73-b8a9-b41113248b3f] failed with error [rpc error: code = InvalidArgument desc = missing globalID in protection group attributes]
  Warning  Error    3m53s                 dell-csi-replicator          Action [REPROTECT_LOCAL] on DellCSIReplicationGroup [rg-47d46288-c551-4e73-b8a9-b41113248b3f] failed with error [rpc error: code = InvalidArgument desc = can't find `systemName` parameter in replication group]
  Warning  Updated  32s (x5 over 3m53s)   dell-replication-controller  failed to process the last action Action REPROTECT_LOCAL failed with error rpc error: code = InvalidArgument desc = missing globalID in protection group attributes
[corood@csahn01 repctl]$

santhoshatdell commented 2 months ago

Hi @anandhg02, do you also have the PowerStore driver installed with replication enabled?

The errors that you see are actually from different drivers. For instance, 'can't find systemName parameter' is from isilon and 'missing globalID in protection group' is from powerstore. We have not tested the replication module with multiple drivers installed. I would suggest installing only the powermax driver and testing again. Thanks!
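
The attribution described above can be applied to the event messages mechanically. This is only an illustrative sketch: the substring-to-driver mapping is taken from this comment, while `ERROR_ORIGIN` and `attribute_error` are hypothetical names and not part of any Dell tool.

```python
# Error fragment -> driver that emits it, as stated in the comment above.
ERROR_ORIGIN = {
    "can't find systemName parameter": "isilon",
    "missing globalID in protection group": "powerstore",
}


def attribute_error(message: str) -> str:
    """Guess which driver produced an event message; 'unknown' if no fragment matches."""
    # The events quote systemName in backticks, so strip them before matching.
    normalized = message.replace("`", "")
    for fragment, driver in ERROR_ORIGIN.items():
        if fragment in normalized:
            return driver
    return "unknown"


# Messages copied from the describe output earlier in this thread.
events = [
    "rpc error: code = InvalidArgument desc = can't find `systemName` parameter in replication group",
    "rpc error: code = InvalidArgument desc = missing globalID in protection group attributes",
]

origins = [attribute_error(e) for e in events]
print(origins)
```

Neither message maps to powermax, which is consistent with the SRDF operations themselves succeeding.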

anandhg02 commented 2 months ago

Hi @santhoshatdell ,

Yes we do have PowerMAX/PowerStore/PowerScale drivers installed in this OCP cluster. The business requirement for this OCP cluster is to provision PVs from multiple storage tiers: Tier1 from PowerMAX, Tier2 from PowerStore, Tier3 from PowerScale.

But I am curious: why would PowerStore and PowerScale errors be reported under an RG that uses the csi-powermax.dellemc.com driver? The errors appear only when I perform Failover/Reprotect on the PowerMAX RG; no errors are reported while setting up a new RG.

santhoshatdell commented 1 month ago

Our initial investigation indicates that the replicator sidecar in each of the installed driver pods might process the same RG, which leads to these errors. In other words, a sidecar does not ignore RGs that belong to other drivers.
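
To illustrate the kind of check that appears to be missing, here is a minimal sketch of a per-driver filter. It is not the actual replicator sidecar code: `ReplicationGroup` and `should_process` are hypothetical names, and the only assumption taken from this thread is that each RG carries a driver name (shown as Driver Name in the describe output).

```python
from dataclasses import dataclass


@dataclass
class ReplicationGroup:
    name: str
    driver_name: str  # mirrors the "Driver Name" field of the RG spec


def should_process(rg: ReplicationGroup, sidecar_driver: str) -> bool:
    """A sidecar would act only on RGs owned by its own driver."""
    return rg.driver_name == sidecar_driver


# RG names and drivers copied from the repctl listing earlier in this thread.
rgs = [
    ReplicationGroup("rg-47d46288-c551-4e73-b8a9-b41113248b3f", "csi-powermax.dellemc.com"),
    ReplicationGroup("rg-790a9d36-b593-4936-869d-317eb56018b0", "csi-isilon.dellemc.com"),
]

# Without the filter, every sidecar would see both RGs; with it, each sees only its own.
handled = {
    sidecar: [rg.name for rg in rgs if should_process(rg, sidecar)]
    for sidecar in ("csi-powermax.dellemc.com", "csi-isilon.dellemc.com")
}
for sidecar, names in handled.items():
    print(sidecar, "->", names)
```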

anandhg02 commented 1 month ago

I don't think I have a choice other than to use all three drivers (PowerMax/PowerStore/PowerScale) in the same OCP cluster for provisioning across multiple storage tiers.

>>> For instance, 'can't find systemName parameter' is from isilon and 'missing globalID in protection group' is from powerstore.

Can we determine what is causing the PowerStore and PowerScale errors quoted above?

hoppea2 commented 1 month ago

/sync

csmbot commented 1 month ago

link: 28173

anandhg02 commented 1 month ago

Hi Team, is there any update on this issue?

khareRajshree commented 1 month ago

Hi @anandhg02, this will be taken up as a feature on our roadmap. Closing this issue for now, as we have updated the respective documentation. Thanks.

anandhg02 commented 1 month ago

Hi @khareRajshree, noted on the roadmap. Can you confirm whether installing multiple CSI drivers (PMAX/PSTR/PSCALE) in the same OCP cluster is supported? I don't see any document that says only one CSI driver with replication may be installed per OCP cluster.

shanmydell commented 4 weeks ago

https://github.com/dell/csm/issues/1511 has been added to our roadmap to address the issue described here.