NetApp / trident

Storage orchestrator for containers
Apache License 2.0
732 stars 218 forks source link

Failed iSCSI Login When Portal Is Down #909

Open sfr-financial-com opened 4 days ago

sfr-financial-com commented 4 days ago

Describe the bug After a hardware failure on our NetAPP ONTAP 9.10 system, one of the two IPs used for iSCSI is down. While existing pods continue to work normally (due to TCP multipathing), creating new pods fails:

time="2024-07-01T08:56:39Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI workflow="node_server=stage"
time="2024-07-01T08:56:49Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI workflow="node_server=stage"
time="2024-07-01T08:56:49Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI workflow="node_server=stage"
time="2024-07-01T08:56:49Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI workflow="node_server=stage"
time="2024-07-01T08:56:55Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI workflow="node_server=stage"
time="2024-07-01T08:56:55Z" level=error msg="GRPC error: rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=28278172-b968-4ba9-bd3c-c91321f5a9da requestSource=CSI

Environment Provide accurate information about the environment to help us reproduce the issue.

To Reproduce Shutdown one of the two nodes in a NetApp.

Expected behavior New PVCs should be created even if iSCSI login into one portal fails.

Additional context Yeah, I know, hardware failures like that are edge cases.

rohit-arora-dev commented 3 days ago

The above logs are seen at the time of NodeStageVolume, which means a PV already exists at this point and is bound to a PVC.

At the time of NodeStageVolume, Trident attempts to log in to iSCSI portals. In old versions of Trident the NodeStageVolume attempts would fail even if one of the portals were down or inaccessible however, the behavior was enhanced in newer versions of Trident where even if one of the portals is up and running it would complete NodeStageVolume operation. I am not 100% sure but I think the behavior was enhanced in 23.01 or 23.04.

sfr-financial-com commented 3 days ago

Hi Rohit, thanks for the quick turnaround. I uninstalled 22.0.10 and installed 24.02.0, but it doesn't look like that's improved things:

time="2024-07-02T07:05:15Z" level=info msg="Updated Trident controller with node registration." logLayer=csi_frontend node=muc1pro-k8s-worker-1.k8s.muc1pro.financial.com requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal  to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal  to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal  to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal  to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="iSCSI self-healing is enabled." iSCSISelfHealingInterval=5m0s iSCSISelfHealingWaitTime=7m0s logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="NVMe self-healing is enabled." NVMeSelfHealingInterval=5m0s logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2024-07-02T07:05:52Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:07Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:08Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:23Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:23Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI
time="2024-07-02T07:06:28Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:34Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:41Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:48Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:48Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI
time="2024-07-02T07:08:25Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:40Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:41Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:56Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:56Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI
time="2024-07-02T07:09:02Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:08Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:14Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:22Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:22Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI

Of the two portals, 10.0.147.252 is down, while 10.0.147.251 is still up, however, I don't see any attempts to log into the working portal, either:

[root@muc1pro-k8s-worker-3 ~]# iscsiadm -m node
10.0.147.251:3260,1033 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11
10.0.147.252:3260,1034 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11
[root@muc1pro-k8s-worker-3 ~]# multipath -ll
3600a098038304476785d4f6d6769634a dm-6 NETAPP,LUN C-Mode
size=93G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  `- 2:0:0:0 sdc 8:32 failed faulty running
3600a098038304476785d4f6d67696356 dm-7 NETAPP,LUN C-Mode
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  `- 2:0:0:1 sdd 8:48 failed faulty running
sfr-financial-com commented 3 days ago

I just updated to 24.06.0 and also deleted the existing PVC for an unimportant workload, and this time, the PVC/PV creation completed successfully. However, trying to connect an already existing PVC (that wasn't used in about 9 months) failed:

  Normal   Scheduled               31s   default-scheduler        Successfully assigned default/storagetest-muc1pro-iscsi to muc1pro-k8s-worker-3.k8s.muc1pro.financial.com
  Normal   SuccessfulAttachVolume  31s   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-fe961121-673a-4978-8f2d-2d16abd80ee9"
  Warning  FailedMount             2s    kubelet                  MountVolume.MountDevice failed for volume "pvc-fe961121-673a-4978-8f2d-2d16abd80ee9" : rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout