Closed by sfr-financial-com 3 months ago
The above logs are seen at the time of NodeStageVolume, which means a PV already exists at this point and is bound to a PVC.
At the time of NodeStageVolume, Trident attempts to log in to the iSCSI portals. In older versions of Trident, NodeStageVolume would fail if even one of the portals was down or inaccessible. This behavior was improved in newer versions: as long as at least one portal is up and reachable, the NodeStageVolume operation completes. I am not 100% sure, but I believe the change landed in 23.01 or 23.04.
Hi Rohit, thanks for the quick turnaround. I uninstalled 22.10.0 and installed 24.02.0, but it doesn't look like that improved things:
time="2024-07-02T07:05:15Z" level=info msg="Updated Trident controller with node registration." logLayer=csi_frontend node=muc1pro-k8s-worker-1.k8s.muc1pro.financial.com requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Portal '' is missing target IQN; it may not be considered for iSCSI self-healing." logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=error msg="Failed to add portal to self-healing map; err: portal value cannot be empty" logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="iSCSI self-healing is enabled." iSCSISelfHealingInterval=5m0s iSCSISelfHealingWaitTime=7m0s logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="NVMe self-healing is enabled." NVMeSelfHealingInterval=5m0s logLayer=csi_frontend requestID=6279c13c-7391-4ee8-9c91-ababd5472da4 requestSource=Internal workflow="plugin=activate"
time="2024-07-02T07:05:15Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2024-07-02T07:05:52Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:02Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:07Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:08Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:18Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:23Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:23Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=d0d12556-7046-4386-95b8-e5e6598705b1 requestSource=CSI
time="2024-07-02T07:06:28Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:34Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:41Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:48Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:06:48Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=cf709c1e-ba6b-4381-917e-04a0b7f43dfe requestSource=CSI
time="2024-07-02T07:08:25Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:35Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:40Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:41Z" level=info msg="Target exists already" Target="iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=iscsiadm requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="Error logging in to iSCSI target." error="process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:51Z" level=error msg="Login to iSCSI target failed." err="process killed after timeout" logLayer=csi_frontend portalIP=10.0.147.252 requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:56Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:08:56Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=393bea49-3e5b-4ef8-897f-c715ee9d49e1 requestSource=CSI
time="2024-07-02T07:09:02Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:08Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:14Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:22Z" level=error msg="process killed after timeout" logLayer=csi_frontend process=blkid requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI workflow="node_server=stage"
time="2024-07-02T07:09:22Z" level=error msg="GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout" logLayer=csi_frontend requestID=e28d8123-b2d6-4fff-916b-4744b0d91b90 requestSource=CSI
Of the two portals, 10.0.147.252 is down, while 10.0.147.251 is still up. However, I don't see any attempts to log in to the working portal either:
[root@muc1pro-k8s-worker-3 ~]# iscsiadm -m node
10.0.147.251:3260,1033 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11
10.0.147.252:3260,1034 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11
[root@muc1pro-k8s-worker-3 ~]# multipath -ll
3600a098038304476785d4f6d6769634a dm-6 NETAPP,LUN C-Mode
size=93G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
`- 2:0:0:0 sdc 8:32 failed faulty running
3600a098038304476785d4f6d67696356 dm-7 NETAPP,LUN C-Mode
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
`- 2:0:0:1 sdd 8:48 failed faulty running
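For reference, one way to work out which of the recorded portals is still worth retrying is to filter the `iscsiadm -m node` listing against the known-dead address. The snippet below is a hypothetical helper, not anything Trident itself runs; the sample text and the dead IP are taken from the output above, and the actual login retry is left commented out because it needs root and a reachable target:

```shell
# Sketch: pick the surviving portal(s) out of an `iscsiadm -m node` listing.
# $nodes mirrors the listing above; $dead_ip is the portal known to be down.
nodes='10.0.147.251:3260,1033 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11
10.0.147.252:3260,1034 iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11'
dead_ip=10.0.147.252

printf '%s\n' "$nodes" |
awk -v dead="$dead_ip" '
    index($1, dead ":") != 1 {   # skip records belonging to the dead portal
        split($1, a, ",")        # strip the ",tpgt" suffix from ip:port,tpgt
        print a[1], $2           # surviving portal plus its target IQN
    }'
# A manual retry against the survivor would then look like (root, live target):
#   iscsiadm -m node -p 10.0.147.251:3260 \
#     -T iqn.1992-08.com.netapp:sn.b062d8411cdd11eaa34500a098bfa9b9:vs.11 --login
```

In a live cluster you would of course feed it real `iscsiadm -m node` output instead of the canned string.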
I just updated to 24.06.0 and also deleted the existing PVC for an unimportant workload, and this time the PVC/PV creation completed successfully. However, attaching an already existing PVC (one that hadn't been used in about 9 months) failed:
Normal Scheduled 31s default-scheduler Successfully assigned default/storagetest-muc1pro-iscsi to muc1pro-k8s-worker-3.k8s.muc1pro.financial.com
Normal SuccessfulAttachVolume 31s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-fe961121-673a-4978-8f2d-2d16abd80ee9"
Warning FailedMount 2s kubelet MountVolume.MountDevice failed for volume "pvc-fe961121-673a-4978-8f2d-2d16abd80ee9" : rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: process killed after timeout
Is there anything I can do to access already existing PVCs again?
@sfr-financial-com We would appreciate it if you could file a support ticket to allow further action and investigation.
Hi @vasum0406, we currently don't have an active support contract, so that's not going to work. However, we managed to replace the defective parts, so the problem no longer occurs. I'm therefore closing this, since I can no longer reproduce it.
Describe the bug After a hardware failure on our NetApp ONTAP 9.10 system, one of the two IPs used for iSCSI is down. While existing pods continue to work normally (thanks to multipathing), creating new pods fails:
Environment
To Reproduce Shut down one of the two nodes in a NetApp cluster.
Expected behavior New PVCs should be created even if the iSCSI login to one portal fails.
Additional context Yeah, I know, hardware failures like that are edge cases.