NetApp / trident

Storage orchestrator for containers
Apache License 2.0
760 stars 222 forks source link

Trident CSI with ontap-san backend does not login to subset of portals reported by SLM #387

Closed kincl closed 4 years ago

kincl commented 4 years ago

Describe the bug We are running Trident with an ontap-san backend connecting to a ONTAP cluster with 12 iSCSI data LIFs (one per node as required) without CHAP auth. When we create a PVC and schedule a pod that mounts it on a node with no existing sessions the CSI driver on the node will create sessions to all portal IPs instead of just the data LIFs that SLM reports. When the pod completes and the PVC is unscheduled from the node the CSI driver will only logout of the data LIFs that SLM is reporting which leaves in our case 10 sessions to wrong data LIFs. The next time a PVC needs to mount the CSI driver will see the existing sessions to the wrong data LIFs and will proceed without logging in which results in the second PVC never mounting.

Environment Provide accurate information about the environment to help us reproduce the issue.

To Reproduce Steps to reproduce the behavior:

Expected behavior A clear and concise description of what you expected to happen.

Remounting a PVC on a host with a new pod should succeed as well as mounting new PVCs on the same host.

More specifically, if we mount a iSCSI PVC we should only login to the portals identified by SLM and we should definitely logout of all sessions that we logged into.

Additional context Add any other context about the problem here.

I believe the issue is at utils/osutils.go#L142, we should be using bkportal instead of portalIps.

Below are logs from the first successful mount on a node with no existing sessions. Note that the GRPC request returns p1 and p2 which are the correct SLM reporting portals and is logged as targetPortals. If I am correct, if we were using CHAP then this would work since the CHAP code correctly uses the SLM reporting portals to connect.

Logs from the CSI driver pod on the first mount:

time="2020-04-22T20:47:48Z" level=debug msg="GRPC call: /csi.v1.Node/NodeStageVolume"
time="2020-04-22T20:47:48Z" level=debug msg="GRPC request: volume_id:\"pvc-c5d1e557-b4a9-4dc1-b22e-4bdf16b996d7\" publish_context:<key:\"filesystemType\" value:\"ext4\" > publish_context:<key:\"iscsiIgroup\" value:\"cluster1-k8s\" > publish_context:<key:\"iscsiInitiatorSecret\" value:\"\" > publish_context:<key:\"iscsiInterface\" value:\"\" > publish_context:<key:\"iscsiLunNumber\" value:\"17\" > publish_context:<key:\"iscsiTargetIqn\" value:\"iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\" > publish_context:<key:\"iscsiTargetPortalCount\" value:\"2\" > publish_context:<key:\"iscsiTargetSecret\" value:\"\" > publish_context:<key:\"iscsiUsername\" value:\"\" > publish_context:<key:\"mountOptions\" value:\"\" > publish_context:<key:\"p1\" value:\"172.30.253.2\" > publish_context:<key:\"p2\" value:\"172.30.253.3\" > publish_context:<key:\"protocol\" value:\"block\" > publish_context:<key:\"sharedTarget\" value:\"true\" > publish_context:<key:\"useCHAP\" value:\"false\" > staging_target_path:\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-c5d1e557-b4a9-4dc1-b22e-4bdf16b996d7/globalmount\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"backendUUID\" value:\"caf545e2-bd42-4b36-90f9-948c98a87dea\" > volume_context:<key:\"internalName\" value:\"trident_cluster1_block_pvc_c5d1e557_b4a9_4dc1_b22e_4bdf16b996d7\" > volume_context:<key:\"name\" value:\"pvc-c5d1e557-b4a9-4dc1-b22e-4bdf16b996d7\" > volume_context:<key:\"protocol\" value:\"block\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1587583872788-8081-csi.trident.netapp.io\" > "
time="2020-04-22T20:47:48Z" level=debug msg="Attempting to acquire shared lock (NodeStageVolume)." lock=csi_node_server
time="2020-04-22T20:47:48Z" level=debug msg="Acquired shared lock (NodeStageVolume)." lock=csi_node_server
time="2020-04-22T20:47:48Z" level=debug msg=">>>> NodeStageVolume" Method=NodeStageVolume Type=CSI_Node
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.AttachISCSIVolume"
time="2020-04-22T20:47:48Z" level=debug msg="Attaching iSCSI volume." fstype=ext4 lunID=17 mountpoint= targetIQN="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31" targetPortals="[172.30.253.2 172.30.253.3]" volume=trident_cluster1_block_pvc_c5d1e557_b4a9_4dc1_b22e_4bdf16b996d7
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.ISCSISupported"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.execCommand." args="[-V]" command=iscsiadm
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.execCommand." command=iscsiadm error="<nil>" output="iscsiadm version 6.2.0.874-10"
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.ISCSISupported"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.iSCSISessionExistsToTargetIQN"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.getISCSISessionInfo"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.execCommand." args="[-m session]" command=iscsiadm
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.execCommand." command=iscsiadm error="exit status 21" output="iscsiadm: No active sessions."
time="2020-04-22T20:47:48Z" level=debug msg="No iSCSI session found."
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.getISCSISessionInfo"
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.iSCSISessionExistsToTargetIQN"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.EnsureISCSISession" hostDataIP=172.30.253.2
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.ISCSISupported"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.execCommand." args="[-V]" command=iscsiadm
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.execCommand." command=iscsiadm error="<nil>" output="iscsiadm version 6.2.0.874-10"
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.ISCSISupported"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.iSCSISessionExists"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.getISCSISessionInfo"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.execCommand." args="[-m session]" command=iscsiadm
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.execCommand." command=iscsiadm error="exit status 21" output="iscsiadm: No active sessions."
time="2020-04-22T20:47:48Z" level=debug msg="No iSCSI session found."
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.getISCSISessionInfo"
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.iSCSISessionExists"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.iSCSIDiscovery" portal=172.30.253.2
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.execCommand." args="[-m discovery -t sendtargets -p 172.30.253.2]" command=iscsiadm
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.execCommand." command=iscsiadm error="<nil>" output="172.30.253.2:3260,1067 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.13:3260,1094 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.12:3260,1093 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.11:3260,1092 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.10:3260,1091 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.9:3260,1090 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.8:3260,1089 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.7:3260,1088 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.6:3260,1087 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.5:3260,1086 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.4:3260,1085 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31\n172.30.253.3:3260,1084 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.2:3260,1067" PortalIP=172.30.253.2 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.13:3260,1094" PortalIP=172.30.253.13 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.12:3260,1093" PortalIP=172.30.253.12 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.11:3260,1092" PortalIP=172.30.253.11 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.10:3260,1091" PortalIP=172.30.253.10 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.9:3260,1090" PortalIP=172.30.253.9 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.8:3260,1089" PortalIP=172.30.253.8 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.7:3260,1088" PortalIP=172.30.253.7 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.6:3260,1087" PortalIP=172.30.253.6 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.5:3260,1086" PortalIP=172.30.253.5 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.4:3260,1085" PortalIP=172.30.253.4 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="Adding iSCSI discovery info." Portal="172.30.253.3:3260,1084" PortalIP=172.30.253.3 TargetName="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31"
time="2020-04-22T20:47:48Z" level=debug msg="<<<< osutils.iSCSIDiscovery"
time="2020-04-22T20:47:48Z" level=debug msg="Found matching iSCSI targets." Targets="[{172.30.253.2:3260,1067 172.30.253.2 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.13:3260,1094 172.30.253.13 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.12:3260,1093 172.30.253.12 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.11:3260,1092 172.30.253.11 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.10:3260,1091 172.30.253.10 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.9:3260,1090 172.30.253.9 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.8:3260,1089 172.30.253.8 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.7:3260,1088 172.30.253.7 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.6:3260,1087 172.30.253.6 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.5:3260,1086 172.30.253.5 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.4:3260,1085 172.30.253.4 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31} {172.30.253.3:3260,1084 172.30.253.3 iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31}]"
time="2020-04-22T20:47:48Z" level=debug msg=">>>> osutils.loginISCSITarget" IQN="iqn.1992-08.com.netapp:sn.2eaf0424b53311e9922700a098e1f7a7:vs.31" Portal=172.30.253.2
kincl commented 4 years ago

Please let me know what I can do to move this bug along, thanks!

gnarl commented 4 years ago

@kincl we just released support for bi-directional CHAP for ONTAP in Trident 20.04. Can you test if this workaround will fix your issue as you suggested previously?

kincl commented 4 years ago

@gnarl I can confirm that after converting our install to use CHAP with iSCSI that the initiators connect to the correct target portals as specified by SLM

gnarl commented 4 years ago

@kincl thanks for the update. We will look into the original issue you reported but I am glad you have a work around for now.

gnarl commented 4 years ago

@kincl We just did an investigation on this issue and think it was actually fixed with the Trident 20.04 release. The commit was made in February which included this new func.

Can you temporarily add a new backend that doesn't use CHAP and verify that Trident 20.04 fixes the issue for you?

gnarl commented 4 years ago

@kincl, were you able to check on this?

kincl commented 4 years ago

Hey @gnarl no I cannot test this easily since it would require setting up a new SVM but if you think it has been fixed I am okay with closing this issue

gnarl commented 4 years ago

@kincl, thanks for your response.

anungw commented 4 years ago

May I know whether the fix has been released? I'm using trident 20.07 & openshift 4.4 and still encountered the same issue. The environment is 4 nodes ontap cluster, ISCSI connection, and I noticed that trident is using non reporting-nodes the ISCSI LIFs rather than on the LIF in reporting nodes. Hence it won't find the LUN. e.g PV created from aggregate in node 1, the iscsi initiator in RHCOS is trying to login and scan from n3 ISCSI lif and the pod is stuck because it can't find the PV.

iscsi.service - Login and scanning of iSCSI devices Loaded: loaded (/usr/lib/systemd/system/iscsi.service; enabled; vendor preset: disabled) Active: active (exited) since Wed 2020-08-19 10:51:01 UTC; 3min 6s ago Docs: man:iscsiadm(8) man:iscsid(8) Process: 1495 ExecStart=/usr/sbin/iscsiadm -m node --loginall=automatic (code=exited, status=0/SUCCESS) Main PID: 1495 (code=exited, status=0/SUCCESS) Tasks: 0 (limit: 205335) Memory: 0B CPU: 0 CGroup: /system.slice/iscsi.service Aug 19 10:51:01 systemd[1]: Starting Login and scanning of iSCSI devices... Aug 19 10:51:01 iscsiadm[1495]: Logging in to [iface: default, target: iqn.1992-08.com.netapp:sn.d367e378c4d511ea86fc00a0984ef240:vs.13, portal: <redacted - node3 IP>,3260] Aug 19 10:51:01 iscsiadm[1495]: Login to [iface: default, target: iqn.1992-08.com.netapp:sn.d367e378c4d511ea86fc00a0984ef240:vs.13, portal: <redacted - node3 IP>,3260] successful. Aug 19 10:51:01 systemd[1]: Started Login and scanning of iSCSI devices.

gnarl commented 4 years ago

This issue is fixed with commit 594f9d and is included in Trident v20.10.0.