Closed kincl closed 4 years ago
Please let me know what I can do to move this bug along, thanks!
@kincl we just released support for bi-directional CHAP for ONTAP in Trident 20.04. Can you test if this workaround will fix your issue as you suggested previously?
@gnarl I can confirm that after converting our install to use CHAP with iSCSI, the initiators connect to the correct target portals as specified by SLM.
@kincl thanks for the update. We will look into the original issue you reported, but I am glad you have a workaround for now.
@kincl We just did an investigation on this issue and think it was actually fixed with the Trident 20.04 release. The commit was made in February which included this new func.
Can you temporarily add a new backend that doesn't use CHAP and verify that Trident 20.04 fixes the issue for you?
@kincl, were you able to check on this?
Hey @gnarl, no, I cannot test this easily since it would require setting up a new SVM, but if you think it has been fixed I am okay with closing this issue.
@kincl, thanks for your response.
May I know whether the fix has been released? I'm using Trident 20.07 and OpenShift 4.4 and still encounter the same issue. The environment is a 4-node ONTAP cluster with iSCSI connections. I noticed that Trident is using the iSCSI LIFs on non-reporting nodes rather than the LIFs on the reporting nodes, so it cannot find the LUN. For example, with a PV created from an aggregate on node 1, the iSCSI initiator in RHCOS tries to log in and scan from the n3 iSCSI LIF, and the pod is stuck because it cannot find the PV.
iscsi.service - Login and scanning of iSCSI devices
Loaded: loaded (/usr/lib/systemd/system/iscsi.service; enabled; vendor preset: disabled)
Active: active (exited) since Wed 2020-08-19 10:51:01 UTC; 3min 6s ago
Docs: man:iscsiadm(8)
man:iscsid(8)
Process: 1495 ExecStart=/usr/sbin/iscsiadm -m node --loginall=automatic (code=exited, status=0/SUCCESS)
Main PID: 1495 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 205335)
Memory: 0B
CPU: 0
CGroup: /system.slice/iscsi.service
Aug 19 10:51:01
This issue is fixed with commit 594f9d and is included in Trident v20.10.0.
Describe the bug
We are running Trident with an ontap-san backend connecting to an ONTAP cluster with 12 iSCSI data LIFs (one per node, as required) without CHAP auth. When we create a PVC and schedule a pod that mounts it on a node with no existing sessions, the CSI driver on the node creates sessions to all portal IPs instead of just the data LIFs that SLM reports. When the pod completes and the PVC is unscheduled from the node, the CSI driver only logs out of the data LIFs that SLM is reporting, which in our case leaves 10 sessions to the wrong data LIFs. The next time a PVC needs to mount, the CSI driver sees the existing sessions to the wrong data LIFs and proceeds without logging in, so the second PVC never mounts.
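The session leak described above can be illustrated with a small Go sketch. This is not Trident code: the function leakedSessions and the portal names p1 through p12 are hypothetical, chosen only to show why logging in to every portal but logging out of only the SLM reporting portals leaves stale sessions behind.

```go
package main

import (
	"fmt"
	"sort"
)

// leakedSessions returns the portals that were logged in to but never
// logged out of. Hypothetical helper for illustration only.
func leakedSessions(loggedIn, loggedOut []string) []string {
	out := make(map[string]bool, len(loggedOut))
	for _, p := range loggedOut {
		out[p] = true
	}
	var leaked []string
	for _, p := range loggedIn {
		if !out[p] {
			leaked = append(leaked, p)
		}
	}
	sort.Strings(leaked)
	return leaked
}

func main() {
	// Assumed names p1..p12 stand in for the 12 iSCSI data LIFs.
	var all []string
	for i := 1; i <= 12; i++ {
		all = append(all, fmt.Sprintf("p%d", i))
	}
	// Assumed: SLM reports two portals for the LUN.
	reporting := []string{"p1", "p2"}

	// Login went to every portal, logout only to the reporting ones.
	leaked := leakedSessions(all, reporting)
	fmt.Println(len(leaked), "stale sessions remain:", leaked)
}
```

With 12 portals logged in to and 2 logged out of, the difference is the 10 leftover sessions reported in this issue.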
Expected behavior
Remounting a PVC on a host with a new pod should succeed as well as mounting new PVCs on the same host.
More specifically, if we mount an iSCSI PVC we should only log in to the portals identified by SLM, and we should definitely log out of all sessions that we logged in to.
Additional context
I believe the issue is at utils/osutils.go#L142: we should be using bkportal instead of portalIps.
Below are logs from the first successful mount on a node with no existing sessions. Note that the gRPC request returns p1 and p2, which are the correct SLM reporting portals, and they are logged as targetPortals. If I am correct, this would work if we were using CHAP, since the CHAP code correctly uses the SLM reporting portals to connect.
Logs from the CSI driver pod on the first mount: