Closed dancohen21 closed 1 month ago
@dancohen21: Thank you for submitting this issue!
The issue is currently awaiting triage. Please make sure you have given us as much context as possible.
If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.
We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.
Hi @dancohen21, can this request be part of https://github.com/dell/csm/issues/1465? That issue has a broader scope around NVMe best practices and already covers this one. Please update it so that we can close this issue and track everything in the other one, thanks.
link: 28462
Resolving. New documentation for the csi-powerstore driver will be published in CSM 1.12 and the ctrl_loss_tmo will be disabled for any NVMe connections.
Bug Description
The Dell Linux host connectivity guide (page 214, https://elabnavigator.dell.com/vault/pdf/Linux.pdf?key=1725374107988) recommends the following:
By default, the Linux controller enters a reconnect state when it loses connection with the target. The default timeout for reconnecting is 10 minutes. However, a PowerStore node reboot may take more than 10 minutes. It is recommended to set ctrl-loss-tmo = -1 to keep the controller constantly reconnecting.
Per this SUSE documentation [https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-nvmeof.html]: "In case of a path loss, the NVMe subsystem tries to reconnect for a time period, defined by the ctrl-loss-tmo option of the nvme connect command."
I'm concerned that ctrl-loss-tmo = -1 will be required for the NVMeTCP connection to reconnect to PowerStore nodes during a PowerStore NDU (non-disruptive code upgrade). During an NDU the PowerStore nodes reboot one at a time, and a node may well be unavailable for longer than the default path timeout.
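To put rough numbers on that concern, here is a minimal Go sketch of the retry budget implied by ctrl-loss-tmo. It rests on assumptions that are not stated in this issue: that the Linux NVMe fabrics layer derives its maximum reconnect count by dividing ctrl-loss-tmo by the reconnect delay (rounding up), that a negative value means retry forever, and that the reconnect delay is 10 seconds. The 900-second outage is an illustrative figure only, not a measured PowerStore reboot time, and the function names are hypothetical.

```go
package main

import "fmt"

// maxReconnects returns the number of reconnect attempts allowed before the
// controller is torn down. ASSUMPTION: this mirrors the kernel's behaviour
// (ceil(ctrl-loss-tmo / reconnect-delay), negative timeout = unlimited).
// reconnectDelaySec must be greater than zero.
func maxReconnects(ctrlLossTmoSec, reconnectDelaySec int) int {
	if ctrlLossTmoSec < 0 {
		return -1 // unlimited retries
	}
	return (ctrlLossTmoSec + reconnectDelaySec - 1) / reconnectDelaySec
}

// survivesOutage reports whether the retry budget covers an outage of the
// given length, under the same assumptions as maxReconnects.
func survivesOutage(ctrlLossTmoSec, reconnectDelaySec, outageSec int) bool {
	n := maxReconnects(ctrlLossTmoSec, reconnectDelaySec)
	if n < 0 {
		return true
	}
	return n*reconnectDelaySec >= outageSec
}

func main() {
	// Default 10-minute timeout with an assumed 10-second reconnect delay.
	fmt.Println(maxReconnects(600, 10)) // 60
	// Hypothetical 15-minute node outage during an NDU.
	fmt.Println(survivesOutage(600, 10, 900)) // false
	fmt.Println(survivesOutage(-1, 10, 900))  // true
}
```

Under these assumptions, the default budget runs out after about 10 minutes, so any node outage longer than that would leave the path removed rather than reconnecting.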
My novice reading of the code: the nvmeTCPConnect function in gonvme_tcp_fc.go does not include this parameter:
    if duplicateConnect {
        exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort, "-D"})
    } else {
        exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort})
    }
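For illustration only, here is a self-contained sketch of how the argument list could carry the timeout. This is not the gonvme implementation: buildTCPConnectArgs and its parameters are hypothetical names, and the sketch only assembles the argument slice without running anything. It uses the long form --ctrl-loss-tmo=<value> from nvme-cli so that the negative value cannot be mistaken for another option.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildTCPConnectArgs is a hypothetical helper (not part of gonvme) that
// assembles the arguments for "nvme connect" over TCP, including an explicit
// controller-loss timeout. Pass ctrlLossTmo = -1 to keep reconnecting forever.
func buildTCPConnectArgs(nvmeBinary, nqn, portal, port string, duplicateConnect bool, ctrlLossTmo int) []string {
	args := []string{
		nvmeBinary, "connect",
		"-t", "tcp",
		"-n", nqn,
		"-a", portal,
		"-s", port,
		// Long form with "=" keeps "-1" attached to the option it belongs to.
		"--ctrl-loss-tmo=" + strconv.Itoa(ctrlLossTmo),
	}
	if duplicateConnect {
		args = append(args, "-D") // allow a duplicate connection, as in the quoted code
	}
	return args
}

func main() {
	// Placeholder NQN and portal values, for demonstration only.
	args := buildTCPConnectArgs("nvme", "nqn.example", "10.0.0.1", "4420", false, -1)
	fmt.Println(args)
}
```

Building the slice once and appending -D conditionally also avoids duplicating the whole argument list in both branches.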
If a change is needed, I also request that the currently supported csi-powerstore driver builds be updated so that, for example, an OpenShift 4.14 environment using CSM-Operator 1.5.1 and CSI driver 2.10.1 can get this enhancement.
Logs
No logs available; see Dell SR 197072815.
Screenshots
No response
Additional Environment Information
No response
Steps to Reproduce
Perform a PowerStore code upgrade (NDU), for example from 3.6.0.0 to 3.6.1.2, with an OpenShift cluster attached and using PVs.
Expected Behavior
Hosts should be able to survive paths to storage going away and coming back during all normal data center operations
CSM Driver(s)
csi-powerstore 2.10.1
Installation Type
csm-operator 1.5.1
Container Storage Modules Enabled
No response
Container Orchestrator
OpenShift 4.14
Operating System
OpenShift Linux - RHCOS based on RHEL 9.2