dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: add NVMeTCP connection parameter ctrl-loss-tmo=-1 to implement powerstore best practices #1459

Closed dancohen21 closed 1 month ago

dancohen21 commented 2 months ago

Bug Description

The Dell Linux host connectivity guide (https://elabnavigator.dell.com/vault/pdf/Linux.pdf?key=1725374107988) recommends the following on page 214:

By default, the Linux controller enters a reconnect state when it loses connection with the target. The default timeout for reconnecting is 10 minutes. However, a PowerStore node reboot may take more than 10 minutes. It is recommended to set ctrl-loss-tmo = -1 to keep the controller constantly reconnecting.

Per the SUSE documentation (https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-nvmeof.html): In case of a path loss, the NVMe subsystem tries to reconnect for a time period, defined by the ctrl-loss-tmo option of the nvme connect command.

I'm concerned that ctrl-loss-tmo = -1 will be required for the NVMeTCP connection to reconnect to PowerStore nodes during a PowerStore NDU (non-disruptive code upgrade). During an NDU the PowerStore nodes reboot one at a time, and a node may well be unavailable for longer than the default path timeout.

My novice reading of the code: the nvmeTCPConnect function in gonvme_tcp_fc.go does not include this parameter:

if duplicateConnect {
    exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort, "-D"})
} else {
    exe = nvme.buildNVMeCommand([]string{NVMeCommand, "connect", "-t", "tcp", "-n", target.TargetNqn, "-a", target.Portal, "-s", NVMePort})
}
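One way the requested change might look is sketched below. This is not the gonvme implementation: connectArgs is a hypothetical standalone helper, and the NQN and portal address in main are placeholders. The only thing taken from nvme-cli is the --ctrl-loss-tmo option of nvme connect, alongside the flags already present in the snippet above.

```go
package main

import (
	"fmt"
	"strconv"
)

// connectArgs builds the argument list for "nvme connect" over TCP.
// Hypothetical helper for illustration only; the real driver assembles
// its arguments inside nvmeTCPConnect.
func connectArgs(nqn, portal, port string, duplicateConnect bool, ctrlLossTmo int) []string {
	args := []string{
		"connect",
		"-t", "tcp",
		"-n", nqn,
		"-a", portal,
		"-s", port,
		// -1 keeps the controller reconnecting, per the Dell guide.
		"--ctrl-loss-tmo=" + strconv.Itoa(ctrlLossTmo),
	}
	if duplicateConnect {
		args = append(args, "-D")
	}
	return args
}

func main() {
	// Placeholder target values, not taken from any real array.
	fmt.Println(connectArgs("nqn.example:target", "192.0.2.10", "4420", true, -1))
}
```

Building the shared arguments once and appending -D conditionally also avoids duplicating the full argument list in both branches.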

If a change is needed, I also request that currently supported csi-powerstore driver builds be updated so that, for example, an OpenShift 4.14 environment using CSM-Operator 1.5.1 and CSI driver 2.10.1 can get this enhancement.

Logs

No logs available; see Dell SR 197072815.

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Perform a PowerStore code upgrade (NDU), for example from 3.6.0.0 to 3.6.1.2, with an OpenShift cluster attached and using PVs.

Expected Behavior

Hosts should be able to survive paths to storage going away and coming back during all normal data center operations

CSM Driver(s)

csi-powerstore 2.10.1

Installation Type

csm-operator 1.5.1

Container Storage Modules Enabled

No response

Container Orchestrator

OpenShift 4.14

Operating System

OpenShift Linux - RHCOS based on RHEL 9.2

csmbot commented 2 months ago

@dancohen21: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

suryagupta4 commented 1 month ago

Hi @dancohen21, can this request be tracked as part of https://github.com/dell/csm/issues/1465? That issue has a broader scope around NVMe best practices and already covers this request. Please update so that we can close this issue and track the work in the other one. Thanks.

suryagupta4 commented 1 month ago

link: 28462

donatwork commented 1 month ago

Resolving. New documentation for the csi-powerstore driver will be published in CSM 1.12, and the ctrl_loss_tmo timeout will be disabled for all NVMe connections.