NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Option to disable iSCSI session SelfHealing feature #864

Closed ysakashita closed 3 months ago

ysakashita commented 7 months ago

Describe the solution you'd like

I would like an option to disable the feature of "automation to detect and fix broken or stale iSCSI sessions on host nodes" in Trident v23.01. The feature may log out iSCSI sessions at the wrong time, risking a serious incident. For example, if this function logs out an iSCSI session at the exact moment a path is being switched under multipath, the number of live paths drops to zero, leading to a serious failure. Therefore, I would like an option to disable this function.

Describe alternatives you've considered

A mature iSCSI session self-healing implementation should be provided. Until the functionality matures, however, it should be removed from Trident or turned off. Alternatively, Trident could skip iSCSI session self-healing entirely and leave session recovery to open-iscsi.

Additional context None

rohit-arora-dev commented 7 months ago

Hello @ysakashita

I agree there should be an option to disable or modify this feature. The daemonset takes two inputs: iscsi_self_healing_interval (default: 5 minutes) and iscsi_self_healing_wait_time (default: 7 minutes).

In the meantime, the only way to disable this feature is to first disable the Trident Operator (by setting its replica count to 0) and then pass the iscsi_self_healing_interval=-1 option to the daemonset. Alternatively, iscsi_self_healing_wait_time can be set to a higher value to delay the logout.

Ideally, both the configuration parameters (iscsi_self_healing_interval, iscsi_self_healing_wait_time) should be exposed via operator as well as tridentctl installation to disable or modify iSCSI self-healing behaviour.

rohit-arora-dev commented 7 months ago

Minor correction: the value has to be 0, not -1. So, iscsi_self_healing_interval=0.

ysakashita commented 7 months ago

Thank you for the configuration parameters. Your suggestion does not apply to the Trident Operator, does it? Unfortunately, I am using the Trident Operator, so the daemonset is installed by the operator with its own configuration parameters.

I would like the disable option (or a tunable value) to be provided not only in tridentctl install (custom YAML) but also in the Trident Operator.

ysakashita commented 7 months ago

IMO, you can add the --iscsi_self_healing_interval=0 parameter to the following code section so that the Trident Operator also supports the disable option. https://github.com/NetApp/trident/blob/v23.10.0/cli/k8s_client/yaml_factory.go#L934-L935

rohit-arora-dev commented 7 months ago

@ysakashita

As part of the enhancement, I agree there should be an option in future releases of Trident that allows users to override the default behaviour (e.g. disableISCSISelfHealing: true) and disable iSCSI self-healing via the Trident Operator as well as Helm. The Operator would then set --iscsi_self_healing_interval=0 in yaml_factory.go, so users would not need to do it manually.

This option does not exist today, so for now the only ways to achieve this are:

  1. For tridentctl-based installations: Use custom YAML-based installation and set --iscsi_self_healing_interval=0 on the daemonset.
  2. For Trident Operator-based installations (after the installation):
     a. Disable the Trident Operator by setting the Trident Operator deployment replica count to zero.
     b. Patch the Trident daemonset with --iscsi_self_healing_interval=0.
     c. Please do not re-enable the Trident Operator or increase its replica count to 1.
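As a rough illustration of the "patch the daemonset" step, the sketch below builds an RFC 6902 JSON Patch document that appends the flag to a container's argument list. It is only a sketch: the function name `build_disable_patch` is hypothetical, and the container index (0 here) is an assumption that must be checked against the actual daemonset spec before use.

```python
import json
from typing import Any, Dict, List


def build_disable_patch(container_index: int = 0) -> List[Dict[str, Any]]:
    """Build a JSON Patch that appends the disable flag to a container's args.

    Assumption: the target container already has an `args` list and sits at
    `container_index` in the pod template; verify both on a real cluster.
    """
    if container_index < 0:
        raise ValueError("container_index must not be negative")
    return [
        {
            "op": "add",
            # A trailing "-" in a JSON Patch path means "append to the array".
            "path": f"/spec/template/spec/containers/{container_index}/args/-",
            "value": "--iscsi_self_healing_interval=0",
        }
    ]


if __name__ == "__main__":
    # Print the patch so it can be pasted into a patch command.
    print(json.dumps(build_disable_patch()))
```

The printed document could then be supplied to `kubectl patch daemonset <name> -n <namespace> --type=json -p '<patch>'`, where the daemonset name and namespace are placeholders. Because the patch appends, applying it twice would add the flag twice, so it is worth inspecting the existing arguments first.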

Please note: This is a workaround. The downside of disabling the Trident Operator is that you lose its ability to remediate Trident installation issues, perform automatic upgrades, and run the watches that ensure Trident stays in the desired state.

ysakashita commented 7 months ago

@ntap-arorar Your enhancement idea looks fine to me. And thanks for the workaround.

I can use this workaround in my experimental environment. However, we are providing and managing over 1200 Kubernetes clusters for our customers, so I will wait for the official enhancement.

Please let me know which versions will support this feature once NetApp has a plan.

rohit-arora-dev commented 5 months ago

The fix for this issue has been merged: https://github.com/NetApp/trident/commit/f1d7e120c81dc33bd04c814e861cd6086b21a20c.

Two configuration parameters have been added to Trident installers (Operator, tridentctl, and Helm):

iSCSI Self-Healing Interval: Changing this value influences at what interval iSCSI Self-healing is run (default 5 mins). A user may configure it to run more often by setting a lower number or less frequently by configuring it to a larger value. Setting this to 0s stops iSCSI self-healing completely.
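To make the interval semantics concrete, here is a minimal Python sketch that parses a simplified subset of duration strings such as "10m" or "0s" and treats a zero interval as "disabled". It is not Trident's implementation (Trident is written in Go); `parse_duration` and `self_healing_enabled` are hypothetical names, and only the units h, m and s are handled.

```python
import re

# Seconds per unit for the simplified duration syntax used in this sketch.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Parse strings such as "10m", "0s" or "1h30m" into whole seconds."""
    pos = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # There is unparsed text between two tokens.
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        # Nothing matched, or trailing characters were left over.
        raise ValueError(f"invalid duration: {text!r}")
    return total


def self_healing_enabled(interval: str) -> bool:
    """Assumption for this sketch: a zero interval means the check never runs."""
    return parse_duration(interval) > 0
```

Under these assumptions, `self_healing_enabled("5m")` is true and `self_healing_enabled("0s")` is false, which mirrors the described behaviour of stopping self-healing completely.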

iSCSI Self-Healing Wait Time: Changing this value influences how long iSCSI self-healing waits before logging out of an unhealthy session and trying to log in again (default 7 mins). A larger value makes sessions identified as unhealthy wait longer before they are logged out and a login is re-attempted; a smaller value makes the logout and login happen earlier.
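The wait-time behaviour can be pictured as a simple elapsed-time check. The following Python sketch is an assumption-laden illustration, not Trident's code: `should_relogin` is a hypothetical helper, timestamps are plain seconds, and whether the real comparison is inclusive is not stated in this thread.

```python
def should_relogin(first_seen_unhealthy: float, now: float,
                   wait_time_seconds: float) -> bool:
    """Return True once a session has been unhealthy for at least the wait time.

    Assumption: the session is logged out and logged in again only after it
    has stayed unhealthy for the whole wait time (inclusive comparison).
    """
    if now < first_seen_unhealthy:
        raise ValueError("now must not be earlier than first_seen_unhealthy")
    return (now - first_seen_unhealthy) >= wait_time_seconds


if __name__ == "__main__":
    default_wait = 7 * 60  # the 7-minute default mentioned above, in seconds
    print(should_relogin(0.0, 300.0, default_wait))  # unhealthy for 5 minutes
    print(should_relogin(0.0, 420.0, default_wait))  # unhealthy for 7 minutes
```

With the 7-minute default, a session that has been unhealthy for 5 minutes is left alone in this sketch, while one that has been unhealthy for 7 minutes qualifies for logout and re-login. Raising the wait time widens that grace period.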

e.g. (Operator)

iscsiSelfHealingInterval: 10m
iscsiSelfHealingWaitTime: 15m

e.g. (tridentctl)

--iscsi-self-healing-interval=10m
--iscsi-self-healing-wait-time=15m

uppuluri123 commented 3 months ago

Fixed in 24.02.