Closed thomasmeeus closed 2 years ago
Hello @thomasmeeus
I noticed you are running OpenShift 3.11 with Trident v19.10.1, which is an unsupported configuration. There have been several enhancements to ONTAP SAN multipathing in subsequent releases, so I would recommend upgrading to a more recent version that is compatible with OCP 3.11. Trident v21.04.1 is the latest release that supports OCP 3.11.
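In case it helps, here is a minimal sketch of checking the installed version before planning the upgrade (assuming tridentctl is on the PATH and Trident runs in the trident namespace; the release link is shown purely for illustration):

```sh
# Report the versions of the tridentctl client and the running Trident server
tridentctl version -n trident

# The upgrade itself is normally driven by the installer bundled with the
# target release, e.g. the v21.04.1 installer tarball from the releases page:
# https://github.com/NetApp/trident/releases/tag/v21.04.1
```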
Closing this issue as there hasn't been a response from the original poster.
Hi there 👋
we have an older OpenShift cluster that's showing some weird storage behaviour. At given points (x hours after a node reboot, after a small network hiccup, ...) we see massive I/O wait on the OpenShift nodes. Some multipath devices are then listed in a
failed faulty running
state. The amount of I/O wait scales linearly with the number of iSCSI devices in the faulty state; we see numbers between 20 and 40%. Despite the load and the failed devices, the cluster keeps functioning normally and the iSCSI LUNs are still accessible in the pods. The nodes don't recover from this issue on their own; we have to remove the iSCSI devices by hand. This happens frequently and is hard to debug: it could be Red Hat, OpenShift, NetApp, Trident, or our own config.
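For anyone comparing notes, a quick way to correlate the faulty-path count with the I/O wait is a sketch like the following (assuming the stock RHEL 7 device-mapper-multipath and sysstat tools):

```sh
# Count multipath paths currently reported as "failed faulty running"
multipath -ll | grep -c "failed faulty running"

# Sample CPU utilisation a few times; compare the %iowait column with the count above
iostat -c 5 3
```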
We have a working fix for when the issue occurs, but the I/O wait is getting annoying and we'd like to implement a proper fix rather than keep applying the workaround. I hope this story sounds familiar to someone; I'm looking for a push in the right direction from some storage experts.
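For reference, manual cleanup of stale iSCSI/multipath devices usually looks roughly like the outline below. This is a generic sketch, not necessarily the exact workaround attached at the end of this issue; the device name sdx and the WWID are placeholders.

```sh
# Identify the paths stuck in the faulty state
multipath -ll | grep -B 4 "failed faulty running"

# Remove a stale SCSI device node so multipath drops the dead path (sdx is a placeholder)
echo 1 > /sys/block/sdx/device/delete

# Flush a multipath map that has no healthy paths left (WWID is a placeholder)
multipath -f 3600a098038303053743f463045727069

# Rescan the iSCSI sessions so healthy paths are rediscovered
iscsiadm -m session --rescan
```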
I found https://github.com/NetApp/trident/issues/101 & https://github.com/NetApp/trident/issues/133 which seem related or at least show the same symptoms.
Versions
- OpenShift: 3.11.394
- Trident: 19.10.1
- RHEL: 7.9
- NetApp: ONTAP Select - NetApp Release 9.6P1: Fri Jul 19 02:29:12 UTC 2019
Logs
DMESG
Multipath -ll
lsscsi
multipath.conf
Applied workaround