LightBitsLabs / los-csi

Lightbits LightOS CSI Plugin

OS error: no local block device for volume #24

Open · majst01 opened this issue 1 year ago

majst01 commented 1 year ago

Rarely we see this error message in some clusters:

OS error: no local block device for volume

After that, it is no longer possible to create and mount new PVCs.

We found that it is possible to recover from this situation without rebooting the affected worker node by running nvme list twice.

First attempt:

[Mon Jun 19 11:41:37 2023] nvme nvme4: rescanning namespaces.
[Mon Jun 19 12:50:41 2023] nvme nvme3: rescanning namespaces.
[Mon Jun 19 12:50:41 2023] nvme nvme4: rescanning namespaces.

$ nvme list

[Mon Jun 19 12:54:05 2023] nvme nvme2: starting error recovery
[Mon Jun 19 12:54:05 2023] nvme nvme2: Reconnecting in 10 seconds...
[Mon Jun 19 12:54:05 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:09 2023] block nvme2n13: no usable path - requeuing I/O
[Mon Jun 19 12:54:09 2023] block nvme2n13: no usable path - requeuing I/O
[Mon Jun 19 12:54:10 2023] block nvme2n8: no usable path - requeuing I/O
[Mon Jun 19 12:54:10 2023] block nvme2n8: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:11 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:13 2023] block nvme2n6: no usable path - requeuing I/O
[Mon Jun 19 12:54:13 2023] block nvme2n6: no usable path - requeuing I/O
[Mon Jun 19 12:54:14 2023] block nvme2n1: no usable path - requeuing I/O
[Mon Jun 19 12:54:15 2023] nvme nvme2: creating 48 I/O queues.
[Mon Jun 19 12:54:16 2023] nvme nvme2: mapped 48/0/0 default/read/poll queues.
[Mon Jun 19 12:54:16 2023] nvme nvme2: Successfully reconnected (1 attempt)
[Mon Jun 19 12:54:16 2023] block nvme2n14: no available path - failing I/O
[Mon Jun 19 12:54:16 2023] block nvme2n9: no available path - failing I/O
[Mon Jun 19 12:54:16 2023] blk_update_request: operation not supported error, dev nvme2c2n15, sector 0 op 0x0:(READ) flags 0x2000000 phys_seg 32 prio class 0
[Mon Jun 19 12:55:25 2023] EXT4-fs (nvme2n16): mounted filesystem with ordered data mode. Opts: (null)

Then the second attempt:

$ nvme list
/dev/nvme2n1: failed to obtain nvme info: Operation not permitted
/dev/nvme2n10: failed to obtain nvme info: Operation not permitted
/dev/nvme2n13: failed to obtain nvme info: Operation not permitted
/dev/nvme2n14: failed to obtain nvme info: Operation not permitted
/dev/nvme2n15: failed to obtain nvme info: Operation not permitted
/dev/nvme2n2: failed to obtain nvme info: Operation not permitted
/dev/nvme2n4: failed to obtain nvme info: Operation not permitted
/dev/nvme2n6: failed to obtain nvme info: Operation not permitted
/dev/nvme2n7: failed to obtain nvme info: Operation not permitted
/dev/nvme2n8: failed to obtain nvme info: Operation not permitted
/dev/nvme2n9: failed to obtain nvme info: Operation not permitted
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTLN9214003V1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB    512   B +  0 B   VDV10152
/dev/nvme1n1     BTLN921400371P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB    512   B +  0 B   VDV10152
/dev/nvme2n11    46b8b1d8f0c22a37     Lightbits LightOS                        2189       80.53  GB /  80.53  GB      4 KiB +  0 B   2.3
/dev/nvme2n12    46b8b1d8f0c22a37     Lightbits LightOS                        1223       21.47  GB /  21.47  GB      4 KiB +  0 B   2.3
/dev/nvme2n3     46b8b1d8f0c22a37     Lightbits LightOS                        562         5.37  GB /   5.37  GB      4 KiB +  0 B   2.3
/dev/nvme2n5     46b8b1d8f0c22a37     Lightbits LightOS                        850         1.07  GB /   1.07  GB      4 KiB +  0 B   2.3

(after the recovery):

$ nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTLN9214003V1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB    512   B +  0 B   VDV10152
/dev/nvme1n1     BTLN921400371P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB    512   B +  0 B   VDV10152
/dev/nvme2n1     46b8b1d8f0c22a37     Lightbits LightOS                        1471        5.37  GB /   5.37  GB      4 KiB +  0 B   2.3
/dev/nvme2n10    46b8b1d8f0c22a37     Lightbits LightOS                        201        85.90  GB /  85.90  GB      4 KiB +  0 B   2.3
/dev/nvme2n11    46b8b1d8f0c22a37     Lightbits LightOS                        2189       80.53  GB /  80.53  GB      4 KiB +  0 B   2.3
/dev/nvme2n12    46b8b1d8f0c22a37     Lightbits LightOS                        1223       21.47  GB /  21.47  GB      4 KiB +  0 B   2.3
/dev/nvme2n13    46b8b1d8f0c22a37     Lightbits LightOS                        71          6.44  GB /   6.44  GB      4 KiB +  0 B   2.3
/dev/nvme2n16    46b8b1d8f0c22a37     Lightbits LightOS                        363         9.66  GB /   9.66  GB      4 KiB +  0 B   2.3
/dev/nvme2n2     46b8b1d8f0c22a37     Lightbits LightOS                        1468        5.37  GB /   5.37  GB      4 KiB +  0 B   2.3
/dev/nvme2n3     46b8b1d8f0c22a37     Lightbits LightOS                        562         5.37  GB /   5.37  GB      4 KiB +  0 B   2.3
/dev/nvme2n4     46b8b1d8f0c22a37     Lightbits LightOS                        29          5.37  GB /   5.37  GB      4 KiB +  0 B   2.3
/dev/nvme2n5     46b8b1d8f0c22a37     Lightbits LightOS                        850         1.07  GB /   1.07  GB      4 KiB +  0 B   2.3
/dev/nvme2n6     46b8b1d8f0c22a37     Lightbits LightOS                        184        42.95  GB /  42.95  GB      4 KiB +  0 B   2.3
/dev/nvme2n7     46b8b1d8f0c22a37     Lightbits LightOS                        194         1.07  GB /   1.07  GB      4 KiB +  0 B   2.3
/dev/nvme2n8     46b8b1d8f0c22a37     Lightbits LightOS                        51        214.75  GB / 214.75  GB      4 KiB +  0 B   2.3

To solve this problem we are thinking of adding an additional DaemonSet to all of our clusters which executes

echo 1 > /sys/devices/virtual/nvme-fabrics/ctl/nvme*/rescan_controller

at a regular interval.
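A rough sketch of what such a DaemonSet could run (the script and the 10-minute interval are only placeholders, and it assumes the host's /sys is mounted into the container):

#!/bin/sh
# Periodically ask every NVMe-oF controller to rescan its namespaces.
while true; do
    for ctl in /sys/devices/virtual/nvme-fabrics/ctl/nvme*/rescan_controller; do
        [ -w "$ctl" ] && echo 1 > "$ctl"
    done
    sleep 600
done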

But we think it would be better to have this running in a goroutine in the lb-csi-node DaemonSet.

I am happy to provide a PR which implements this.

WDYT?

sagi-lb commented 1 year ago

@majst01 what kernel version are you running? There shouldn't be a need to do any manual rescan to paper over an nvme driver issue.

majst01 commented 1 year ago

> @majst01 what kernel version are you running? There shouldn't be a need to do any manual rescan to paper over an nvme driver issue.

We actually have ~1000 worker nodes in the fleet with different kernel versions in place. We can only provide updated worker images; whether customers update/replace their worker nodes with more recent ones is out of our control.

@sagi-lb are you aware of a kernel issue which leads to such a situation? As I said, this is a quite rare issue, but it hurts.

sagi-lb commented 1 year ago

@majst01 there have been some issues with respect to error recovery in lots of kernel versions; usually stable kernels catch up with the fixes, but this is not always the case. It would help to know the kernel version and nvme-cli version running on the particular host that you quoted as having run into the problem.

P.S. I very much doubt that triggering rescan_controller is a viable solution; it may have happened to work, but I doubt it would work every time. What it does is simply trigger a namespace rescan (i.e. a rescan of the nvme namespaces, i.e. volumes, of a particular controller). The fact that it actually did anything suggests that the rescan was either not triggered or failed. But when the host reconnects it will do this rescan every time.

sagi-lb commented 1 year ago

@majst01 btw, do you still run an infinite nvme IO timeout? That is possibly a source of issues of this kind, where a process relies on I/O timing out or failing in order to make forward progress (in this case, namespace scanning).

I would very much suggest leaving the io_timeout modparam at its default value.
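For reference, the values currently in effect on a host can be checked via sysfs, and pinned explicitly with a modprobe option if needed (30s/60s are the usual kernel defaults for io_timeout/admin_timeout):

$ cat /sys/module/nvme_core/parameters/io_timeout
$ cat /sys/module/nvme_core/parameters/admin_timeout

# e.g. in /etc/modprobe.d/nvme_core.conf (file name is just an example)
options nvme_core io_timeout=30 admin_timeout=60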

majst01 commented 1 year ago

> @majst01 btw, do you still run an infinite nvme IO timeout? That is possibly a source of issues of this kind, where a process relies on I/O timing out or failing in order to make forward progress (in this case, namespace scanning).
>
> I would very much suggest leaving the io_timeout modparam at its default value.

Will do!

sagi-lb commented 1 year ago

Note @majst01 that the IO timeout modparam may not directly relate to this particular issue, because the namespace scans issue admin commands. Do you also set admin_timeout to infinity (the default is 60 seconds)?

majst01 commented 1 year ago

admin_timeout is not touched by us

majst01 commented 1 year ago

> @majst01 btw, do you still run an infinite nvme IO timeout? That is possibly a source of issues of this kind, where a process relies on I/O timing out or failing in order to make forward progress (in this case, namespace scanning).
>
> I would very much suggest leaving the io_timeout modparam at its default value.

@sagi-lb One question regarding io_timeout was raised in our team. There is always a firewall between the worker nodes which mount the Lightbits volumes and the Lightbits storage servers, and from time to time this firewall needs to be updated/replaced, which causes a short network downtime in the worker -> storage communication. This downtime is usually in the range of 10 seconds but could be longer. What if the downtime is longer than io_timeout? Will the filesystem get corrupted, or will I/O simply block for the connected application for io_timeout and then the caller get cancelled?

sagi-lb commented 1 year ago

@majst01 Even when the I/O timeout expires, the application will not see the error, because the nvme multipath layer will requeue all pending commands when a path (i.e. a live controller) becomes available again (it reconnects every 10 seconds or so). When a host sees an error in the connection to a controller, it tears down the connection and cancels all pending I/O (and the nvme-multipath layer requeues them because it still has alternate paths that just happen to be inaccessible at this time, so I/O will block).

In fact, even without I/O timeout, there is a keep-alive command that runs periodically to test connectivity, and its timeout is 5 or 10 seconds (default depends on the kernel version). So this probably already happens today.

I suggest applying this change and testing it in your dev clusters, and verifying that nothing spurious happens when testing network failures.
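One simple way to emulate such an outage on a dev worker node (assuming NVMe/TCP on its default port 4420; adjust to your environment) would be something like:

# block NVMe/TCP traffic towards the storage servers for ~60s, then restore
iptables -I OUTPUT -p tcp --dport 4420 -j DROP
sleep 60
iptables -D OUTPUT -p tcp --dport 4420 -j DROP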

george-lb commented 1 year ago

Hi @majst01 : did this change help you?

majst01 commented 1 year ago

> Hi @majst01 : did this change help you?

Still preparing the next OS Image with the changed io_timeout setting here: https://github.com/metal-stack/metal-images/pull/190

But we had this issue once again on one of the production worker nodes, and manually triggering a rescan solved it.

We still think a controller rescan is a valid solution, at least for older kernels.

sagi-lb commented 1 year ago

@majst01 would it be possible to share the kernel version and dmesg output from the worker node that experienced this? It would help to understand whether the rescan you run always works or only works in certain cases...

majst01 commented 1 year ago

> @majst01 would it be possible to share the kernel version and dmesg output from the worker node that experienced this? It would help to understand whether the rescan you run always works or only works in certain cases...

Yes, once it happens again (thank god it's quite seldom) I will tell them to collect these.

majst01 commented 1 year ago

OK, regarding io_timeout, I changed that in our new debian-12 images to 300s and tested cutting the network connectivity for longer than that. The volumes came back to normal once the network was reachable again.

see logs: https://github.com/metal-stack/metal-images/pull/190#issuecomment-1611229647

So this is an improvement over something we were afraid of for a long time, thanks @sagi-lb.

The original issue here happens seldom, on some of the older kernels; there we will fix it manually by calling

find /sys/devices/virtual/nvme-fabrics/ctl/ -name rescan_controller -exec bash -c 'echo 1 > "$0"' {} \;

in case this incident happens again.