Open majst01 opened 1 year ago
@majst01 what kernel version are you running? There shouldn't be a need to do any manual rescan to paper over an nvme driver issue.
We actually have ~1000 worker nodes in the fleet with different kernel versions in place. We can only provide updated worker images; whether customers update/replace their worker nodes with more recent ones is out of our control.
@sagi-lb are you aware of a kernel issue which leads to such a situation? As I said, this is a quite rare issue, but it hurts.
@majst01 there have been some issues with respect to error recovery in lots of kernel versions; usually stable kernels catch up with the fixes, but this is not always the case. It would help to know the kernel version and nvme-cli version you run on the particular host that you quoted as having run into the problem.
p.s. I very much doubt that triggering controller_rescan is a viable solution; it may have happened to work, but I doubt it would work every time. What it does is simply trigger a namespace rescan (i.e. a rescan of the nvme namespaces of a particular controller, i.e. the volumes). The fact that it actually did anything suggests that it was either not triggered, or failed. But when the host reconnects it will do this rescan every time.
@majst01 btw, do you still run with an infinite nvme I/O timeout? That is possibly a source for issues of this kind, where a process that relies on I/O timing out or failing in order to make forward progress (in this case namespace scanning) gets stuck.
I would very much suggest leaving the io_timeout modparam at its default value.
Will do!
Note @majst01 that the IO timeout modparam may not directly relate to this particular issue, because the namespace scans issue admin commands. Do you also set admin_timeout to infinity (the default is 60 seconds)?
admin_timeout is not touched by us
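For reference, both timeouts are exposed as nvme_core module parameters, so a node can verify what it is actually running with. Below is a minimal Go sketch, assuming the usual /sys/module/nvme_core/parameters/ layout; the helper is illustrative only and not part of lb-csi-node:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readNvmeCoreParam returns the current value of an nvme_core module parameter,
// e.g. io_timeout or admin_timeout (values are in seconds).
func readNvmeCoreParam(name string) (string, error) {
	b, err := os.ReadFile("/sys/module/nvme_core/parameters/" + name)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	for _, p := range []string{"io_timeout", "admin_timeout"} {
		v, err := readNvmeCoreParam(p)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", p, err)
			continue
		}
		fmt.Printf("nvme_core.%s = %s\n", p, v)
	}
}
```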
@sagi-lb One question regarding io_timeout was raised in our team: there is always a firewall between the worker nodes, which mount the Lightbits volumes, and the Lightbits storage servers, and from time to time this firewall needs to be updated/replaced, which causes a short network downtime in the worker -> storage communication.
This downtime is usually in the range of 10 seconds but could be longer. What if the downtime is longer than io_timeout? Will the filesystem get corrupted, or will I/O simply block for the connected application for io_timeout and then the caller gets cancelled?
@majst01 Even when the I/O timeout expires, the application will not see the error, because the nvme multipath layer will requeue all pending commands once a path (i.e. a live controller) becomes available again (the host reconnects every 10 seconds or so). When a host sees an error in the connection to a controller, it tears down the connection and cancels all pending I/O; the nvme-multipath layer requeues those commands because it still has alternate paths, which just happen to be inaccessible at this time, so I/O will block.
In fact, even without an I/O timeout, there is a keep-alive command that runs periodically to test connectivity, and its timeout is 5 or 10 seconds (the default depends on the kernel version). So this probably already happens today.
I suggest applying this change and testing it in your dev clusters, and verifying that nothing spurious happens when testing network failures.
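For what it's worth, one way to observe this from the node while testing is to read each controller's state attribute in sysfs (fabrics controllers report values such as "live" or "connecting"). A minimal Go sketch, assuming the standard /sys/class/nvme layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Each nvme controller shows up as /sys/class/nvme/nvme<N>; fabrics
	// controllers expose a "state" attribute ("live", "connecting", ...).
	ctrls, err := filepath.Glob("/sys/class/nvme/nvme*")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, c := range ctrls {
		b, err := os.ReadFile(filepath.Join(c, "state"))
		if err != nil {
			continue // skip controllers where the attribute cannot be read
		}
		fmt.Printf("%s: %s\n", filepath.Base(c), strings.TrimSpace(string(b)))
	}
}
```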
Hi @majst01: did this change help you?
Still preparing the next OS Image with the changed io_timeout setting here: https://github.com/metal-stack/metal-images/pull/190
But we had this issue once again on one of the production worker nodes, and a manual trigger of a rescan solved the issue.
We still think a controller rescan is a valid solution, at least for older kernels.
@majst01 would it be possible to share the kernel version and dmesg output from the worker node that experienced this? It would help to understand if a rescan that you run can work always or only in certain cases...
Yes, once it happens again (thank god quite seldom) I will tell them to collect these.
OK, regarding io_timeout, I changed that in our new debian-12 images to 300s and tested cutting the network connectivity for longer than this. The volumes got back to normal once the network was reachable again.
see logs: https://github.com/metal-stack/metal-images/pull/190#issuecomment-1611229647
So this is an improvement for a scenario we were afraid of for a long time, thanks @sagi-lb
The original issue here happens seldom on some of the older kernels; there we will fix it manually by calling
find /sys/devices/virtual/nvme-fabrics/ctl/ -name rescan_controller -exec bash -c 'echo 1 > "$0"' {} \;
in case this incident happens again.
Rarely we see this error message in some clusters:
Then it is no longer possible to create and mount new PVCs.
We found that it is possible to recover from this situation without rebooting the affected worker node by running
nvme list
twice. First attempt:
Then the second attempt:
To solve this problem we are thinking of adding an additional daemonset to all of our clusters which executes
at a regular interval.
But we think it would be better to have this running in a goroutine in the lb-csi-node DaemonSet.
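A minimal sketch of what such a periodic goroutine could look like, reusing the rescan_controller write from the workaround quoted earlier in the thread; the function name, the interval, and the wiring into lb-csi-node are all hypothetical:

```go
package main

import (
	"context"
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"time"
)

// rescanLoop periodically writes "1" to every rescan_controller file under the
// nvme-fabrics sysfs tree, mirroring the manual find/echo workaround above.
func rescanLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := filepath.WalkDir("/sys/devices/virtual/nvme-fabrics/ctl", func(path string, d fs.DirEntry, err error) error {
				if err != nil || d.IsDir() || d.Name() != "rescan_controller" {
					return nil // ignore unreadable entries and unrelated files
				}
				if werr := os.WriteFile(path, []byte("1"), 0); werr != nil {
					log.Printf("rescan %s: %v", path, werr)
				}
				return nil
			})
			if err != nil {
				log.Printf("walking nvme-fabrics sysfs tree: %v", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go rescanLoop(ctx, 5*time.Minute) // interval chosen arbitrarily for illustration
	select {}                         // placeholder for the rest of the node service
}
```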
I am happy to provide a PR which implements this.
WDYT?