Closed alexeicolin closed 6 years ago
Which kernel are you using? Can you repeat that with the latest kernel?
4.13.7 - master doesn't clear the bit either, last time I looked a few days ago.
So, my patch to clear NBD_BOUND did have an effect but did not solve the issue. With the patch, I no longer get the error about 'being setup by another task' after issuing the DISCONNECT and CLEAR_SOCK ioctls, but /dev/nbd0 is still not reconnectable. I can connect to /dev/nbd1.
Btw, the same symptom happened when the /dev/sda device on the server threw a SATA exception (bad HW). The client couldn't reconnect, same as above.
So, some error handling is not fully correct, maybe? Besides the NBD_BOUND never getting cleared.
Was this fixed with #61, or is this a separate issue?
I fixed a few of these issues a couple of months ago, could you try a recent 4.18 kernel and tell me if the problem is still happening?
I did a test similar to the original situation: pull the drive out of the server to get a SATA failure. The client gets an I/O error. Then restart nbd-server, stop the nbd client, and start it again. The client reconnected on the same nbd# successfully. The issue was entirely on the client side, as far as I remember. So it appears to be fixed. Thanks.
Tested nbd-server 3.16.2 on kernel 4.18.1, and nbd-client 3.17 on kernel 4.18.3.
After my system lost network connectivity (due to a kernel bug; the workaround is ip link set down/up), nbd-client won't reconnect.
Kernel log:
I wrote a C utility to send the DISCONNECT and CLEAR_SOCK ioctls, and I see log statements from nbd.c in the kernel log confirming the commands. But the above error condition still triggers.
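For anyone who wants to reproduce this, here is a minimal sketch of what such a utility might look like. It is not the original program: the helper name nbd_reset is hypothetical, and it assumes the NBD_DISCONNECT and NBD_CLEAR_SOCK request codes from <linux/nbd.h> and a device node such as /dev/nbd0. It usually needs root.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nbd.h>   /* NBD_DISCONNECT, NBD_CLEAR_SOCK */

/*
 * Hypothetical helper (name is an assumption, not from the thread):
 * open an NBD device node and issue NBD_DISCONNECT followed by
 * NBD_CLEAR_SOCK. Returns 0 if both ioctls succeed, -1 otherwise.
 */
int nbd_reset(const char *dev_path)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0) {
        fprintf(stderr, "open %s: %s\n", dev_path, strerror(errno));
        return -1;
    }

    int rc = 0;

    /* Ask the kernel to send a disconnect request on the device. */
    if (ioctl(fd, NBD_DISCONNECT) < 0) {
        fprintf(stderr, "NBD_DISCONNECT on %s: %s\n", dev_path, strerror(errno));
        rc = -1;
    }

    /* Attempt CLEAR_SOCK even if DISCONNECT failed, so the socket is dropped. */
    if (ioctl(fd, NBD_CLEAR_SOCK) < 0) {
        fprintf(stderr, "NBD_CLEAR_SOCK on %s: %s\n", dev_path, strerror(errno));
        rc = -1;
    }

    close(fd);
    return rc;
}
```

Calling nbd_reset("/dev/nbd0") from a small main and then checking dmesg should show whether the kernel accepted both requests. Both ioctls are attempted regardless of the first one's result, since the point here is to force the device back to an unconfigured state.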
I see that in the code NBD_BOUND is set, but is never reset. Is that by design? Should it be reset in the disconnect handler where task is set to NULL? PS. FWIW, I added the clear of this bit there, and am running the modified kernel. If the network failure occurs again, I'll see if the patch changes anything.