NetworkBlockDevice / nbd

Network Block Device
GNU General Public License v2.0

'block nbd0: Device being setup by another task' even after DISCONNECT and CLEAR_SOCK #60

alexeicolin closed this issue 6 years ago

alexeicolin commented 6 years ago

After my system lost network connectivity (due to a kernel bug; the workaround is ip link set down/up), nbd-client won't reconnect.

nbd-client[6704]: Negotiation: ..size = 76300MB
nbd-client[6704]: bs=512, sz=80006348800 bytes
nbd_client[6704]: Kernel doesn't support multiple connections
nbd_client[6704]: Exiting.
nbd-client[6704]: Error: Kernel doesn't support multiple connections
nbd-client[6704]: Exiting.

Kernel log:

block nbd0: Device being setup by another task

I wrote a C utility to send the DISCONNECT and CLEAR_SOCK ioctls, and I see log statements from nbd.c in the kernel log confirming that the commands were received. But the above error condition still triggers.
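For readers who want to reproduce this, a minimal sketch of such a utility might look like the following. It is not the original code from this report. The function name nbd_force_disconnect is made up for illustration, and the sketch assumes the kernel's <linux/nbd.h> header is installed. It issues NBD_DISCONNECT followed by NBD_CLEAR_SOCK on the given device node and reports whether both calls succeeded.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nbd.h>   /* NBD_DISCONNECT, NBD_CLEAR_SOCK */

/* Hypothetical helper (name chosen for this example only).
 * Asks the kernel to disconnect the NBD device at dev_path
 * (e.g. "/dev/nbd0") and then to release its socket.
 * Returns 0 if both ioctls succeeded, -1 otherwise. */
int nbd_force_disconnect(const char *dev_path)
{
    int rc = 0;
    int fd;

    assert(dev_path != NULL);

    fd = open(dev_path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Request a disconnect from the server side of the connection. */
    if (ioctl(fd, NBD_DISCONNECT) < 0) {
        perror("ioctl(NBD_DISCONNECT)");
        rc = -1;
    }

    /* Attempt to drop the socket even if the disconnect failed. */
    if (ioctl(fd, NBD_CLEAR_SOCK) < 0) {
        perror("ioctl(NBD_CLEAR_SOCK)");
        rc = -1;
    }

    close(fd);
    return rc;
}
```

Calling nbd_force_disconnect("/dev/nbd0") from a small main() as root should produce the corresponding nbd.c messages in the kernel log. On a path that is not an NBD device the ioctls fail and the function returns -1.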

I see that in the code NBD_BOUND is set, but is never reset. Is that by design? Should it be reset in the disconnect handler where task is set to NULL? PS. FWIW, I added code to clear this bit there, and am running the modified kernel. If the network failure occurs again, I'll see if the patch changes anything.

yoe commented 6 years ago

Which kernel are you using? Can you repeat that with the latest kernel?

alexeicolin commented 6 years ago

4.13.7. Master doesn't clear the bit either, as of a few days ago when I last looked.

So, my patch to clear NBD_BOUND did have an effect but did not solve the issue. With the patch, I no longer get the error about 'being setup by another task' after issuing the ioctls to DISCONNECT and CLEAR_SOCK, but /dev/nbd0 is still not reconnectable. I can connect to /dev/nbd1.
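One way to see whether the kernel still considers a device connected is to look for the pid attribute that the nbd driver exposes under /sys/block/<device>/ while a connection is active. The sketch below is only an illustration of that check: the function name nbd_owner_pid is hypothetical, and it assumes sysfs is mounted at /sys.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (name chosen for this example only).
 * Reads /sys/block/<name>/pid, where <name> is e.g. "nbd0".
 * Returns the PID stored there, 0 if the attribute does not exist
 * (the device is not connected), or -1 if the name is too long or
 * the file content cannot be parsed. */
long nbd_owner_pid(const char *name)
{
    char path[128];
    long pid = 0;
    int n;
    FILE *f;

    assert(name != NULL);

    n = snprintf(path, sizeof path, "/sys/block/%s/pid", name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return 0;   /* no pid attribute: treated as not connected */

    if (fscanf(f, "%ld", &pid) != 1)
        pid = -1;

    fclose(f);
    return pid;
}
```

A positive result for "nbd0" after the DISCONNECT and CLEAR_SOCK ioctls would suggest the kernel has not torn the device down, which could help narrow down why a reconnect is refused.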

Btw, the same symptom happened when the /dev/sda device on the server threw a SATA exception (bad HW). The client couldn't reconnect, as above.

So, some error handling is not fully correct, maybe? Besides the NBD_BOUND never getting cleared.

yoe commented 6 years ago

Was this fixed with #61, or is this a separate issue?

josefbacik commented 6 years ago

I fixed a few of these issues a couple of months ago, could you try a recent 4.18 kernel and tell me if the problem is still happening?

alexeicolin commented 6 years ago

I did a test similar to the original situation: pull the drive out of the server to get a SATA failure. The client gets an I/O error. Then restart the nbd-server, stop the nbd client, and start the nbd client again. The client reconnected on the same nbd# successfully. The issue was entirely on the client side, as far as I remember. So, it appears to be fixed. Thanks.

Tested nbd-server 3.16.2 on kernel 4.18.1, and nbd-client 3.17 on kernel 4.18.3.