NetworkBlockDevice / nbd

Network Block Device
GNU General Public License v2.0
450 stars 116 forks source link

Server uses %m for errors when read(2) returns 0 (EOF) and does not set errno, resulting in semi-random error message #83

Closed rwmjones closed 6 years ago

rwmjones commented 6 years ago

About 25% of the time when I connect to nbd-server from the nbd kernel client running on another host, nbd-server dies at connection time with:

Error: Connection dropped: Inappropriate ioctl for device

I collected a stack trace which is not very interesting, it just says that one of the underlying system calls is failing with ENOTTY.

Thread 1 (Thread 0x7ffff62a81c0 (LWP 6042)):
#0  __GI_exit (status=status@entry=1) at exit.c:138
#1  0x000000000040c163 in err (s=s@entry=0x40d0e7 "Connection dropped: %m")
    at cliserv.c:78
#2  0x00000000004055cf in socket_closed_transmission (client=0x43b0f0)
    at nbd-server.c:452
#3  0x000000000040a49c in mainloop_threaded (client=0x43b0f0)
    at nbd-server.c:2753
#4  handle_modern_connection (genconf=0x7fffffffdc00, sock=<optimized out>, 
    servers=0x43ce60) at nbd-server.c:2925
#5  serveloop (servers=servers@entry=0x434660, 
    genconf=genconf@entry=0x7fffffffdc00) at nbd-server.c:3158
#6  0x0000000000404cb2 in main (argc=<optimized out>, argv=<optimized out>)
    at nbd-server.c:3600

I'm just serving a regular local file:

$ ./nbd-server -d 10809 /var/tmp/fedora-27.img
$ ls -l /var/tmp/fedora-27.img 
-rw-r--r--. 1 rjones rjones 6442450944 Sep  1 22:09 /var/tmp/fedora-27.img

The server is current nbd from git, running as non-root on a recent Linux kernel.

The client is Linux kernel nbd module 4.18.5-300.fc29.x86_64 with nbd-client 3.17.

rwmjones commented 6 years ago

errno might be a red herring. Here's the strace output:

openat(AT_FDCWD, "/var/tmp/fedora-27.img", O_RDWR) = 6
ioctl(6, BLKGETSIZE64, 0x7ffc239931f8)  = -1 ENOTTY (Inappropriate ioctl for device)
fstat(6, {st_mode=S_IFREG|0644, st_size=6442450944, ...}) = 0

The exit happens later on.

It seems to fail somewhere during option negotiation, but instead of printing why, it exits and prints the random value of errno from this much earlier (non-)error.

rwmjones commented 6 years ago

This strace seems to make no sense, but here is the tail end where the failure happens:

write(2, "** \33[1;32mMessage\33[0m: \33[34m16:5"..., 64** Message: 16:57:43.734: Starting to serve
) = 64
read(5, "%`\225\23\0\0\0\0\34\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0", 28) = 28
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f84ddb88000
mprotect(0x7f84ddb89000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(strace: Process 6668 attached
child_stack=0x7f84de387fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f84de3889d0, tls=0x7f84de388700, child_tidptr=0x7f84de3889d0) = 6668
[pid  6668] set_robust_list(0x7f84de3889e0, 24 <unfinished ...>
[pid  6642] read(5,  <unfinished ...>
[pid  6668] <... set_robust_list resumed> ) = 0
[pid  6668] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f84d5b88000
[pid  6668] munmap(0x7f84d5b88000, 38240256) = 0
[pid  6668] munmap(0x7f84dc000000, 28868608) = 0
[pid  6668] mprotect(0x7f84d8000000, 135168, PROT_READ|PROT_WRITE) = 0
[pid  6668] prctl(PR_SET_NAME, "pool")  = 0
[pid  6668] pread64(6, "\353c\220\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
[pid  6668] write(5, "gDf\230\0\0\0\0\34\0\0\0\1\0\0\0", 16) = 16
[pid  6668] write(5, "\353c\220\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid  6668] futex(0x1c9d5b0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=499998761} <unfinished ...>
[pid  6642] <... read resumed> "%`\225\23\0\0\0\0\34\0\0\0\2\0\0\0\0\0\0\0\0\0\20\0\0\0\20\0", 28) = 28
[pid  6642] futex(0x1c9d5b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6668] <... futex resumed> )       = 0
[pid  6642] read(5,  <unfinished ...>
[pid  6668] pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4096) = 4096
[pid  6668] write(5, "gDf\230\0\0\0\0\34\0\0\0\2\0\0\0", 16) = 16
[pid  6668] write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid  6668] futex(0x1c9d5b0, FUTEX_WAIT_PRIVATE, 1, {tv_sec=0, tv_nsec=499998805} <unfinished ...>
[pid  6642] <... read resumed> "%`\225\23\0\0\0\0\34\0\0\0\3\0\0\0\0\0\0\1\177\377\0\0\0\0\20\0", 28) = 28
[pid  6642] futex(0x1c9d5b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6668] <... futex resumed> )       = 0
[pid  6642] read(5,  <unfinished ...>
[pid  6668] pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6442385408) = 4096
[pid  6668] write(5, "gDf\230\0\0\0\0\34\0\0\0\3\0\0\0", 16) = 16
[pid  6668] write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid  6668] futex(0x1c9d5b0, FUTEX_WAIT_PRIVATE, 2, {tv_sec=0, tv_nsec=499998907} <unfinished ...>
[pid  6642] <... read resumed> "%`\225\23\0\0\0\0\34\0\0\0\4\0\0\0\0\0\0\1\177\377\340\0\0\0\20\0", 28) = 28
[pid  6642] futex(0x1c9d5b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6668] <... futex resumed> )       = 0
[pid  6642] read(5,  <unfinished ...>
[pid  6668] pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6442442752) = 4096
[pid  6668] write(5, "gDf\230\0\0\0\0\34\0\0\0\4\0\0\0", 16) = 16
[pid  6668] write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid  6668] futex(0x1c9d5b0, FUTEX_WAIT_PRIVATE, 3, {tv_sec=0, tv_nsec=499998621} <unfinished ...>
[pid  6642] <... read resumed> "", 28)  = 0
[pid  6642] futex(0x1c9d5b0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6668] <... futex resumed> )       = 0
[pid  6642] write(2, "Error: Connection dropped: Inapp"..., 58Error: Connection dropped: Inappropriate ioctl for device
 <unfinished ...>
[pid  6668] futex(0x1c9d090, FUTEX_WAIT_PRIVATE, 0, {tv_sec=14, tv_nsec=999998934} <unfinished ...>
[pid  6642] <... write resumed> )       = 58
[pid  6642] write(2, "Exiting.\n", 9Exiting.
)   = 9
[pid  6642] exit_group(1)               = ?
[pid  6668] <... futex resumed>)        = ?
[pid  6668] +++ exited with 1 +++
+++ exited with 1 +++
rwmjones commented 6 years ago

Oh I see, the read(2) call returns 0 (EOF) so it must be the client which is closing the connection for some reason. On the client side I see:

block nbd0: Connection timed out
[   54.756467] block nbd0: shutting down sockets
[   54.759558] print_req_error: I/O error, dev nbd0, sector 12582896
[   54.763358] print_req_error: I/O error, dev nbd0, sector 12582896
[   54.767949] Buffer I/O error on dev nbd0, logical block 1572862, async page read

Note there is still a server bug here: The server shouldn't use %m in the EOF case because in that case read(2) does not set errno.

rwmjones commented 6 years ago

The client error is probably https://github.com/NetworkBlockDevice/nbd/issues/71. I will try with a newer nbd-client. I've changed the bug summary to reflect the server-side bug.