server fails with glib2 2.76.0

eworm-de commented 1 year ago

Have been wondering why my clients fail to boot via PXE, where NBD hangs... After some debugging it turns out the client is fine, but server behaves bad due to an update of glib2 to version 2.76.0.

No idea what change caused this... I will try to dig deeper.

eworm-de commented 1 year ago

Starting nbd-server with -d makes the issue go away...

eworm-de commented 1 year ago

Bisected and found this to be the commit causing the trouble: https://gitlab.gnome.org/GNOME/glib/-/commit/4d2e77a5546cccc10f83d103d6f1527a5c9bcf13

Looks like threading is broken here...

yoe commented 1 year ago

Sigh.

That linked page says:

As such, get rid of all the complicated code that tried to do this better

That's not very promising...

eworm-de commented 1 year ago

I think the complicated code was referencing what was removed in glib, not what we have in nbd. 😜

I am no very familiar with that code, so no idea how to proceed from here. Let me know if I can help to further narrow this.

Sadly the glib2 package was moved to Arch's official repositories before this issue was noticed. So currently the nbd server is non-functional without extra action.

yoe commented 1 year ago

I think the complicated code was referencing what was removed in glib, not what we have in nbd.

Yes, that was clear :-)

I am no very familiar with that code, so no idea how to proceed from here. Let me know if I can help to further narrow this.

One thing you could do is run the "drd" tool of valgrind:

valgrind --tool=drd nbd-server ...

and post the output. Perhaps there's a bug in my code, but I'm not sure; It feels like a bug in glib to me, not in nbd.

Would you be willing to file a bug report against glib? Feel free to refer to this one -- and please also post the link to it in this bug report. I'll follow up there if necessary.

Sadly the glib2 package was moved to Arch's official repositories before this issue was noticed. So currently the nbd server is non-functional without extra action.

Yeah, not great :-(

eworm-de commented 1 year ago

One thing you could do is run the "drd" tool of valgrind:

valgrind --tool=drd nbd-server ...

and post the output. Perhaps there's a bug in my code, but I'm not sure; It feels like a bug in glib to me, not in nbd.

Ran the command as given. As nbd forks to background the command finish, with this output:

==31341== drd, a thread error detector                                                                                                                                                     
==31341== Copyright (C) 2006-2020, and GNU GPL'd, by Bart Van Assche.                                                                                                                      
==31341== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info                                                                                                               
==31341== Command: nbd-server                                                                                                                                                              
==31341==                                                                                                                                                                                  
==31341==                                                                                                                                                                                  
==31341== For lists of detected and suppressed errors, rerun with: -s                                                                                                                      
==31341== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)                                                                                                                   
--31392-- WARNING: unhandled amd64-linux syscall: 441                                                                                                                        
--31392-- You may be able to write your own handler.                                                                                                                                       
--31392-- Read the file README_MISSING_SYSCALL_OR_IOCTL.                                                                                                                                   
--31392-- Nevertheless we consider this a bug.  Please report                                                                                                                              
--31392-- it at http://valgrind.org/support/bug_reports.html.

The forked process was still running in background, so I connected to nbd and saw the bad behavior. Then kill the processes:

==31563==                                                                                    
==31563== Process terminating with default action of signal 15 (SIGTERM)
==31563==    at 0x4D060DB: syscall (syscall.S:37)
==31563==    by 0x496D7B4: g_cond_wait (gthread-posix.c:1475)
==31563==    by 0x49736B3: g_thread_pool_start_thread.constprop.0 (gthreadpool.c:481)
==31563==    by 0x4949682: g_thread_pool_push (gthreadpool.c:711)
==31563==    by 0x1140A9: mainloop_threaded.isra.0 (nbd-server.c:2925)
==31563==    by 0x10D4DC: UnknownInlinedFun (nbd-server.c:3062)
==31563==    by 0x10D4DC: UnknownInlinedFun (nbd-server.c:3295)
==31563==    by 0x10D4DC: main (nbd-server.c:3740)
==31563== 
==31563== For lists of detected and suppressed errors, rerun with: -s
==31563== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 14 from 6)
==31392== The object at address 0x4fcd020 is not a mutex.
==31392==    at 0x4853DF7: pthread_mutex_destroy_intercept (drd_pthread_intercepts.c:912)
==31392==    by 0x4853DF7: pthread_mutex_destroy@* (drd_pthread_intercepts.c:922)
==31392==    by 0x4ECD0DC: ??? (in /usr/lib/libp11-kit.so.0.3.0)
==31392==    by 0x40010E1: _dl_call_fini (dl-call_fini.c:43)
==31392==    by 0x4004F4D: _dl_fini (dl-fini.c:114)
==31392==    by 0x4C41444: __run_exit_handlers (exit.c:108)
==31392==    by 0x4C415AF: exit (exit.c:138)
==31392==    by 0x10C51B: UnknownInlinedFun (nbd-server.c:3227)
==31392==    by 0x10C51B: main (nbd-server.c:3740)
==31392== 
==31392== The object at address 0x4fcd060 is not a mutex.
==31392==    at 0x4853DF7: pthread_mutex_destroy_intercept (drd_pthread_intercepts.c:912)
==31392==    by 0x4853DF7: pthread_mutex_destroy@* (drd_pthread_intercepts.c:922)
==31392==    by 0x4ECD0E9: ??? (in /usr/lib/libp11-kit.so.0.3.0)
==31392==    by 0x40010E1: _dl_call_fini (dl-call_fini.c:43)
==31392==    by 0x4004F4D: _dl_fini (dl-fini.c:114)
==31392==    by 0x4C41444: __run_exit_handlers (exit.c:108)
==31392==    by 0x4C415AF: exit (exit.c:138)
==31392==    by 0x10C51B: UnknownInlinedFun (nbd-server.c:3227)
==31392==    by 0x10C51B: main (nbd-server.c:3740)
==31392== 
==31392== 
==31392== For lists of detected and suppressed errors, rerun with: -s
==31392== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 14 from 6)

So looks like valgrind finds errors, but no idea what to blame.

Would you be willing to file a bug report against glib? Feel free to refer to this one -- and please also post the link to it in this bug report. I'll follow up there if necessary.

Will do next week with some spare time if we do not solve it here.

eworm-de commented 1 year ago

Building with --without-gnutls gives the same behavior, but no errors from valgrind. 🤔

eworm-de commented 1 year ago

Just opened an issue for glib...

Rijndael1998 commented 1 year ago

My issue is similar, but not quite the same. My server starts (I think) and I can connect from a client. The drives show up with their sizes but when reading (cat /dev/nbd0 or dd if=/dev/nbd0), the read hangs indefinitely. I'm not going to try writing to drives for obvious reasons.

I'm not sure what could be causing this and I'm not sure how I can troubleshoot, but I'll be happy to help if I can.

I do know that glib version is 2.76.0-1 on Arch Linux.

eworm-de commented 1 year ago

Sounds exactly the same.

Rijndael1998 commented 1 year ago

Downgrading to glib 2.74.6-1 on Arch Linux fixes this server issue...

eworm-de commented 1 year ago

We have a reaction from glib upstream asking for more information... But I am not that familiar with the nbd code. @yoe, can you comment the discussion and add some insight?

gavinhungry commented 1 year ago

Also seeing this with glib2 2.76.1 on Arch Linux.

As an imperfect alternative to downgrading glib2, you can also override the nbd.service unit (if using systemd) to disable forking with -d, as pointed out by @eworm-de in https://github.com/NetworkBlockDevice/nbd/issues/146#issuecomment-1470346205:

$ sudo systemctl edit nbd.service

# /etc/systemd/system/nbd.service.d/override.conf

[Service]
Type=simple
ExecStart=
ExecStart=/usr/bin/nbd-server -d
Restart=always

$ sudo systemctl daemon-reload
$ sudo systemctl restart nbd.service

hmmsjan commented 1 year ago

Here on the new Fedora 38 the same problem. I tried some debugging. It looks like a thread pool does not survive a fork, even in a small test program. After moving "g_thread_pool_new" into "spawn_child", the server works, but I do not know whether, apart from undefined genconf.threads, new problems will be introduced

The test program printing some output via g_thread_pool_push runs in Debian 11, but freezes in Fedora 38 if the pool is created before the fork.

Digging a bit further with the example program: Just behind g_thread_pool_new, a second thread is created:

Id Target Id Frame

1 Thread 0x7ffff7ca6740 (LWP 5841) "ex1" main () at ex1.cpp:26 2 Thread 0x7ffff78206c0 (LWP 5844) "pool-spawner" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

If I understand correctly, threads are not duplicated during fork, so probably having a g_thread_pool_new before and g_thread_pool_push after a fork is a recipe for waiting forever. I assume some rearrangements in nbd-server are required....

yoe commented 1 year ago

Okay, thanks for that.

I don't think we do anything important with the thread pool before the fork, so moving it to be called after the fork should be doable. I'll look at that and publish a new version ASAP.

hmmsjan commented 1 year ago

I made a littile patch for Fedora, leaving the pool_new on the same place if nofork and doing the pool_new after the fork if !nofork. Plus extra parameter to get the number of threads in that function. It compiled and performed all tests. And I could play a DVD on nbd0 and swap on nbd1, so this looks promising.

eworm-de commented 1 year ago

Can you share your patch and/or open a pull request?

eworm-de commented 1 year ago

Thanks a lot!

hmmsjan commented 1 year ago

Sorry for the delay. I see the problem has been solved in meantime. This latest version has some more differences with the Fedora provided version 3.24. Based on the Fedora version 3.24, only move of the g_thread_pool_new call to the location as in the latest version breaks the inetd test. The patch below passes all tests during build of Fedora 38 rpm.

--- nbd-3.24/nbd-server.c.orig  2023-04-27 22:53:38.526801760 +0200
+++ nbd-3.24/nbd-server.c       2023-04-27 22:52:22.954792077 +0200
@@ -3018,8 +3018,8 @@
                         return;
                 }
                 /* Child just continues. */
+               tpool = g_thread_pool_new(handle_request, NULL, genconf->threads, FALSE, NULL);
         }
-
         sock_flags_old = fcntl(net, F_GETFL, 0);
         if (sock_flags_old == -1) {
                 msg(LOG_ERR, "Failed to get socket flags");
@@ -3704,7 +3704,8 @@
 #if HAVE_OLD_GLIB
        g_thread_init(NULL);
 #endif
-       tpool = g_thread_pool_new(handle_request, NULL, genconf.threads, FALSE, NULL);
+       if(dontfork)
+               tpool = g_thread_pool_new(handle_request, NULL, genconf.threads, FALSE, NULL);

        setup_servers(servers, genconf.modernaddr, genconf.modernport,
                        genconf.unixsock, genconf.flags);

cheese commented 1 year ago

Any chance we have a new release?

yoe commented 1 year ago

Any chance we have a new release?

Yes, it's high on my todo list now.

However, before that can happen, I have a few loose ends to finish up; the structured replies patch broke the data logging in the transaction log, and on top of that nbd-trdump also needs to be taught about structured replies. I also don't feel confident enough that the patch which I applied will work sufficiently well; I'd like to do some more in-depth testing.

I'm hoping to be able to push that out sometime next week. Of course, if you feel up to it, help with patches or extra testing is always welcome 😉

yoe commented 1 year ago

I'm hoping to be able to push that out sometime next week. Of course, if you feel up to it, help with patches or extra testing is always welcome wink

Scratch that, I found time today. Release should be upcoming soon.

NetworkBlockDevice / nbd

server fails with glib2 2.76.0 #146