chaos / diod

Distributed I/O Daemon - a 9P file server
GNU General Public License v2.0

make check hangs in tests/kern #61

Open garlick opened 4 years ago

garlick commented 4 years ago

Running make check as root in tests/kern hangs at test t05. My kernel is 5.4.0-7634-generic (Ubuntu 20.04 LTS).

This may be a dup of #23, which was filed against linux-next in 2015, but I wanted to open a new bug until that is confirmed.

garlick commented 3 years ago

Focusing on t05, which is the first failing test: the hang also occurs on kernel 5.10.63.

The test script only runs /bin/true, which succeeds as expected.

Test output t05.out:

kconjoin: diodmount exited with rc=0
kconjoin: t05 exited with rc=0

and the diod log t05.diod contains:

diod: P9_TVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_RVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_TAUTH tag 0 afid 0 uname '' aname '/tmp/tmp.vBZM9TQ04J' n_uname 0
diod: P9_RLERROR tag 0 ecode 2
diod: P9_TATTACH tag 0 fid 0 afid -1 uname '' aname '/tmp/tmp.vBZM9TQ04J' n_uname 0
diod: P9_RATTACH tag 0 qid (000000000001fcac 0 'd')
diod: P9_TCLUNK tag 0 fid 0
diod: P9_RCLUNK tag 0
diod: P9_TVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_RVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_TATTACH tag 0 fid 0 afid -1 uname 'root' aname '/tmp/tmp.vBZM9TQ04J' n_uname P9_NONUNAME
diod: P9_RATTACH tag 0 qid (000000000001fcac 0 'd')
diod: P9_TGETATTR tag 0 fid 0 request_mask 0x7ff
diod: P9_RGETATTR tag 0 valid 0x7ff qid (000000000001fcac 0 'd') mode 040755 uid 0 gid 0 nlink 2 rdev 0 size 4096 blksize 4096 blocks 8 atime Tue Oct 26 16:52:16 2021 mtime Tue Oct 26 16:52:16 2021 ctime Tue Oct 26 16:52:16 2021 btime X gen X data_version X
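
One way to read this log is to track which fids were attached and never clunked. The sketch below is only an illustration under stated assumptions: the function name count_unclunked_fids and the MAX_FID limit are hypothetical, it only looks at TATTACH and TCLUNK (no walked or opened fids), and it treats each TVERSION as a session reset that frees all fids, as the 9P protocol specifies.

```c
#include <stdio.h>
#include <string.h>

#define MAX_FID 64   /* hypothetical limit, enough for this small log */

/* Scan a diod debug log (newline-separated) and return the number of
 * fids that were attached but never clunked in the last session.
 */
int count_unclunked_fids(const char *log)
{
    unsigned char open[MAX_FID] = {0};
    char buf[1024];
    int fid, i, n = 0;

    while (*log) {
        size_t len = strcspn(log, "\n");
        size_t copy = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;
        char *p;

        /* Work on one line at a time so strstr() cannot cross lines. */
        memcpy(buf, log, copy);
        buf[copy] = '\0';

        if (strstr(buf, "P9_TVERSION")) {
            memset(open, 0, sizeof(open));   /* new session: all fids freed */
        } else if ((p = strstr(buf, "P9_TATTACH"))
                && sscanf(p, "P9_TATTACH tag %*d fid %d", &fid) == 1
                && fid >= 0 && fid < MAX_FID) {
            open[fid] = 1;
        } else if ((p = strstr(buf, "P9_TCLUNK"))
                && sscanf(p, "P9_TCLUNK tag %*d fid %d", &fid) == 1
                && fid >= 0 && fid < MAX_FID) {
            open[fid] = 0;
        }

        log += len;
        if (*log == '\n')
            log++;
    }
    for (i = 0; i < MAX_FID; i++)
        n += open[i];
    return n;
}
```

Fed the log above, the first session (the mount helper's attach and clunk) balances out, while the second session ends with fid 0 attached and no matching TCLUNK.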

gdb says kconjoin is stuck here:

(gdb) bt
#0  0xb6ddb2a8 in __GI___waitpid (pid=pid@entry=18484, 
    stat_loc=stat_loc@entry=0xbee1d218, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0xb6d744ec in do_system (
    line=line@entry=0xbee1d652 "../../diod/diod  -r80 -w81 -c /dev/null -n -d 1 -L t05.diod -e /tmp/tmp.vBZM9TQ04J") at ../sysdeps/posix/system.c:149
#2  0xb6d749c4 in __libc_system (
    line=line@entry=0xbee1d652 "../../diod/diod  -r80 -w81 -c /dev/null -n -d 1 -L t05.diod -e /tmp/tmp.vBZM9TQ04J") at ../sysdeps/posix/system.c:185
#3  0x00010a40 in main (argc=<optimized out>, argv=<optimized out>)
    at kconjoin.c:133

and diod is stuck here:

(gdb) bt
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x1f52ffc)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x0, cond=0x1f52fd0)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x1f52fd0, mutex=0x0) at pthread_cond_wait.c:655
#3  0x00028bec in np_srv_wait_conncount (srv=0x1f52f18, count=count@entry=1)
    at srv.c:141
#4  0x00012dbc in _service_run (wfdno=-1227925284, rfdno=<optimized out>, 
    mode=SRV_FILEDES) at diod.c:666
#5  main (argc=<optimized out>, argv=<optimized out>) at diod.c:257

which is this function:

/* Block the caller until the server has no active connections,
 * and there have been at least 'count' connections historically.
 */
void
np_srv_wait_conncount(Npsrv *srv, int count)
{
        xpthread_mutex_lock(&srv->lock);
        while (srv->conncount > 0 || srv->connhistory < count) {
                xpthread_cond_wait(&srv->conncountcond, &srv->lock);
        }
        xpthread_mutex_unlock(&srv->lock);
}

The connection count is 1:

(gdb) frame 3
#3  0x00028bec in np_srv_wait_conncount (srv=0x1f52f18, count=count@entry=1)
    at srv.c:141
141         xpthread_cond_wait(&srv->conncountcond, &srv->lock);
(gdb) p srv->conncount
$1 = 1

So the kernel does not release (clunk) the mount when the test program completes, and diod keeps waiting for the connection count to drop to zero.
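
To make the wait condition concrete, here is a minimal standalone sketch of the same pattern. It is not the diod source: struct conn_tracker and the tracker_* functions are hypothetical stand-ins for the Npsrv fields and for whatever code updates them, and plain pthread calls replace the xpthread wrappers. Only the loop predicate is taken from np_srv_wait_conncount() above.

```c
#include <pthread.h>

/* Hypothetical stand-in for the Npsrv fields used by the wait loop. */
struct conn_tracker {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int conncount;     /* connections currently open */
    int connhistory;   /* connections ever opened */
};

#define CONN_TRACKER_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* Same predicate as the while loop in np_srv_wait_conncount(). */
static int must_wait(const struct conn_tracker *t, int count)
{
    return t->conncount > 0 || t->connhistory < count;
}

/* Assumed bookkeeping when a connection is created. */
void tracker_conn_open(struct conn_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->conncount++;
    t->connhistory++;
    pthread_mutex_unlock(&t->lock);
}

/* Assumed bookkeeping when a connection goes away; wakes any waiter. */
void tracker_conn_close(struct conn_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->conncount--;
    pthread_cond_broadcast(&t->cond);
    pthread_mutex_unlock(&t->lock);
}

/* Block until no connection is open and at least 'count' have existed. */
void tracker_wait(struct conn_tracker *t, int count)
{
    pthread_mutex_lock(&t->lock);
    while (must_wait(t, count))
        pthread_cond_wait(&t->cond, &t->lock);
    pthread_mutex_unlock(&t->lock);
}
```

With conncount at 1 and connhistory at 1, as in the gdb session, must_wait() stays true for count=1, so the waiter can only proceed once something closes the connection. In this test that would be the kernel unmounting the 9p filesystem.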

garlick commented 3 years ago

The private namespace established with CLONE_NEWNS appears to be leaking, since it is visible to all in /proc/mounts:

$ cat /proc/mounts|grep 9p
nohost:/tmp/tmp.YRvf1AVI4r /tmp/tmp.kbqx8vsreA 9p rw,sync,dirsync,relatime,debug=1,uname=root,aname=/tmp/tmp.YRvf1AVI4r,access=user,msize=65536,trans=fd,rfd=80,wfd=81 0 0

Running sudo umount /tmp/tmp.kbqx8vsreA allows the test to complete successfully.
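
The grep above could also be done programmatically, for example by a cleanup step in the test harness. The following is a rough sketch, not part of the diod test suite: parse_9p_mount is a hypothetical helper that takes one line in /proc/mounts format and reports the mount point if the filesystem type is 9p. getmntent(3) would be the more usual way to iterate over the file itself.

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/mounts line ("source target fstype options dump pass").
 * If the filesystem type is 9p, copy the mount point into 'target' and
 * return 1; otherwise return 0.
 */
int parse_9p_mount(const char *line, char *target, size_t targetlen)
{
    char src[4096], dir[4096], type[64];

    if (sscanf(line, "%4095s %4095s %63s", src, dir, type) != 3)
        return 0;
    if (strcmp(type, "9p") != 0)
        return 0;
    snprintf(target, targetlen, "%s", dir);
    return 1;
}
```

For the line shown above this yields /tmp/tmp.kbqx8vsreA, which is the path that had to be unmounted by hand.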

garlick commented 3 years ago

This seems to resolve the issue.

diff --git a/tests/kern/kconjoin.c b/tests/kern/kconjoin.c
index 83f08b0..4e32342 100644
--- a/tests/kern/kconjoin.c
+++ b/tests/kern/kconjoin.c
@@ -114,6 +114,13 @@ main (int argc, char *argv[])
             _movefd (fromsrv[0], RFDNO);
             if (unshare (CLONE_NEWNS) < 0)
                 err_exit ("unshare");
+            /* Change root propagation to private within this namespace,
+             * as systemd may have mounted root with it set to shared,
+             * and then the 9p mount will leak into the main namespace and
+             * not be automatically unmounted when the test completes.
+             */
+            system ("mount --make-private /");
+
             if ((cs = system (mntcmd)) == -1)
                 err_exit ("failed to run %s", _cmd (mntcmd));
             if (_interpret_status (cs, _cmd (mntcmd)))
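
Whether a given mount point is shared can be checked in /proc/self/mountinfo, where shared mounts carry a shared:N tag in the optional fields before the " - " separator. The sketch below is an illustration only: mountinfo_is_shared is a hypothetical helper, not something in kconjoin.c, and the sample lines in the note that follows are made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Inspect one line of /proc/self/mountinfo.
 * Returns 1 if the line describes mount point 'mnt' and has a shared:N tag,
 * 0 if it describes 'mnt' without that tag,
 * -1 if the line is malformed or is for a different mount point.
 */
int mountinfo_is_shared(const char *line, const char *mnt)
{
    char mntpoint[4096];
    const char *rest, *sep, *tag;
    int consumed = 0;

    /* Fields 1-6: mount ID, parent ID, major:minor, root,
     * mount point, mount options. %n records where they end.
     */
    if (sscanf(line, "%*d %*d %*s %*s %4095s %*s%n",
               mntpoint, &consumed) != 1 || consumed == 0)
        return -1;
    if (strcmp(mntpoint, mnt) != 0)
        return -1;

    /* Optional fields sit between the options and the " - " separator. */
    rest = line + consumed;
    sep = strstr(rest, " - ");
    if (sep == NULL)
        return -1;
    tag = strstr(rest, " shared:");
    return tag != NULL && tag < sep;
}
```

A line such as "25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw" would be reported as shared, and the same line without shared:1 would not. Running this check for / inside the new namespace, before and after the mount --make-private call, is one way to confirm that the propagation change took effect.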

garlick commented 3 months ago

There are still some cleanup issues with that fix applied. After running the test I get:

$ df
df: /tmp/tmp.vExWhjHnlw: Input/output error
df: /tmp/tmp.xsXEoaK3x6: Input/output error
df: /tmp/tmp.BUHhy3kS6C: Input/output error
df: /tmp/tmp.EVMsv8hhgf: Input/output error
df: /tmp/tmp.orlCjxATgT: Input/output error
df: /tmp/tmp.F8bpFfCXZL: Input/output error
df: /tmp/tmp.REyd5kqrJh: Input/output error
df: /tmp/tmp.wF4OUnufwA: Input/output error
df: /tmp/tmp.cMaEsks0sl: Input/output error
df: /tmp/tmp.AaFdxFdXw2: Input/output error
df: /tmp/tmp.Hx98fIqnQH: Input/output error
df: /tmp/tmp.rTdvkUeHyg: Input/output error
df: /tmp/tmp.SGGAtLbvHa: Input/output error
df: /tmp/tmp.iii51W8iiM: Input/output error
df: /tmp/tmp.8NZQKsmVQj: Input/output error
df: /tmp/tmp.FavfPS4Ffx: Input/output error
df: /tmp/tmp.ouYPdkmvU5: Input/output error
df: /tmp/tmp.LLXGbJAnaN: Input/output error