Open mat813 opened 8 years ago
This seems to happen to very short lived process only, like which
, test
…
Please try this patch against ports-mgmt/poudriere-devel
: https://people.freebsd.org/~bdrewery/patches/poudriere-jexecd-no-reaper-race.diff
The second patch you gave me on irc fixes the problem for me.
Ah, spoke too soon, just had a:
rexec -s /poudriere/data/.m/9amd64-pkgng/ref/../9amd64-pkgng.sock -u root test -f /lib/libexpat.so
hang on me
I'll try to address it in the next few days.
I just hit this on my restore-jexecd
branch where I fixed support for multi-process. Initially I just excluded it from the deps phase which was good. I'm trying to enable it though.
(gdb) bt
#0 _recvfrom () at _recvfrom.S:4
#1 0x000000000020c6da in buf_recv (sock=20, buf=0x7fffffffe308, size=<optimized out>) at external/libnv/msgio.c:464
#2 0x0000000000209ad5 in nvlist_recv (sock=20, flags=0) at external/libnv/contrib/nvlist.c:1306
#3 0x0000000000206a9c in client_read (cl=<optimized out>) at src/libexec/poudriere/jexecd/jexecd.c:114
#4 serve (fd=5) at src/libexec/poudriere/jexecd/jexecd.c:269
#5 0x000000000020671d in main (argc=<optimized out>, argv=<optimized out>) at src/libexec/poudriere/jexecd/jexecd.c:422
(gdb)
jexecd is totally idle, no children, bunch of rexec hanging around.
~/git/poudriere2 # procstat -f 82210
PID COMM FD T V FLAGS REF OFFSET PRO NAME
82210 jexecd text v r r------- - - - /root/git/poudriere/jexecd
82210 jexecd cwd v d r------- - - - /
82210 jexecd root v d r------- - - - /poudriere/data/.m/exp-11amd64-commit-test/ref
82210 jexecd jail v d r------- - - - /poudriere/data/.m/exp-11amd64-commit-test/ref
82210 jexecd 0 v c rwa----- 300 483194820 - /dev/pts/10
82210 jexecd 1 v c rwa----- 300 483194820 - /dev/pts/10
82210 jexecd 2 v c rwa----- 300 483194820 - /dev/pts/10
82210 jexecd 3 v d r----n-- 1 0 - /poudriere/data/.m/exp-11amd64-commit-test/ref/.p/var/run
82210 jexecd 4 v r -w---n-l 1 0 - /poudriere/data/.m/exp-11amd64-commit-test/ref/.p/var/run/exp-11amd64-commit-test.pid
82210 jexecd 5 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 6 k - rw------ 1 0 - -
82210 jexecd 7 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 8 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 9 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 10 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 11 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 12 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 13 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 14 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 15 s - rw------ 2 0 UDS 0 999 exp-11amd64-commit-test.sock
82210 jexecd 16 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 17 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 18 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 19 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 20 s - rw------ 3 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 21 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 22 s - rw------ 2 0 UDS 0 1000 exp-11amd64-commit-test.sock
82210 jexecd 23 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd 24 s - rw------ 2 0 UDS 0 0 exp-11amd64-commit-test.sock
~/git/poudriere2 # procstat -kk 82210
PID TID COMM TDNAME KSTACK
82210 101343 jexecd - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _sleep+0x2c3 sbwait+0x67 soreceive_generic+0x274 soreceive+0x59 kern_recvit+0x1e7 sys_recvfrom+0x86 amd64_syscall+0x2d6 fast_syscall_common+0x101
~/git/poudriere2 # procstat -kka|grep rexec
46194 101580 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _sleep+0x2c3 sbwait+0x67 soreceive_generic+0x274 soreceive+0x59 kern_recvit+0x1e7 sys_recvfrom+0x86 amd64_syscall+0x2d6 fast_syscall_common+0x101
46395 102151 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46594 102158 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46796 102283 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46814 101164 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46822 102113 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46851 102779 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46854 102599 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46870 102286 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46872 102855 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46875 101619 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46885 102956 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46887 101486 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46896 102951 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46908 101887 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46917 102126 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46918 102885 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46925 101256 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46932 102879 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46936 102414 rexec - mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
Looks like jexecd and rexec 46194 are both trying to read from the socket at the same time. Not sure if that's a problem. Not sure what the 999/1000 are up there either.
82210 jexecd 20 s - rw------ 3 0 UDS 0 0 exp-11amd64-commit-test.sock
That's the fd being read from in jexecd. It has the most references as well.
The rexec
pid 46194 from above is also in nvlist_recv
:
#0 _recvfrom () at _recvfrom.S:4
#1 0x000000000020ae4a in buf_recv (sock=4, buf=0x7fffffffe0b8, size=<optimized out>) at external/libnv/msgio.c:464
#2 0x0000000000208245 in nvlist_recv (sock=4, flags=0) at external/libnv/contrib/nvlist.c:1306
#3 0x000000000020550e in main (argc=<optimized out>, argv=<optimized out>) at src/libexec/poudriere/rexec/rexec.c:104
I guess it's just bogus to have multiple rexec
trying to recvfrom
because they might eat a command being sent from another rexec
to jexecd
. All processes involved race to waiting
forever in recvfrom
.
The issue seems to be that the child process exits before it gets added to kevent. So it never gets the exit notice to return to rexec
.
During the package cleaning stage at the start, only one jexecd is used multiple times in parallel, and it seems sometime, one rexec does not get the message saying "hey, it's over, you can die" so it gets stuck, killing it unstucks the build.
This is on a 10.1.