freebsd / poudriere

Port/Package build and test system
https://github.com/freebsd/poudriere/wiki
BSD 2-Clause "Simplified" License
393 stars 162 forks source link

issue with rexec/jexecd #364

Open mat813 opened 8 years ago

mat813 commented 8 years ago

During the package cleaning stage at the start, only one jexecd is used multiple times in parallel, and it seems sometime, one rexec does not get the message saying "hey, it's over, you can die" so it gets stuck, killing it unstucks the build.

This is on a 10.1.

# pstree -s poudri
-+= 00001 root /sbin/init --
 |-+= 01888 root tmux: server (/tmp/tmux-0/default) (tmux)
 | \-+= 01911 root -bash (bash)
 |   \-+= 95750 root /bin/sh ./update.sh
 |     \-+- 59354 root sh -e /usr/local/share/poudriere/bulk.sh -j 9i386 -p pkgng -J 24:40 -f /poudriere/pkgng -f /poudriere/pkgng-default
 |       |--- 06252 root pwait 14154 433 1404 1446 1777 1779 1784 1787 4020 4268 4772 4789 4796 4801 4837 4841 4844 4849 4853 4896 5057 5061 5064 5070 5078 5117 5129 5160 5203 5318 5342 5379 5703 5718 5844 5917 6163 6168 6173 6182 6186 6227
 |       |-+- 14154 root sh -e /usr/local/share/poudriere/bulk.sh -j 9i386 -p pkgng -J 24:40 -f /poudriere/pkgng -f /poudriere/pkgng-default
 |       | \-+- 14158 root sh -e /usr/local/share/poudriere/bulk.sh -j 9i386 -p pkgng -J 24:40 -f /poudriere/pkgng -f /poudriere/pkgng-default
 |       |   \--- 14842 root rexec -s /poudriere/data/.m/9i386-pkgng/ref/../9i386-pkgng.sock -u root which perl5>=5.20<5.21
 |       |-+- 59593 root sh -e /usr/local/share/poudriere/bulk.sh -j 9i386 -p pkgng -J 24:40 -f /poudriere/pkgng -f /poudriere/pkgng-default
 |       | \--- 47754 root sleep 2
 |       \--- 59598 root sh -e /usr/local/share/poudriere/bulk.sh -j 9i386 -p pkgng -J 24:40 -f /poudriere/pkgng -f /poudriere/pkgng-default
 |--= 59552 root jexecd: poudriere(9i386-pkgng) (jexecd)
 \--= 59555 root jexecd: poudriere(9i386-pkgng-n) (jexecd)
mat813 commented 8 years ago

This seems to happen to very short lived process only, like which, test

bdrewery commented 8 years ago

Please try this patch against ports-mgmt/poudriere-devel: https://people.freebsd.org/~bdrewery/patches/poudriere-jexecd-no-reaper-race.diff

mat813 commented 8 years ago

The second patch you gave me on irc fixes the problem for me.

mat813 commented 8 years ago

Ah, spoke too soon, just had a:

rexec -s /poudriere/data/.m/9amd64-pkgng/ref/../9amd64-pkgng.sock -u root test -f /lib/libexpat.so

hang on me

bdrewery commented 8 years ago

I'll try to address it in the next few days.

bdrewery commented 4 years ago

I just hit this on my restore-jexecd branch where I fixed support for multi-process. Initially I just excluded it from the deps phase which was good. I'm trying to enable it though.

(gdb) bt
#0  _recvfrom () at _recvfrom.S:4
#1  0x000000000020c6da in buf_recv (sock=20, buf=0x7fffffffe308, size=<optimized out>) at external/libnv/msgio.c:464
#2  0x0000000000209ad5 in nvlist_recv (sock=20, flags=0) at external/libnv/contrib/nvlist.c:1306
#3  0x0000000000206a9c in client_read (cl=<optimized out>) at src/libexec/poudriere/jexecd/jexecd.c:114
#4  serve (fd=5) at src/libexec/poudriere/jexecd/jexecd.c:269
#5  0x000000000020671d in main (argc=<optimized out>, argv=<optimized out>) at src/libexec/poudriere/jexecd/jexecd.c:422
(gdb)

jexecd is totally idle, no children, bunch of rexec hanging around.

~/git/poudriere2 # procstat -f 82210
  PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
82210 jexecd            text v r r-------   -       - -   /root/git/poudriere/jexecd
82210 jexecd             cwd v d r-------   -       - -   /
82210 jexecd            root v d r-------   -       - -   /poudriere/data/.m/exp-11amd64-commit-test/ref
82210 jexecd            jail v d r-------   -       - -   /poudriere/data/.m/exp-11amd64-commit-test/ref
82210 jexecd               0 v c rwa----- 300 483194820 -   /dev/pts/10
82210 jexecd               1 v c rwa----- 300 483194820 -   /dev/pts/10
82210 jexecd               2 v c rwa----- 300 483194820 -   /dev/pts/10
82210 jexecd               3 v d r----n--   1       0 -   /poudriere/data/.m/exp-11amd64-commit-test/ref/.p/var/run
82210 jexecd               4 v r -w---n-l   1       0 -   /poudriere/data/.m/exp-11amd64-commit-test/ref/.p/var/run/exp-11amd64-commit-test.pid
82210 jexecd               5 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd               6 k - rw------   1       0 -   -
82210 jexecd               7 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd               8 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd               9 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              10 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              11 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              12 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              13 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              14 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              15 s - rw------   2       0 UDS 0 999 exp-11amd64-commit-test.sock
82210 jexecd              16 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              17 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              18 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              19 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              20 s - rw------   3       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              21 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              22 s - rw------   2       0 UDS 0 1000 exp-11amd64-commit-test.sock
82210 jexecd              23 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
82210 jexecd              24 s - rw------   2       0 UDS 0 0 exp-11amd64-commit-test.sock
~/git/poudriere2 # procstat -kk 82210
  PID    TID COMM                TDNAME              KSTACK
82210 101343 jexecd              -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _sleep+0x2c3 sbwait+0x67 soreceive_generic+0x274 soreceive+0x59 kern_recvit+0x1e7 sys_recvfrom+0x86 amd64_syscall+0x2d6 fast_syscall_common+0x101
~/git/poudriere2 # procstat -kka|grep rexec
46194 101580 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _sleep+0x2c3 sbwait+0x67 soreceive_generic+0x274 soreceive+0x59 kern_recvit+0x1e7 sys_recvfrom+0x86 amd64_syscall+0x2d6 fast_syscall_common+0x101
46395 102151 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46594 102158 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46796 102283 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46814 101164 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46822 102113 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46851 102779 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46854 102599 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46870 102286 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46872 102855 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46875 101619 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46885 102956 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46887 101486 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46896 102951 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46908 101887 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46917 102126 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46918 102885 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46925 101256 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46932 102879 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101
46936 102414 rexec               -                   mi_switch+0x182 sleepq_switch+0x10e sleepq_catch_signals+0x415 sleepq_wait_sig+0xf _cv_wait_sig+0x161 seltdwait+0x8d kern_select+0x962 sys_select+0x56 amd64_syscall+0x2d6 fast_syscall_common+0x101

Looks like jexecd and rexec 46194 are both trying to read from the socket at the same time. Not sure if that's a problem. Not sure what the 999/1000 are up there either.

82210 jexecd              20 s - rw------   3       0 UDS 0 0 exp-11amd64-commit-test.sock

That's the fd being read from in jexecd. It has the most references as well.

The rexec pid 46194 from above is also in nvlist_recv:

#0  _recvfrom () at _recvfrom.S:4
#1  0x000000000020ae4a in buf_recv (sock=4, buf=0x7fffffffe0b8, size=<optimized out>) at external/libnv/msgio.c:464
#2  0x0000000000208245 in nvlist_recv (sock=4, flags=0) at external/libnv/contrib/nvlist.c:1306
#3  0x000000000020550e in main (argc=<optimized out>, argv=<optimized out>) at src/libexec/poudriere/rexec/rexec.c:104

I guess it's just bogus to have multiple rexec trying to recvfrom because they might eat a command being sent from another rexec to jexecd. All processes involved race to waiting forever in recvfrom.

bdrewery commented 4 years ago

The issue seems to be that the child process exits before it gets added to kevent. So it never gets the exit notice to return to rexec.