Closed krzaczek closed 6 years ago
This looks like OS-6467. But it appears you are running a release that already includes that fix. I'm in the process of seeing whether I can reproduce this on a newer platform image.
Just want to note that I was able to replicate this on platforms both before and after the OS-6467 fix. For anyone playing along at home, a simpler way to reproduce this is to clone gitaly and build it (or perhaps a `go install` of any project is enough):

```
git clone https://gitlab.com/gitlab-org/gitaly.git
cd gitaly
make
```
The process doesn't always abort, but run it enough times and you should eventually hit it. I'm not sure what the root cause is yet, but I know it's tripping on the following check:

```c
if (lxsap->lxsa_handler == SIG_DFL || lxsap->lxsa_handler == SIG_IGN)
	lx_err_fatal("lxsa_handler set to %s? How?!?!?",
	    (lxsap->lxsa_handler == SIG_DFL) ? "SIG_DFL" : "SIG_IGN");
```
```
root@e9e50dba-2312-69cb-c624-aefd5dada062:~/gitaly# /native/usr/bin/mdb core.platform-20180621T003454Z
mdb: core file data for mapping at 7fffeebc0000 not saved: Bad address
mdb: core file data for mapping at 7fffeee18000 not saved: Bad address
Loading modules: [ libc.so.1 ld.so.1 libc.so.6 ]
> $C
00007fffee8de840 LMfd`libc.so.1`__sigaction+0xa()
00007fffee8de890 LMfd`lx_brand.so.1`abort+0x6f()
00007fffee8de990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6b58)
00007fffee8dea10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffee8dee80, 7fffee8deb20)
00007fffee8dea20 LMfd`libc.so.1`__sighndlr+6()
00007fffee8deab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffee8dee80, 7fffee8deb20)
00007fffee8deb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffee8dee80, 7fffee8deb20)
00007fffed9ffba0 runtime.futex+0x23()
00007fffed9ffbd8 runtime.notesleep+0x9b()
00007fffed9ffc00 runtime.stopm+0xe5()
00007fffed9ffca8 runtime.findrunnable+0x50c()
00007fffed9ffcf0 runtime.schedule+0x13b()
00007fffed9ffd28 runtime.goschedImpl+0xf7()
00007fffed9ffd40 runtime.gopreempt_m+0x36()
00007fffed9ffed0 runtime.newstack+0x2aa()
000000c420814820 runtime.morestack+0x89()
00007fffedad24d8 0xc420814910()
00007fffedacf8e8 0x7fffedbccd28()
00007fffe3e31b58 0x7fffedad24d8()
00007fffe333f7c8 0x7fffedacf8e8()
0000000000000000 0x7fffe3e31b58()
>
> 7fffef4e6b58/s
0x7fffef4e6b58: lxsa_handler set to %s? How?!?!?
```
I tried teasing out the `lxsap` value via `LX_DEBUG` and `LX_DTRACE`, to no avail. This kind of feels like OS-6467, but maybe now it's just harder to hit. I need to do more digging to really tell. I'll try to poke at this some more tomorrow.
Hey @rzezeski, did you have time to dig deeper into this issue?
@krzaczek I dug into this a bit more today. It seems OS-6467 was only a partial fix: there is still some window where a SIGCHLD can sneak in during the vfork, before the signals are blocked and before the new child execs or exits. The interesting thing is that in my core the disposition is correct (i.e. the handler is set to `runtime.cgoSigtramp`), but the assert still tripped anyway. My guess is the abort signal raced with the restoration of the signal handlers. I need to discuss with @jjelinek to see if he has any ideas.
```
> e::findstack -v
stack pointer for thread e: 7fffebebd840
[ 00007fffebebd840 LMfd`libc.so.1`__sigaction+0xa() ]
  00007fffebebd890 LMfd`lx_brand.so.1`abort+0x6f()
  00007fffebebd990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6918)
  00007fffebebda10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffebebde80, 7fffebebdb20)
  00007fffebebda20 LMfd`libc.so.1`__sighndlr+6()
  00007fffebebdab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffebebde80, 7fffebebdb20)
  00007fffebebdb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffebebde80, 7fffebebdb20)
  00007fffdbdffe38 runtime.futex+0x23()
  00007fffdbdffe70 runtime.notesleep+0x9b()
  00007fffdbdffe98 runtime.stopm+0xe5()
  00007fffdbdffec8 runtime.exitsyscall0+0x109()
  000000c4208a8848 runtime.mcall+0x5b()
  00000000004118e9 syscall.Syscall+0x70()
  c08500714c21058b 0x8b48000001e1850f()
```
```
> 0x12=D
                18
> LMfd`lx_brand.so.1`stol_signo + (0t4 * 0t18)/D
LMfd`lx_brand.so.1`stol_signo+0x48:     17
```
...
```
> LMfd`lx_brand.so.1`lx_sighandlers::print lx_sighandlers_t lx_sa[0t17]
lx_sa[0t17] = {
    lx_sa[0t17].lxsa_handler = runtime.cgoSigtramp
    lx_sa[0t17].lxsa_flags = 0x1c000004
    lx_sa[0t17].lxsa_restorer = 0x7fffeee11390
    lx_sa[0t17].lxsa_mask = {
        __bits = [ 0xffffffffffffffff ]
    }
}
```
One thought: perhaps `l_block_all_signals` should be `volatile`? Otherwise, after returning from the `B_BLOCK_ALL_SIGS` brand syscall, another thread might still deliver a signal (while the vfork thread has set the disposition to default) because it hasn't seen the update to `l_block_all_signals` yet.
OS-7121 has been fixed; closing this out accordingly.
I've been running GitLab 11.0 for the past 24h on the new SmartOS 20181011T004530Z and it's working like a charm :) Great job, and thanks for the help @bcantrill @rzezeski 🥇
SmartOS: 20171221T020409Z and latest 20180719T002030Z
Zone: Ubuntu 16.04.5 LTS 20170403 7b5981c4-1889-11e7-b4c5-3f3bdfc9b88b
After upgrading golang to version 1.9 or 1.10 (from 1.8) and ruby to 2.4.x (from 2.3.x), I'm getting random failures when compiling gitaly and gitlab-workhorse. The compilation process fails randomly with a core dump; I'm eventually able to compile, but it takes several attempts.
Core file attached >>>> here <<<<
Also, on a production server running a GitLab installation, the same compiled application logs multiple errors per second that look like this:
I did some testing and was not able to reproduce it on a KVM ubuntu image (429bf9f2-bb55-4c6f-97eb-046fa905dd03) version 20180222 or on a bare metal server running Ubuntu 16.04. This seems to be an issue only in an LX zone.
Steps to reproduce:

1. Install gitaly dependencies
2. Install ruby
3. Install golang
4. Clone the repo and make
5. Run gitaly

This will produce

```
cpu_linux.cc:75] Cannot handle hot-plugged CPUs
```

errors.