TritonDataCenter / smartos-live


Golang random abort on compile #794

Closed krzaczek closed 6 years ago

krzaczek commented 6 years ago

SmartOS: 20171221T020409Z and latest 20180719T002030Z; Zone: Ubuntu 16.04.5 LTS 20170403 7b5981c4-1889-11e7-b4c5-3f3bdfc9b88b

Ubuntu 16.04.5 LTS \n \l
ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-linux]
go version go1.10.3 linux/amd64

After upgrading golang to version 1.9 or 1.10 (from 1.8) and ruby to 2.4.x (from 2.3.x) I'm getting random failures when compiling gitaly and gitlab-workhorse.

The compilation process fails randomly with a core dump. Retrying eventually succeeds, but it takes a while.

cd ruby && bundle show gitaly-proto # sanity check
/tmp/gitaly/ruby/vendor/bundle/ruby/2.4.0/gems/gitaly-proto-0.107.0
touch .ruby-bundle
rm -rf /tmp/gitaly/_build
mkdir -p "/tmp/gitaly/_build/src/gitlab.com/gitlab-org/"
ln -sf ../../../.. "/tmp/gitaly/_build/src/gitlab.com/gitlab-org/gitaly"
mkdir -p "/tmp/gitaly/_build/bin"
touch "/tmp/gitaly/_build/.ok"
go install -ldflags '-X gitlab.com/gitlab-org/gitaly/internal/version.version=0.113.0-27-g3a9ae12 -X gitlab.com/gitlab-org/gitaly/internal/version.buildtime=20180725.170322' gitlab.com/gitlab-org/gitaly/cmd/gitaly gitlab.com/gitlab-org/gitaly/cmd/gitaly-ssh
Makefile:55: recipe for target 'build' failed
make: *** [build] Aborted (core dumped)

Core file attached >>>> here <<<<

[root@smartos11 /zones/05110f14-6f61-4c3b-b6c0-ad46a0eb8076/cores]# mdb core.go.98776
mdb: core file data for mapping at 7fffeebc0000 not saved: Bad address
mdb: core file data for mapping at 7fffeee18000 not saved: Bad address
Loading modules: [ libc.so.1 ld.so.1 libc.so.6 ]
> ::status
debugging core file of go (64-bit) from 05110f14-6f61-4c3b-b6c0-ad46a0eb8076
file: /zones/05110f14-6f61-4c3b-b6c0-ad46a0eb8076/root/usr/local/go/bin/go
initial argv: go install -ldflags -X gitlab.com/gitlab-org/gitaly/internal/version.version=0.
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=98776 uid=0 code=-1
> $C
00007fffebe4d840 LMfd`libc.so.1`_lwp_kill+0xa()
00007fffebe4d890 LMfd`lx_brand.so.1`abort+0x8a()
00007fffebe4d990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6ef8)
00007fffebe4da10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffebe4de80, 7fffebe4db20)
00007fffebe4da20 LMfd`libc.so.1`__sighndlr+6()
00007fffebe4dab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffebe4de80, 7fffebe4db20)
00007fffebe4db00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffebe4de80, 7fffebe4db20)
000000c420788e00 syscall.Syscall6+0x2a()
000000c420788e78 os.(*Process).wait+0x3c()
000000c420788ea8 os.(*Process).Wait+0x2b()
000000c420788f20 os/exec.(*Cmd).Wait+0x5c()
000000c420788f48 os/exec.(*Cmd).Run+0x5c()
000000c420789110 cmd/go/internal/work.(*Builder).runOut+0x905()
000000c420789410 cmd/go/internal/work.gcToolchain.gc+0xd3a()
000000c4207894b0 cmd/go/internal/work.(*gcToolchain).gc+0xdf()
000000c420789e60 cmd/go/internal/work.(*Builder).build+0x1a7e()
000000c420789ee8 cmd/go/internal/work.(*Builder).Do.func1+0x72()
000000c420789fb8 cmd/go/internal/work.(*Builder).Do.func2+0xbb()
0000000000000000 runtime.goexit+1()

Also, on a production server running a GitLab installation, the same compiled application logs multiple errors per second that look like this:

time="2018-07-25T19:31:28+02:00" level=info msg="E0725 19:31:28.499941654    9211 cpu_linux.cc:75]            Cannot handle hot-plugged CPUs" supervisor.args="[bundle exec bin/ruby-cd /home/git/gitlab /home/git/gitaly/ruby/bin/gitaly-ruby 9068 /tmp/gitaly-ruby237515577/socket.1]" supervisor.name=gitaly-ruby.1

I did some testing and was not able to reproduce it on a KVM Ubuntu image (429bf9f2-bb55-4c6f-97eb-046fa905dd03) version 20180222, or on a bare-metal server running Ubuntu 16.04. This seems to be an issue only in an LX zone.

Steps to reproduce.

Install gitaly dependencies

sudo apt-get install -y build-essential zlib1g-dev libyaml-dev libssl-dev libgdbm-dev libre2-dev libreadline-dev libncurses5-dev libffi-dev curl openssh-server checkinstall libxml2-dev libxslt-dev libcurl4-openssl-dev libicu-dev logrotate rsync python-docutils pkg-config cmake

#Install git-core
sudo apt-get install -y git-core

Install ruby

mkdir /tmp/ruby && cd /tmp/ruby
curl --remote-name --progress https://cache.ruby-lang.org/pub/ruby/2.4/ruby-2.4.4.tar.gz
echo 'ec82b0d53bd0adad9b19e6b45e44d54e9ec3f10c  ruby-2.4.4.tar.gz' | shasum -c - && tar xzf ruby-2.4.4.tar.gz
cd ruby-2.4.4

./configure --disable-install-rdoc
make
sudo make install
sudo gem install bundler --no-ri --no-rdoc

Install golang

curl --remote-name --progress https://dl.google.com/go/go1.10.3.linux-amd64.tar.gz
echo 'fa1b0e45d3b647c252f51f5e1204aba049cde4af177ef9f2181f43004f901035  go1.10.3.linux-amd64.tar.gz' | shasum -a256 -c - && \
  sudo tar -C /usr/local -xzf go1.10.3.linux-amd64.tar.gz
sudo ln -sf /usr/local/go/bin/{go,godoc,gofmt} /usr/local/bin/
rm go1.10.3.linux-amd64.tar.gz

Clone repo and make

cd /tmp
git clone https://gitlab.com/gitlab-org/gitaly.git
cd gitaly
make

Running gitaly

cd /tmp/gitaly
ln -s /tmp /home/git
cp config.toml.example config.toml
GITALY_DEBUG=1 ./gitaly config.toml

This will produce the cpu_linux.cc:75] Cannot handle hot-plugged CPUs errors.

rzezeski commented 6 years ago

This looks like OS-6467, but it appears you are running a release that already contains that fix. I'm in the process of seeing if I can reproduce this on a newer platform image.

rzezeski commented 6 years ago

Just want to make a note that I was able to replicate this on platforms both before and after the OS-6467 fix. For anyone playing along at home, a simpler way to reproduce this is to just clone gitaly and build it (or perhaps go install of any project is enough).

git clone https://gitlab.com/gitlab-org/gitaly.git
cd gitaly
make

The process doesn't always abort, but run it enough times and you should eventually hit it. I'm not sure what the root cause is yet, but I know it's tripping on the following code:

    if (lxsap->lxsa_handler == SIG_DFL || lxsap->lxsa_handler == SIG_IGN)
        lx_err_fatal("lxsa_handler set to %s?  How?!?!?",
            (lxsap->lxsa_handler == SIG_DFL) ? "SIG_DFL" : "SIG_IGN");
root@e9e50dba-2312-69cb-c624-aefd5dada062:~/gitaly# /native/usr/bin/mdb core.platform-20180621T003454Z 
mdb: core file data for mapping at 7fffeebc0000 not saved: Bad address
mdb: core file data for mapping at 7fffeee18000 not saved: Bad address
Loading modules: [ libc.so.1 ld.so.1 libc.so.6 ]
> $C
00007fffee8de840 LMfd`libc.so.1`__sigaction+0xa()
00007fffee8de890 LMfd`lx_brand.so.1`abort+0x6f()
00007fffee8de990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6b58)
00007fffee8dea10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffee8dee80, 7fffee8deb20)
00007fffee8dea20 LMfd`libc.so.1`__sighndlr+6()
00007fffee8deab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffee8dee80, 7fffee8deb20)
00007fffee8deb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffee8dee80, 7fffee8deb20)
00007fffed9ffba0 runtime.futex+0x23()
00007fffed9ffbd8 runtime.notesleep+0x9b()
00007fffed9ffc00 runtime.stopm+0xe5()
00007fffed9ffca8 runtime.findrunnable+0x50c()
00007fffed9ffcf0 runtime.schedule+0x13b()
00007fffed9ffd28 runtime.goschedImpl+0xf7()
00007fffed9ffd40 runtime.gopreempt_m+0x36()
00007fffed9ffed0 runtime.newstack+0x2aa()
000000c420814820 runtime.morestack+0x89()
00007fffedad24d8 0xc420814910()
00007fffedacf8e8 0x7fffedbccd28()
00007fffe3e31b58 0x7fffedad24d8()
00007fffe333f7c8 0x7fffedacf8e8()
0000000000000000 0x7fffe3e31b58()
> 
> 7fffef4e6b58/s
0x7fffef4e6b58: lxsa_handler set to %s?  How?!?!?

I tried teasing out the lxsap value via LX_DEBUG and LX_DTRACE, to no avail. This kind of feels like OS-6467, but maybe now it's just harder to hit; I need to do more digging to really tell. I'll try to poke at this some more tomorrow.

krzaczek commented 6 years ago

Hey @rzezeski, did you have time to dig deeper into this issue?

rzezeski commented 6 years ago

@krzaczek I dug into this a bit more today. It seems OS-6467 was only a partial fix: there is still some window where the SIGCHLD can sneak in during the vfork, before the signals are blocked and before the new child execs or exits. The interesting thing is that in my core the disposition is correct (i.e. the handler is set to runtime.cgoSigtramp), but the assert tripped anyway. My guess is that the abort signal raced with the restoration of the signal handlers. I need to discuss with @jjelinek to see if he has any ideas.

> e::findstack -v
stack pointer for thread e: 7fffebebd840
[ 00007fffebebd840 LMfd`libc.so.1`__sigaction+0xa() ]
  00007fffebebd890 LMfd`lx_brand.so.1`abort+0x6f()
  00007fffebebd990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6918)
  00007fffebebda10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffebebde80, 7fffebebdb20)
  00007fffebebda20 LMfd`libc.so.1`__sighndlr+6()
  00007fffebebdab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffebebde80, 7fffebebdb20)
  00007fffebebdb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffebebde80, 7fffebebdb20)
  00007fffdbdffe38 runtime.futex+0x23()
  00007fffdbdffe70 runtime.notesleep+0x9b()
  00007fffdbdffe98 runtime.stopm+0xe5()
  00007fffdbdffec8 runtime.exitsyscall0+0x109()
  000000c4208a8848 runtime.mcall+0x5b()
  00000000004118e9 syscall.Syscall+0x70()
  c08500714c21058b 0x8b48000001e1850f()

> 0x12=D
                18  

> LMfd`lx_brand.so.1`stol_signo + (0t4 * 0t18)/D
LMfd`lx_brand.so.1`stol_signo+0x48:             17    
...
> LMfd`lx_brand.so.1`lx_sighandlers::print lx_sighandlers_t lx_sa[0t17]
lx_sa[0t17] = {
    lx_sa[0t17].lxsa_handler = runtime.cgoSigtramp
    lx_sa[0t17].lxsa_flags = 0x1c000004
    lx_sa[0t17].lxsa_restorer = 0x7fffeee11390
    lx_sa[0t17].lxsa_mask = {
        __bits = [ 0xffffffffffffffff ]
    }
}
rzezeski commented 6 years ago

One thought: perhaps l_block_all_signals should be volatile? Otherwise, after returning from the B_BLOCK_ALL_SIGS brand syscall, another thread might still deliver a signal (while the vfork thread has set the disposition to default) because it hasn't seen the update to l_block_all_signals yet.

rzezeski commented 6 years ago

I created OS-7121 to track this.

bcantrill commented 6 years ago

OS-7121 has been fixed; closing this out accordingly.

krzaczek commented 6 years ago

I've been running GitLab 11.0 for the past 24h on the new SmartOS 20181011T004530Z and it's working like a charm :) Great job, and thanks for the help @bcantrill @rzezeski 🥇