Open amitsadaphule opened 3 years ago
The issue seems to have been introduced somewhere between tags v20.1.0-beta.2 (no issue) and v20.1.0-beta.4 (has issue). I used git bisect to find the commit that introduced it, and it pointed to this commit. I still need to check the code to find out exactly which change triggers this on ppc64le but not on amd64.
@petermattis can you please point me in the right direction to resolve this issue?
Tried running the test through gdb as:
# go test -c -tags ' make ppc64le_redhat_linux' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v20.1.0-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=9d456b9ec82cbf9a740a092c0d9f56da48779689" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=ppc64le-redhat-linux" ' -run "TestZip" -timeout 30m ./pkg/cli -v -count=1
# gdb cli.test
SIGSEGV was raised in multiple threads during the execution, but I was able to continue until the morestack on g0 error showed up. Pasting the relevant log below:
Thread 237 "cli.test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fff7dfded90 (LWP 3686)]
0x00003fffb79b5668 in backtrace () from /lib64/libc.so.6
(gdb) bt
#0 0x00003fffb79b5668 in backtrace () from /lib64/libc.so.6
#1 0x0000000012a234ec in InternalHandler () at /root/go/src/github.com/cockroachdb/cockroach/c-deps/libroach/stack_trace.cc:160
#2 InternalHandler () at /root/go/src/github.com/cockroachdb/cockroach/c-deps/libroach/stack_trace.cc:149
#3 <signal handler called>
#4 0x00000000102c1b40 in runtime.futex () at /usr/local/go/src/runtime/sys_linux_ppc64x.s:472
#5 0x000000001028ce94 in runtime.futexsleep (addr=0x161fffc0 <runtime.timers+928>, val=0, ns=882281981) at /usr/local/go/src/runtime/os_linux.go:50
#6 0x000000001026a2e8 in runtime.notetsleep_internal (~r2=<optimized out>, n=0x161fffc0 <runtime.timers+928>, ns=882281981)
at /usr/local/go/src/runtime/lock_futex.go:193
#7 0x000000001026a49c in runtime.notetsleepg (~r2=<optimized out>, n=0x161fffc0 <runtime.timers+928>, ns=882281981) at /usr/local/go/src/runtime/lock_futex.go:228
#8 0x00000000102b03f0 in runtime.timerproc (tb=0x161fffa0 <runtime.timers+896>) at /usr/local/go/src/runtime/time.go:311
#9 0x00000000102c1044 in runtime.goexit () at /usr/local/go/src/runtime/asm_ppc64x.s:884
(gdb) c
Continuing.
fatal: morestack on g0
(0xc0003c20b0, 0xc004f3e000, 0xc0038461a0)
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0x118
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xa4
goroutine 5543806 [select, 10 minutes]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc003c0c300)
/root/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_server.go:935 +0x18c
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.newHTTP2Server
/root/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_server.go:282 +0xd00
goroutine 5543034 [sync.Cond.Wait, 10 minutes]:
runtime.goparkunlock(...)
/usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc001d4fed0, 0xb1)
/usr/local/go/src/runtime/sema.go:510 +0x104
sync.(*Cond).Wait(0xc001d4fec0)
/usr/local/go/src/sync/cond.go:56 +0xcc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc0012c2180, 0x140c79a0, 0xc00882c5a0)
/root/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:197 +0xbc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2(0x140c79a0, 0xc00882c5a0)
/root/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:166 +0x3c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0068ea860, 0xc004f3e000, 0xc0068ea850)
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0x118
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xa4
goroutine 5543012 [sync.Cond.Wait, 10 minutes]:
runtime.goparkunlock(...)
/usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc001d4fed0, 0xfa5cb12400000097)
/usr/local/go/src/runtime/sema.go:510 +0x104
sync.(*Cond).Wait(0xc001d4fec0)
/usr/local/go/src/sync/cond.go:56 +0xcc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc0012c2180, 0x140c79a0, 0xc008f00f30)
/root/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:197 +0xbc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2(0x140c79a0, 0xc008f00f30)
/root/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:166 +0x3c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0068ea5a0, 0xc004f3e000, 0xc0068ea590)
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0x118
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/root/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xa4
r0 0x12a234ec r1 0xc000eeaf10
r2 0x3fffb7a87100 r3 0x3fffa3994d08
r4 0x64 r5 0x18
r6 0x4 r7 0x4
r8 0x4 r9 0x4082007c7c240000
r10 0x387f1d683fe015ee r11 0x3fffb7f90478
r12 0x3fffb7a84e48 r13 0x3fffa2a06500
r14 0x1026a274 r15 0xc000467678
r16 0xffffffffffffffff r17 0x1ff
r18 0x1 r19 0x13fc91ec
r20 0xc002b2a000 r21 0xc00084ea80
r22 0x0 r23 0x0
r24 0x8 r25 0x38
r26 0x0 r27 0x0
r28 0x0 r29 0x0
r30 0xc000eebd78 r31 0xc000eeaf10
pc 0x3fffb79b5668 ctr 0x3fffb79b5610
link 0x12a234ec xer 0x0
ccr 0x44404424 trap 0x300
[Thread 0x3fff55b4ed90 (LWP 4253) exited]
[Thread 0x3fff56b6ed90 (LWP 4251) exited]
[Thread 0x3fff4f9eed90 (LWP 4260) exited]
[Thread 0x3fff7d7ced90 (LWP 3687) exited]
[Thread 0x3fff4cdeed90 (LWP 4264) exited]
[Thread 0x3fff5431ed90 (LWP 4256) exited]
[Thread 0x3fff9e6fed90 (LWP 3537) exited]
[Thread 0x3fff5737ed90 (LWP 4250) exited]
[Thread 0x3fff54b2ed90 (LWP 4255) exited]
[Thread 0x3fffa29fed90 (LWP 3542) exited]
[Thread 0x3fff6449ed90 (LWP 4913) exited]
[Thread 0x3fff5533ed90 (LWP 4254) exited]
[Thread 0x3fff532fed90 (LWP 4258) exited]
[Thread 0x3fff5839ed90 (LWP 4248) exited]
[Thread 0x3fff53b0ed90 (LWP 4257) exited]
[Thread 0x3fff58baed90 (LWP 4247) exited]
[Thread 0x3fff4bdced90 (LWP 4266) exited]
[Thread 0x3fff4d5fed90 (LWP 4263) exited]
[Thread 0x3fffac5fed90 (LWP 3465) exited]
[Thread 0x3fff5635ed90 (LWP 4252) exited]
[Thread 0x3fff4e9ced90 (LWP 4262) exited]
[Thread 0x3fff457fed90 (LWP 4930) exited]
[Thread 0x3fff4c5ded90 (LWP 4265) exited]
[Thread 0x3fff3cdeed90 (LWP 4946) exited]
[Thread 0x3fffa99fed90 (LWP 3469) exited]
[Thread 0x3fff44feed90 (LWP 4929) exited]
[Thread 0x3fff4f1ded90 (LWP 4261) exited]
[Thread 0x3fffaa5eed90 (LWP 3468) exited]
[Thread 0x3fff7dfded90 (LWP 3686) exited]
[Thread 0x3fffb3d9ed90 (LWP 3451) exited]
[Thread 0x3fffaadfed90 (LWP 3467) exited]
[Thread 0x3fff501fed90 (LWP 4259) exited]
[Thread 0x3fff57b8ed90 (LWP 4249) exited]
Thread 8 received signal SIG34, Real-time event 34.
[Switching to Thread 0x3fffb15eed90 (LWP 3456)]
0x00000000102c1b40 in runtime.futex () at /usr/local/go/src/runtime/sys_linux_ppc64x.s:472
472 SYSCALL $SYS_futex
(gdb) c
Continuing.
../../gdb/linux-nat.c:1784: internal-error: virtual void linux_nat_target::resume(ptid_t, int, gdb_signal): Assertion `signo == GDB_SIGNAL_0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y
This is a bug, please report it. For instructions, see:
<http://www.gnu.org/software/gdb/bugs/>.
../../gdb/linux-nat.c:1784: internal-error: virtual void linux_nat_target::resume(ptid_t, int, gdb_signal): Assertion `signo == GDB_SIGNAL_0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
Any suggestions?
Cc @seth-priya @gerrith3
Built gdb 10.1 from source and am using that for debugging now, since GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8, which is available from the yum repo, runs into an issue where it endlessly prints the following warning:
warning: unable to open /proc file '/proc/XXX/status'
This seems to be a known issue: https://sourceware.org/pipermail/gdb-prs/2020q1/026258.html
Also, built glibc 2.28 with debug symbols in order to step into the backtrace() call and find the exact point of failure. Working on building and executing cli.test against this glibc build.
@knz any suggestions?
Can you try using go 1.15.10 or newer? This looks like a bug in the go runtime which may already have been fixed upstream.
Thanks for the feedback @knz, I'll try that. Actually, I had tried go 1.15.3 before for tag v20.1.0, but ran into too many test failures with it. So I switched back to go 1.13.5, going by the prerequisites table at https://www.cockroachlabs.com/docs/v20.1/install-cockroachdb-linux#build-from-source.
I tried building v20.1.13, which requires go 1.15.x, with go 1.15.11. But the issue persists there too.
v20.1.0 is very outdated now. Is there a chance you could try building something more recent? What does the master branch say?
Built master with go 1.15.11 and executed the test. The issue is seen there too.
I think given the symptoms we are looking at an issue in the go runtime. I would say, either try to build with Go 1.16 instead of 1.15, or file an issue upstream (in the go repository) with clear reproduction steps.
Thanks @knz ! :) I'll try the build with go 1.16 and see if it works. Otherwise, I will raise an issue in the go repo.
With go 1.16.3, I'm running into an issue building the code. I see the following at the beginning of the make buildoss command's log:
go install -v ./pkg/cmd/prereqs
go: cannot find main module, but found Gopkg.lock in /root/go/src/github.com/cockroachdb/cockroach
to create a module there, run:
go mod init
Running make with -j8
GOPATH set to /root/go
go install -v ./pkg/cmd/prereqs
go: cannot find main module, but found Gopkg.lock in /root/go/src/github.com/cockroachdb/cockroach
to create a module there, run:
go mod init
And then it fails after configure with error:
make: *** No rule to make target 'bin/prereqs', needed by 'bin/uptodate'. Stop.
Full log below
# make buildoss
GOPATH set to /root/go
Detected change in build system. Rebooting make.
gitdir=$(git rev-parse --git-dir 2>/dev/null || true); \
if test -n "$gitdir"; then \
git submodule update --init --recursive; \
fi
mkdir -p bin
touch bin/.submodules-initialized
go install -v ./pkg/cmd/prereqs
go: cannot find main module, but found Gopkg.lock in /root/go/src/github.com/cockroachdb/cockroach
to create a module there, run:
go mod init
Running make with -j8
GOPATH set to /root/go
go install -v ./pkg/cmd/prereqs
go: cannot find main module, but found Gopkg.lock in /root/go/src/github.com/cockroachdb/cockroach
to create a module there, run:
go mod init
rm -rf /root/go/native/ppc64le-redhat-linux/jemalloc
mkdir -p /root/go/native/ppc64le-redhat-linux/jemalloc
cd /root/go/native/ppc64le-redhat-linux/jemalloc && /root/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure --enable-prof
checking for xsltproc... false
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether compiler is cray... no
checking whether compiler supports -std=gnu11... yes
checking whether compiler supports -Wall... yes
checking whether compiler supports -Werror=declaration-after-statement... yes
checking whether compiler supports -Wshorten-64-to-32... no
checking whether compiler supports -Wsign-compare... yes
checking whether compiler supports -pipe... yes
checking whether compiler supports -g3... yes
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking whether byte ordering is bigendian... no
checking size of void *... 8
checking size of int... 4
checking size of long... 8
checking size of long long... 8
checking size of intmax_t... 8
checking build system type... powerpc64le-unknown-linux-gnu
checking host system type... powerpc64le-unknown-linux-gnu
checking for ar... ar
checking malloc.h usability... yes
checking malloc.h presence... yes
checking for malloc.h... yes
checking whether malloc_usable_size definition can use const argument... no
checking for library containing log... -lm
checking whether __attribute__ syntax is compilable... yes
checking whether compiler supports -fvisibility=hidden... yes
checking whether compiler supports -Werror... yes
checking whether compiler supports -herror_on_warning... yes
checking whether tls_model attribute is compilable... yes
checking whether compiler supports -Werror... yes
checking whether compiler supports -herror_on_warning... yes
checking whether alloc_size attribute is compilable... yes
checking whether compiler supports -Werror... yes
checking whether compiler supports -herror_on_warning... yes
checking whether format(gnu_printf, ...) attribute is compilable... yes
checking whether compiler supports -Werror... yes
checking whether compiler supports -herror_on_warning... yes
checking whether format(printf, ...) attribute is compilable... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking for ranlib... ranlib
checking for ld... /usr/bin/ld
checking for autoconf... /usr/bin/autoconf
checking for memalign... yes
checking for valloc... yes
checking whether compiler supports -O3... yes
checking whether compiler supports -funroll-loops... yes
checking unwind.h usability... yes
checking unwind.h presence... yes
checking for unwind.h... yes
checking for _Unwind_Backtrace in -lgcc... yes
checking configured backtracing method... libgcc
checking for sbrk... yes
checking whether utrace(2) is compilable... no
checking whether valgrind is compilable... no
checking whether a program using __builtin_unreachable is compilable... yes
checking whether a program using __builtin_ffsl is compilable... yes
checking LG_PAGE... 16
Missing VERSION file, and unable to generate it; creating bogus VERSION
checking pthread.h usability... yes
checking pthread.h presence... yes
checking for pthread.h... yes
checking for pthread_create in -lpthread... yes
checking whether pthread_atfork(3) is compilable... yes
checking for library containing clock_gettime... none required
checking whether clock_gettime(CLOCK_MONOTONIC_COARSE, ...) is compilable... yes
checking whether clock_gettime(CLOCK_MONOTONIC, ...) is compilable... yes
checking whether mach_absolute_time() is compilable... no
checking whether compiler supports -Werror... yes
checking whether syscall(2) is compilable... yes
checking for secure_getenv... yes
checking for issetugid... no
checking for _malloc_thread_cleanup... no
checking for _pthread_mutex_init_calloc_cb... no
checking for TLS... yes
checking whether C11 atomics is compilable... yes
checking whether atomic(9) is compilable... no
checking whether Darwin OSAtomic*() is compilable... no
checking whether madvise(2) is compilable... yes
checking whether madvise(..., MADV_FREE) is compilable... yes
checking whether madvise(..., MADV_DONTNEED) is compilable... yes
checking whether madvise(..., MADV_[NO]HUGEPAGE) is compilable... yes
checking whether to force 32-bit __sync_{add,sub}_and_fetch()... no
checking whether to force 64-bit __sync_{add,sub}_and_fetch()... no
checking for __builtin_clz... yes
checking whether Darwin os_unfair_lock_*() is compilable... no
checking whether Darwin OSSpin*() is compilable... no
checking whether glibc malloc hook is compilable... yes
checking whether glibc memalign hook is compilable... yes
checking whether pthreads adaptive mutexes is compilable... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating jemalloc.pc
config.status: creating doc/html.xsl
config.status: creating doc/manpages.xsl
config.status: creating doc/jemalloc.xml
config.status: creating include/jemalloc/jemalloc_macros.h
config.status: creating include/jemalloc/jemalloc_protos.h
config.status: creating include/jemalloc/jemalloc_typedefs.h
config.status: creating include/jemalloc/internal/jemalloc_internal.h
config.status: creating test/test.sh
config.status: creating test/include/test/jemalloc_test.h
config.status: creating config.stamp
config.status: creating bin/jemalloc-config
config.status: creating bin/jemalloc.sh
config.status: creating bin/jeprof
config.status: creating include/jemalloc/jemalloc_defs.h
config.status: creating include/jemalloc/internal/jemalloc_internal_defs.h
config.status: creating test/include/test/jemalloc_test_defs.h
config.status: executing include/jemalloc/internal/private_namespace.h commands
config.status: executing include/jemalloc/internal/private_unnamespace.h commands
config.status: executing include/jemalloc/internal/public_symbols.txt commands
config.status: executing include/jemalloc/internal/public_namespace.h commands
config.status: executing include/jemalloc/internal/public_unnamespace.h commands
config.status: executing include/jemalloc/internal/size_classes.h commands
config.status: executing include/jemalloc/jemalloc_protos_jet.h commands
config.status: executing include/jemalloc/jemalloc_rename.h commands
config.status: executing include/jemalloc/jemalloc_mangle.h commands
config.status: executing include/jemalloc/jemalloc_mangle_jet.h commands
config.status: executing include/jemalloc/jemalloc.h commands
===============================================================================
jemalloc version : 0.0.0-0-g0000000000000000000000000000000000000000
library revision : 2
CONFIG : --enable-prof CFLAGS=-g1 LDFLAGS=
CC : gcc
CONFIGURE_CFLAGS : -std=gnu11 -Wall -Werror=declaration-after-statement -Wsign-compare -pipe -g3 -fvisibility=hidden -O3 -funroll-loops
SPECIFIED_CFLAGS : -g1
EXTRA_CFLAGS :
CPPFLAGS : -D_GNU_SOURCE -D_REENTRANT
LDFLAGS :
EXTRA_LDFLAGS :
LIBS : -lm -lgcc -lm -lpthread
RPATH_EXTRA :
XSLTPROC : false
XSLROOT :
PREFIX : /usr/local
BINDIR : /usr/local/bin
DATADIR : /usr/local/share
INCLUDEDIR : /usr/local/include
LIBDIR : /usr/local/lib
MANDIR : /usr/local/share/man
srcroot : /root/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/
abs_srcroot : /root/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/
objroot :
abs_objroot : /root/go/native/ppc64le-redhat-linux/jemalloc/
JEMALLOC_PREFIX :
JEMALLOC_PRIVATE_NAMESPACE
: je_
install_suffix :
malloc_conf :
autogen : 0
cc-silence : 1
debug : 0
code-coverage : 0
stats : 1
prof : 1
prof-libunwind : 0
prof-libgcc : 1
prof-gcc : 0
tcache : 1
thp : 1
fill : 1
utrace : 0
valgrind : 0
xmalloc : 0
munmap : 0
lazy_lock : 0
tls : 1
cache-oblivious : 1
===============================================================================
make: *** No rule to make target 'bin/prereqs', needed by 'bin/uptodate'. Stop.
Any suggestions? Does it have anything to do with this?
First run make build; that will generate all the dependencies; then you can run make buildoss.
Sorry, I forgot to mention one thing earlier. The build failure with go 1.16.3 mentioned in the above comment was for cockroach v20.1.12. master builds fine with go 1.16.3. But the fatal: morestack on g0 issue persists there too for pkg/cli=>TestZip.
This is a messy bug. Ultimately, the issue is that backtrace isn't really async-signal-safe. This causes a messy chain of failures. The ppc glibc backtrace walks stack frames assuming an ELFv2-compatible stack frame with backchain pointers. No guarantees can be made about the stack layout during a signal with mixed Go/C code. Thus, a segfault is almost guaranteed at some point when sending signals to all threads and calling backtrace in signal context.
The g0 issue occurs because of the segfault: the go runtime attempts to set up a call to runtime.sigpanic on a thread running a C function through cgo. This is likely a golang issue which can be addressed to improve diagnostics when crashing.
I'd recommend not doing this on ppc64, or using a backtrace limit which prevents walking beyond the kernel-stacked signal frame.
NB we had a perhaps-related bug in that area fixed in #64081 (quite sure this doesn't solve the problem at hand, but it may simplify it)
Thanks @pmur @knz for the guidance! @knz you're right about the issue not getting resolved with this change. I've raised the issue with a Go expert on ppc64le, hoping to get some feedback soon.
@amitsadaphule, it is a signal-safety issue.
Calling backtrace from signal context, from a thread running go code, is likely to cause backtrace to segfault. The initial signal handler runs on the alternate signal stack (set up by go, and the same stack used for g0). The segfault causes a second interrupt frame to be placed onto the alternate stack. This second signal is taken recursively, and handled by the Go runtime. This effectively forces execution to resume at runtime.sigpanic when the interrupt returns.
The go runtime takes no precautions against nested signals, and thus it runs runtime.sigpanic as if in normal context, and the first stack size check fails with "fatal: morestack on g0". Ignoring a few other details, the aforementioned function expects to run on a normal go stack (that is, not g0), not the alternate stack which happened to be running an unrelated C handler. Thus, you will see quite a few of those messages, as each thread running go code will fail in a similar manner when subjected to the above.
Anyhow, I would suggest updating the title of this bug to "... fails when calling backtrace on ppc64le" or similar, to more accurately reflect what is happening. Limiting the stack trace to 3 entries on any ppc target is probably enough to ensure you never attempt to unwind beyond the kernel's signal stack frame.
I suspect you would see similar error messages on other architectures in similar conditions (that is a segfault from a C interrupt while running go code).
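To illustrate the depth-cap idea only: the real change would belong in the C++ handler in c-deps/libroach/stack_trace.cc, where glibc's backtrace takes a buffer and a maximum entry count. The Go sketch below is an analogy, not that fix. It shows the same pattern with runtime.Callers, which never records more program counters than the slice it is given. The names cappedFrames and maxFrames are hypothetical, and the value 3 is simply the number suggested above.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxFrames mirrors the "3 entries" suggested in the thread. It is an
// assumption taken from that comment, not a measured value.
const maxFrames = 3

// cappedFrames returns the names of at most limit frames, innermost first.
// The walk cannot go deeper than limit because runtime.Callers stops once
// the slice it was handed is full.
func cappedFrames(limit int) []string {
	pcs := make([]uintptr, limit)
	n := runtime.Callers(1, pcs) // skip = 1 starts at cappedFrames itself
	frames := runtime.CallersFrames(pcs[:n])
	names := make([]string, 0, n)
	// One PC can expand to several frames when calls are inlined, so the
	// cap is enforced again while iterating.
	for len(names) < limit {
		f, more := frames.Next()
		names = append(names, f.Function)
		if !more {
			break
		}
	}
	return names
}

func main() {
	for i, name := range cappedFrames(maxFrames) {
		fmt.Printf("#%d %s\n", i, name)
	}
}
```

The design point is that the cap is enforced by the size of the buffer handed to the unwinder, so the walk stops early no matter what lies further up the stack.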
@laboger here.
As Paul stated in an earlier post, calling backtrace in this way on ppc64le or probably other non-x86 targets will not be successful because the stackframe generated by Go is not what libc expects. It will either SEGV or stop at the point where it changes from non-Go to Go. And even when it does work, as on x86, it will only dump out the stack for one goroutine for each thread since backtrace doesn't know about goroutines. The goroutine it happens to display might not be very meaningful.
There have been other similar issues in the Go community related to unwinding the stack when there is a mix of Go and non-Go and in particular one issue recommends using Ian Lance Taylor's cgosymbolizer to get the stack with cgo. (I see that has been vendored into the cockroach repository.) Is it possible to create a Go signal handler to dump out the goroutine stacks using the cgosymbolizer?
Is it possible to create a Go signal handler to dump out the goroutine stacks using the cgosymbolizer?
Do you mean changing libroach/stack_trace.cc to use cgosymbolizer instead of backtrace?
@petermattis does this make sense to you?
What I was thinking originally was to create a Go signal handler which called cgosymbolizer and keep the existing signal handler for non-Go code.
I'm going to open an issue in the Go community to get more feedback on what the best solution might be.
This likely requires a patch to glibc to work around. A patch was proposed earlier this year to unify the ppc backtrace implementation (https://sourceware.org/pipermail/libc-alpha/2021-February/122600.html). However, it is still pending in an attempt to preserve compatibility with older binaries (I think those built with gcc <= 8).
I was able to verify that this patch works around the issue by building glibc with the patch applied, building a test binary for cli.test, appending lib64 to the ld.so library path (cli.test links against ncurses, so this is a manual fixup), running something like $GLIBC_BUILDDIR/testrun.sh ./cli.test -test.run=TestZip, and observing that the crash no longer occurs.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Description
I built cockroachdb v20.1.0 on ppc64le, and when I executed the test suite, I found that the test TestZip in pkg/cli had failed. I tried executing the test independently, which also failed consistently. To check whether this behavior is also seen in other cockroachdb versions, I tried v19.2.10, and that did not show this failure when the test was executed independently.
Any pointers on what could be the source of the issue would be a great help.
To Reproduce
Here is the command I used to execute the failing test independently:
make test PKG=./pkg/cli TESTS=TestZip TESTFLAGS='-v -count=1'
Pasting relevant parts of the log, for cockroach v20.1.0:
However, when I executed the same test on x86_64, no failures were seen. Also, the error was not seen for cockroach v19.2.10 in the same environment.
Environment
CockroachDB: [v20.1.0]
Architecture and OS: [ppc64le/RHEL/UBI 8.3]
Go: [1.13.5]
Jira issue: CRDB-6371