cockroachdb / cockroach


SIGILL: illegal instruction #117423

Open jeremymv2 opened 8 months ago

jeremymv2 commented 8 months ago

Describe the problem

Recently we started seeing this error and stack trace with version 23.1.1. Running /cockroach/cockroach, even without any arguments, is enough to trigger it. Running the image cockroachdb/cockroach:v22.2.17 on the same node works fine.

To Reproduce

node1:~$ k run -it db-debug --image cockroachdb/cockroach:v23.1.1 -n cockroach bash
If you don't see a command prompt, try pressing enter.
[root@db-debug cockroach]# /cockroach/cockroach
SIGILL: illegal instruction
PC=0xc49bbb m=0 sigcode=2
instruction bytes: 0x48 0xc7 0xcc 0x8 0x1c 0x0 0x0 0x0 0x48 0x8d 0xd 0x81 0x25 0xe7 0x4 0x48

goroutine 1 [running, locked to thread]:
errors.New(...)
        GOROOT/src/errors/errors.go:59
google.golang.org/grpc.init()
        google.golang.org/grpc/external/org_golang_google_grpc/clientconn.go:1523 +0x37b fp=0xc00045f010 sp=0xc00045efc0 pc=0xc49bbb
runtime.doInit(0xa2815e0)
        GOROOT/src/runtime/proc.go:6348 +0x126 fp=0xc00045f140 sp=0xc00045f010 pc=0x4ae426
runtime.doInit(0xa258660)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f270 sp=0xc00045f140 pc=0x4ae371
runtime.doInit(0xa250760)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f3a0 sp=0xc00045f270 pc=0x4ae371
runtime.doInit(0xa2564a0)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f4d0 sp=0xc00045f3a0 pc=0x4ae371
runtime.doInit(0xa2746c0)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f600 sp=0xc00045f4d0 pc=0x4ae371
runtime.doInit(0xa2804c0)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f730 sp=0xc00045f600 pc=0x4ae371
runtime.doInit(0xa263540)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f860 sp=0xc00045f730 pc=0x4ae371
runtime.doInit(0xa265e60)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045f990 sp=0xc00045f860 pc=0x4ae371
runtime.doInit(0xa25abc0)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045fac0 sp=0xc00045f990 pc=0x4ae371
runtime.doInit(0xa281180)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045fbf0 sp=0xc00045fac0 pc=0x4ae371
runtime.doInit(0xa273640)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045fd20 sp=0xc00045fbf0 pc=0x4ae371
runtime.doInit(0xa25f3c0)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045fe50 sp=0xc00045fd20 pc=0x4ae371
runtime.doInit(0xa23ee20)
        GOROOT/src/runtime/proc.go:6325 +0x71 fp=0xc00045ff80 sp=0xc00045fe50 pc=0x4ae371
runtime.main()
        GOROOT/src/runtime/proc.go:233 +0x1d3 fp=0xc00045ffe0 sp=0xc00045ff80 pc=0x4a0fd3
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00045ffe8 sp=0xc00045ffe0 pc=0x4d3101

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        GOROOT/src/runtime/proc.go:363 +0xd6 fp=0xc000098fb0 sp=0xc000098f90 pc=0x4a13d6
runtime.goparkunlock(...)
        GOROOT/src/runtime/proc.go:369
runtime.forcegchelper()
        GOROOT/src/runtime/proc.go:302 +0xad fp=0xc000098fe0 sp=0xc000098fb0 pc=0x4a126d
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000098fe8 sp=0xc000098fe0 pc=0x4d3101
created by runtime.init.6
        GOROOT/src/runtime/proc.go:290 +0x25

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        GOROOT/src/runtime/proc.go:363 +0xd6 fp=0xc000099790 sp=0xc000099770 pc=0x4a13d6
runtime.goparkunlock(...)
        GOROOT/src/runtime/proc.go:369
runtime.bgsweep(0x0?)
        GOROOT/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000997c8 sp=0xc000099790 pc=0x48b94e
runtime.gcenable.func1()
        GOROOT/src/runtime/mgc.go:178 +0x26 fp=0xc0000997e0 sp=0xc0000997c8 pc=0x480526
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000997e8 sp=0xc0000997e0 pc=0x4d3101
created by runtime.gcenable
        GOROOT/src/runtime/mgc.go:178 +0x6b

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc0000be000?, 0x6f6c850?, 0x1?, 0x0?, 0x0?)
        GOROOT/src/runtime/proc.go:363 +0xd6 fp=0xc000099f70 sp=0xc000099f50 pc=0x4a13d6
runtime.goparkunlock(...)
        GOROOT/src/runtime/proc.go:369
runtime.(*scavengerState).park(0xb200060)
        GOROOT/src/runtime/mgcscavenge.go:389 +0x53 fp=0xc000099fa0 sp=0xc000099f70 pc=0x4899f3
runtime.bgscavenge(0x0?)
        GOROOT/src/runtime/mgcscavenge.go:617 +0x45 fp=0xc000099fc8 sp=0xc000099fa0 pc=0x489fc5
runtime.gcenable.func2()
        GOROOT/src/runtime/mgc.go:179 +0x26 fp=0xc000099fe0 sp=0xc000099fc8 pc=0x4804c6
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000099fe8 sp=0xc000099fe0 pc=0x4d3101
created by runtime.gcenable
        GOROOT/src/runtime/mgc.go:179 +0xaa

goroutine 5 [finalizer wait]:
runtime.gopark(0xb203a80?, 0xc000007860?, 0x0?, 0x0?, 0xc000098770?)
        GOROOT/src/runtime/proc.go:363 +0xd6 fp=0xc000098628 sp=0xc000098608 pc=0x4a13d6
runtime.goparkunlock(...)
        GOROOT/src/runtime/proc.go:369
runtime.runfinq()
        GOROOT/src/runtime/mfinal.go:180 +0x10f fp=0xc0000987e0 sp=0xc000098628 pc=0x47f62f
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000987e8 sp=0xc0000987e0 pc=0x4d3101
created by runtime.createfing
        GOROOT/src/runtime/mfinal.go:157 +0x45

rax    0xc0003d88d0
rbx    0x7f620be2c108
rcx    0x10
rdx    0x0
rdi    0x0
rsi    0x1
rbp    0xc00045f000
rsp    0xc00045efc0
r8     0x10
r9     0x0
r10    0x7f620bcd7528
r11    0x6f6c850
r12    0x203000
r13    0x8
r14    0xc0000061a0
r15    0x7f61e320dc46
rip    0xc49bbb
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0
[root@db-debug cockroach]# exit
exit
Session ended, resume using 'kubectl attach db-debug -c db-debug -i -t' command when the pod is running
node1:~$

The instruction bytes disassemble to:


Disassembly

Raw Hex (zero bytes in bold):

48C7CC81C000488DD8125E7448   

String Literal:

"\x48\xC7\xCC\x81\xC0\x00\x48\x8D\xD8\x12\x5E\x74\x48"

Array Literal:

{ 0x48, 0xC7, 0xCC, 0x81, 0xC0, 0x00, 0x48, 0x8D, 0xD8, 0x12, 0x5E, 0x74, 0x48 }
Disassembly:

0:  48 c7                   rex.W (bad)
2:  cc                      int3
3:  81 c0 00 48 8d d8       add    eax,0xd88d4800
9:  12 5e 74                adc    bl,BYTE PTR [rsi+0x74]
c:  48                      rex.W
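
One caveat when feeding these bytes to a disassembler: the `instruction bytes:` line in the crash dump prints each byte without zero padding (for example `0x8`, `0x0`, `0xd`), and the raw hex string above matches the unpadded concatenation of those tokens. A minimal sketch of padding the bytes first, assuming Python 3; the helper name `pad_instruction_bytes` is hypothetical and not part of any CockroachDB or Go tooling:

```python
import re


def pad_instruction_bytes(line: str) -> str:
    """Convert a Go crash-dump 'instruction bytes:' line to zero-padded hex."""
    # Each token looks like 0x48 or 0x8; capture the one or two hex digits.
    tokens = re.findall(r"0x([0-9a-fA-F]{1,2})\b", line)
    # Pad every token to two digits so byte boundaries are preserved.
    return "".join(t.zfill(2) for t in tokens).lower()


dump = (
    "instruction bytes: 0x48 0xc7 0xcc 0x8 0x1c 0x0 0x0 0x0 "
    "0x48 0x8d 0xd 0x81 0x25 0xe7 0x4 0x48"
)
padded = pad_instruction_bytes(dump)
print(padded)                      # 48c7cc081c000000488d0d8125e70448
print(len(bytes.fromhex(padded)))  # 16
```

The padded string is 16 bytes long, whereas the listing above decodes only 13, so that listing may not correspond to the actual bytes at the faulting PC.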

Node1 CPU

node1:~$ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      45 bits physical, 48 bits virtual
CPU(s):                             4
On-line CPU(s) list:                0-3
Thread(s) per core:                 1
Core(s) per socket:                 1
Socket(s):                          4
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
Stepping:                           7
CPU MHz:                            2095.077
BogoMIPS:                           4190.15
Hypervisor vendor:                  VMware
Virtualization type:                full
L1d cache:                          128 KiB
L1i cache:                          128 KiB
L2 cache:                           4 MiB
L3 cache:                           44 MiB
NUMA node0 CPU(s):                  0-3
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
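
A common first check for SIGILL is whether the CPU lacks an instruction-set extension the binary expects. The sketch below compares a flags string against the x86-64 microarchitecture levels v2 to v4. The mapping of each level to Linux flag names is an approximation, the names `LEVELS` and `missing_features` are hypothetical, and nothing in this thread establishes which level, if any, the CockroachDB binary targets:

```python
# Approximate feature sets for x86-64 levels, using Linux cpuinfo flag names.
# This mapping is an assumption for illustration, not an official table.
LEVELS = {
    "v2": {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"},
    "v3": {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"},
    "v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}


def missing_features(flags: str) -> dict[str, set[str]]:
    """Return, per level, the required flags absent from the given flags string."""
    present = set(flags.split())
    return {level: needed - present for level, needed in LEVELS.items()}


# Subset of the flags reported by lscpu on node1 that are relevant here.
node1_flags = (
    "pni ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt xsave avx f16c lahf_lm abm "
    "bmi1 avx2 bmi2 avx512f avx512dq avx512cd avx512bw avx512vl"
)
for level, missing in missing_features(node1_flags).items():
    print(level, "missing:", sorted(missing))
```

For the flags reported on node1, every set comes back empty, which suggests a missing CPU feature is not the obvious explanation here.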

Jira issue: CRDB-35169

blathers-crl[bot] commented 8 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I was unable to automatically find someone to ping.

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

yuzefovich commented 8 months ago

Do you see the same problem if you use 23.1.latest?

jeremymv2 commented 8 months ago

@yuzefovich Testing v23.1.13 on this same node does not reproduce the SIGILL; the binary at least runs long enough to handle the --help argument. Was there a build parameter or environmental change?

jeremymv2 commented 8 months ago

In fact, thus far, I have only observed this issue with v23.1.1. v23.1.2 runs successfully. https://github.com/cockroachdb/cockroach/compare/v23.1.2...v23.1.1

yuzefovich commented 8 months ago

Glad to hear that later versions do work :) I'll ping my colleagues to see if someone wants to look closer into this issue.