cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

ccl/backupccl: TestRestoreDatabaseVersusTable failed #134020

Closed cockroach-teamcity closed 2 weeks ago

cockroach-teamcity commented 1 month ago

ccl/backupccl.TestRestoreDatabaseVersusTable failed with artifacts on release-24.3 @ c077ebf6e98bcd579481b93c83f14184ab94f2e6:

goroutine 30230 gp=0x4011b7ce00 m=nil [select]:
runtime.gopark(0x400839def8?, 0x2?, 0x27?, 0x0?, 0x400839dec4?)
    GOROOT/src/runtime/proc.go:402 +0xc8 fp=0x400ecb7d70 sp=0x400ecb7d50 pc=0x453eb8
runtime.selectgo(0x400ecb7ef8, 0x400839dec0, 0x400d7524e0?, 0x0, 0x400c5c7a98?, 0x1)
    GOROOT/src/runtime/select.go:327 +0x614 fp=0x400ecb7e80 sp=0x400ecb7d70 pc=0x467584
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0x4011ef7c70, 0x1)
    external/org_golang_google_grpc/internal/transport/controlbuf.go:418 +0x14c fp=0x400ecb7f20 sp=0x400ecb7e80 pc=0xb5e21c
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0x400be4ad20)
    external/org_golang_google_grpc/internal/transport/controlbuf.go:552 +0x7c fp=0x400ecb7f80 sp=0x400ecb7f20 pc=0xb5ea8c
google.golang.org/grpc/internal/transport.NewServerTransport.func2()
    external/org_golang_google_grpc/internal/transport/http2_server.go:336 +0xd8 fp=0x400ecb7fd0 sp=0x400ecb7f80 pc=0xb73688
runtime.goexit({})
    src/runtime/asm_arm64.s:1222 +0x4 fp=0x400ecb7fd0 sp=0x400ecb7fd0 pc=0x48e8a4
created by google.golang.org/grpc/internal/transport.NewServerTransport in goroutine 30227
    external/org_golang_google_grpc/internal/transport/http2_server.go:333 +0x14d4

r0      0xffff4c916b08
r1      0x400c792000
r2      0xffff4c916b08
r3      0x40000
r4      0x0
r5      0x0
r6      0x400d59d818
r7      0x40000daf08
r8      0x95
r9      0x400
r10     0x0
r11     0x5
r12     0x1
r13     0x0
r14     0x0
r15     0xffffffffffffffff
r16     0xffff4ebfd5d0
r17     0xffff4f3fcd50
r18     0x971b80
r19     0x1
r20     0xffff4f3fcb18
r21     0xffff4f3fcbd0
r22     0x1
r23     0x6364
r24     0x7a61
r25     0x40000dc8f0
r26     0xffffffffffffffff
r27     0xffffffffffffff80
r28     0x400c2121c0
r29     0xffff4f3fcca8
lr      0x437fd4
sp      0xffff4f3fccb0
pc      0x42b81c
fault   0x20

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/disaster-recovery

Jira issue: CRDB-43874

msbutler commented 1 month ago

This looks like a segfault in the Go runtime. I took a quick look at the stack dump and was unable to trace the segfault back to CRDB code.

Some notes:

I doubt this has to do with backupccl code:

=== RUN   TestRestoreDatabaseVersusTable
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/47447a7ed84475b6aaa4b9399a882ce0/logTestRestoreDatabaseVersusTable2978839770
    test_log_scope.go:76: use -show-logs to present logs inline
    test_server_shim.go:152: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
=== RUN   TestRestoreDatabaseVersusTable/incomplete-db
    test_server_shim.go:152: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
SIGSEGV: segmentation violation
PC=0x42b81c m=19 sigcode=1 addr=0x20

goroutine 0 gp=0x400c2121c0 m=19 mp=0x400c210008 [idle]:
runtime.(*mspan).typePointersOfUnchecked(0x40168850e0?, 0x4015086c00?)
  GOROOT/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0xffff4f3fccd0 sp=0xffff4f3fccb0 pc=0x42b81c
runtime.scanobject(0x400c792000, 0x40000dc168)
  GOROOT/src/runtime/mgcmark.go:1441 +0x1c4 fp=0xffff4f3fcd60 sp=0xffff4f3fccd0 pc=0x437fd4
runtime.gcDrain(0x40000dc168, 0x2)
  GOROOT/src/runtime/mgcmark.go:1242 +0x1d4 fp=0xffff4f3fcdd0 sp=0xffff4f3fcd60 pc=0x437774
runtime.gcDrainMarkWorkerDedicated(...)
  GOROOT/src/runtime/mgcmark.go:1124
runtime.gcBgMarkWorker.func2()
  GOROOT/src/runtime/mgc.go:1402 +0x154 fp=0xffff4f3fce20 sp=0xffff4f3fcdd0 pc=0x433a34
runtime.systemstack(0x0)
  src/runtime/asm_arm64.s:243 +0x6c fp=0xffff4f3fce30 sp=0xffff4f3fce20 pc=0x48c3fc

goroutine 38 gp=0x4000a80a80 m=19 mp=0x400c210008 [GC worker (active)]:
runtime.systemstack_switch()
  src/runtime/asm_arm64.s:200 +0x8 fp=0x4000a88730 sp=0x4000a88720 pc=0x48c378
runtime.gcBgMarkWorker()
  GOROOT/src/runtime/mgc.go:1370 +0x204 fp=0x4000a887d0 sp=0x4000a88730 pc=0x433614
runtime.goexit({})
  src/runtime/asm_arm64.s:1222 +0x4 fp=0x4000a887d0 sp=0x4000a887d0 pc=0x48e8a4
created by runtime.gcBgMarkStartWorkers in goroutine 1
  GOROOT/src/runtime/mgc.go:1234 +0x28
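
One way to read the signal line above: the faulting address, 0x20, is far too small to be a real heap address, which is the usual signature of a load through a nil (or zeroed) pointer plus a small field offset. Below is a minimal Go sketch, not part of CockroachDB or the Go runtime, that pulls `PC` and `addr` out of such a line and applies that heuristic. The names `parseFaultLine` and `nearNil` and the 4 KiB threshold are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// faultInfo holds the two fields of interest from a Go runtime signal line
// such as "PC=0x42b81c m=19 sigcode=1 addr=0x20".
type faultInfo struct {
	PC   uint64
	Addr uint64
}

// parseFaultLine extracts the PC and addr values from a signal line.
// It returns an error if either field is missing or malformed.
func parseFaultLine(line string) (faultInfo, error) {
	var fi faultInfo
	var sawPC, sawAddr bool
	for _, field := range strings.Fields(line) {
		key, val, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		switch key {
		case "PC", "addr":
			// Base 0 lets ParseUint accept the "0x" prefix.
			n, err := strconv.ParseUint(val, 0, 64)
			if err != nil {
				return fi, fmt.Errorf("parsing %s: %w", key, err)
			}
			if key == "PC" {
				fi.PC, sawPC = n, true
			} else {
				fi.Addr, sawAddr = n, true
			}
		}
	}
	if !sawPC || !sawAddr {
		return fi, fmt.Errorf("missing PC or addr in %q", line)
	}
	return fi, nil
}

// nearNil reports whether the faulting address lies in the first 4 KiB
// (an assumed threshold), which usually means a field was read through a
// nil or zeroed pointer.
func (fi faultInfo) nearNil() bool {
	return fi.Addr < 4096
}

func main() {
	fi, err := parseFaultLine("PC=0x42b81c m=19 sigcode=1 addr=0x20")
	if err != nil {
		panic(err)
	}
	fmt.Printf("pc=%#x addr=%#x nearNil=%t\n", fi.PC, fi.Addr, fi.nearNil())
}
```

For the line in this report the heuristic flags a near-nil fault. That is consistent with the runtime following a bad pointer during GC marking, but it says nothing about what produced that pointer.
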

benbardin commented 1 month ago

What's a good next step here? Should this (retroactively) block the beta?

msbutler commented 1 month ago

I don't think so, but I can ask around.

benbardin commented 3 weeks ago

Provisionally assigning to Storage, on the hypothesis that this could be a bug with unsafe memory usage and they would be best equipped to track it down further. Thank you!

RaduBerinde commented 3 weeks ago

I have been trying to repro on an ARM AWS node (same machine type as the failed test) with no luck so far. Whatever this is, it must be extremely rare. I filed https://github.com/cockroachdb/cockroach/issues/134312 to upgrade Go to 1.22.8, which includes a fix that may, in principle, be relevant.
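
For anyone who wants to attempt a similar repro, here is a rough Go sketch of a run-until-crash loop. It is not the tooling used above: the binary path comes from the command line, the run count of 1000 is arbitrary, and setting `GOGC=1` (to make the collector run far more often, since the crash is in a GC mark worker) is only an assumption about what might raise the hit rate. `runUntilCrash` is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// runUntilCrash calls run up to maxRuns times. It returns the 1-based index
// and output of the first run whose output contains marker, and whether such
// a run was seen at all.
func runUntilCrash(maxRuns int, marker string, run func() ([]byte, error)) (int, []byte, bool) {
	for i := 1; i <= maxRuns; i++ {
		// A crashing process exits non-zero, so the error alone cannot
		// distinguish a crash from an ordinary test failure; inspect the
		// output instead.
		out, _ := run()
		if bytes.Contains(out, []byte(marker)) {
			return i, out, true
		}
	}
	return maxRuns, nil, false
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: repro <test-binary> [args...]")
		return
	}
	run := func() ([]byte, error) {
		cmd := exec.Command(os.Args[1], os.Args[2:]...)
		// Assumption: a very low GOGC forces frequent GC cycles and may
		// make a GC-time crash more likely to surface.
		cmd.Env = append(os.Environ(), "GOGC=1")
		return cmd.CombinedOutput()
	}
	if i, out, ok := runUntilCrash(1000, "SIGSEGV", run); ok {
		fmt.Printf("crash on run %d:\n%s", i, out)
	} else {
		fmt.Println("no crash observed")
	}
}
```

Passing the runner as a function keeps the loop itself testable without a real binary. A clean result from a loop like this only shows the crash is rare, not that it is gone.
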

benbardin commented 3 weeks ago

Makes sense to me. Thank you very much, Radu!

RaduBerinde commented 3 weeks ago

Still no luck reproducing. I am removing the release-blocker label, since the only course of action here is probably to upgrade Go (and that issue is marked as a blocker).

RaduBerinde commented 2 weeks ago

Go was upgraded, which will hopefully address this. I was unable to reproduce the crash, so there is not much more we can do here.
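
As a sanity check after an upgrade like this, a binary can report whether it was built with at least the target toolchain. The sketch below uses the standard `runtime.Version` and the `go/version` package (available since Go 1.22); `atLeast` is a hypothetical helper, and the 1.22.8 threshold is taken from the upgrade issue linked earlier in the thread.

```go
package main

import (
	"fmt"
	"go/version"
	"runtime"
	"strings"
)

// atLeast reports whether the toolchain string have (as returned by
// runtime.Version) is a valid release version that is >= want.
// Development builds and other unparsable strings yield false.
func atLeast(have, want string) bool {
	// runtime.Version may append extra tokens (for example experiment
	// tags) after the version, so keep only the first field.
	fields := strings.Fields(have)
	if len(fields) == 0 || !version.IsValid(fields[0]) {
		return false
	}
	return version.Compare(fields[0], want) >= 0
}

func main() {
	v := runtime.Version()
	fmt.Printf("built with %s; at least go1.22.8: %t\n", v, atLeast(v, "go1.22.8"))
}
```

This only confirms which toolchain produced the binary. Whether the Go fix actually covers this crash remains unverified, since the failure was never reproduced.
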