roachtest: tpcc/mixed-headroom/n5cpu16 failed #61225

Closed: cockroach-teamcity closed this issue 3 years ago

cockroach-teamcity commented 3 years ago

(roachtest).tpcc/mixed-headroom/n5cpu16 failed on release-20.2@c82e01111406152c655777087a5bc783dc7becf6:

          |  4191.0s        0           64.8           19.0   9126.8  11811.2  12348.0  12348.0 delivery
          |  4191.0s        0          133.6          189.4   9663.7  11811.2  11811.2  11811.2 newOrder
          |  4191.0s        0           26.9           19.0   5368.7   6710.9   7247.8   7247.8 orderStatus
          |  4191.0s        0          362.9          190.1   9663.7  11811.2  11811.2  12348.0 payment
          |  4191.0s        0           19.9           19.0   6442.5  10200.5  11274.3  11274.3 stockLevel
          |  4192.0s        0           54.0           19.0   5905.6  11274.3  12348.0  12348.0 delivery
          |  4192.0s        0          122.1          189.4   7247.8  10737.4  11811.2  11811.2 newOrder
          |  4192.0s        0           18.0           19.0   5100.3   6174.0   6710.9   6710.9 orderStatus
          |  4192.0s        0          278.2          190.1   8053.1  12884.9  12884.9  12884.9 payment
          |  4192.0s        0           10.0           19.0   5637.1   6442.5   6442.5   6442.5 stockLevel
          | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
          |  4193.0s        0           40.8           19.0   4160.7   5637.1   6442.5   6442.5 delivery
          |  4193.0s        0          898.5          189.6   9663.7  13421.8  14495.5  14495.5 newOrder
          |  4193.0s        0           31.8           19.0   3489.7   4831.8   5368.7   5368.7 orderStatus
          |  4193.0s        0          209.0          190.1   6174.0  10737.4  12884.9  12884.9 payment
          |  4193.0s        0           39.8           19.0   4563.4  10737.4  12884.9  12884.9 stockLevel
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

    cluster.go:2655,tpcc.go:174,tpcc.go:269,test_runner.go:755: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitor).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2643
          | main.(*monitor).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2651
          | main.runTPCC
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:174
          | main.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:269
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitor).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2699
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2613
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:5652
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:191
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1374
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Artifacts: [/tpcc/mixed-headroom/n5cpu16](https://teamcity.cockroachdb.com/viewLog.html?buildId=2725148&tab=artifacts#/tpcc/mixed-headroom/n5cpu16)

Related:

- #56419 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [T-bulkio](https://api.github.com/repos/cockroachdb/cockroach/labels/T-bulkio) [branch-master](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-master)

[See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpcc%2Fmixed-headroom%2Fn5cpu16.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

nvanbenschoten commented 3 years ago

We suddenly saw n2 become extremely slow, which caused a performance blip and let an error make it back to the client:

W210227 14:04:58.739006 370 kv/kvserver/store_raft.go:493 ⋮ [n2,s2,r105/3:‹/Table/5{5-6}›] handle raft ready: 11.6s [applied=1, batches=1, state_assertions=0]
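
For reference, that warning fires when a single raft ready iteration takes much longer than expected to apply and commit; the shape of the check is roughly the following (a simplified sketch with illustrative names and threshold, not the actual kvserver code):

```go
// Simplified sketch of a threshold-based "slow raft ready" warning.
// The threshold and field names are illustrative, not CRDB's actual values.
package kvsketch

import (
	"log"
	"time"
)

const slowRaftReadyWarnThreshold = 500 * time.Millisecond // assumed value

type readyStats struct {
	applied, batches, stateAssertions int
}

// handleRaftReady times one iteration of raft ready processing (applying
// entries and committing the Pebble batch) and logs if it was slow.
func handleRaftReady(process func() readyStats) {
	start := time.Now()
	stats := process()
	if elapsed := time.Since(start); elapsed > slowRaftReadyWarnThreshold {
		// Produces the kind of line shown above, where one iteration took 11.6s.
		log.Printf("handle raft ready: %.1fs [applied=%d, batches=%d, state_assertions=%d]",
			elapsed.Seconds(), stats.applied, stats.batches, stats.stateAssertions)
	}
}
```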

There's no real indication of what caused this slowdown in the CRDB or Pebble logs:

I210227 14:04:46.958073 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1371 ⋮ [n2] [JOB 13477] compacting L0 [016910] (2.0 M) + L3 [016912 016913 016892 016893 016894 016896] (21 M)
I210227 14:04:46.958415 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016914
I210227 14:04:47.061771 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016915
I210227 14:04:58.740740 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016916
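
One thing that does stand out in those lines is the spacing: within job 13477 there is about 100ms between sstables 016914 and 016915, then roughly 11.7s before 016916, which lines up with the 11.6s raft ready warning. A throwaway helper along these lines (hypothetical, not something in the repo) makes gaps like that easy to spot in a node's log:

```go
// Reads a CockroachDB-formatted log on stdin and flags lines that arrive
// more than a second after the previous one. Purely illustrative.
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	// Matches the "I210227 14:04:46.958073" prefix: severity letter, then timestamp.
	const layout = "060102 15:04:05.000000"
	var prev time.Time
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		if len(line) < 23 {
			continue
		}
		ts, err := time.Parse(layout, line[1:23])
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			if gap := ts.Sub(prev); gap > time.Second {
				fmt.Printf("%6.1fs gap before: %s\n", gap.Seconds(), line)
			}
		}
		prev = ts
	}
}
```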

We do see a buildup in the goroutine count during the blip:

# before
I210227 14:04:44.990102 420 server/status/runtime.go:525 ⋮ [n2] runtime stats: 5.9 GiB RSS, 1318 goroutines, 509 MiB/422 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 16109.9 CGO/sec, 405.9/75.2 %(u/s)time, 0.0 %gc (6x), 60 MiB/259 MiB (r/w)net
# during
I210227 14:04:54.990857 420 server/status/runtime.go:525 ⋮ [n2] runtime stats: 5.9 GiB RSS, 2858 goroutines, 582 MiB/324 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 5064.1 CGO/sec, 140.0/28.1 %(u/s)time, 0.0 %gc (2x), 22 MiB/73 MiB (r/w)net

Because of this, we did capture a goroutine dump. The goroutines all point at the same storage-level slowdown: all 128 Raft scheduler goroutines are stuck writing to Pebble:

128: semacquire [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
    sync            sema.go:56                               runtime_Semacquire(*)
    sync            waitgroup.go:130                         (*WaitGroup).Wait(*)
    pebble          commit.go:393                            (*commitPipeline).publish(#829, *)
    pebble          commit.go:259                            (*commitPipeline).Commit(#829, *, 1, *, *)
    pebble          db.go:547                                (*DB).Apply(#1325, *, #255, #767, 0)
    pebble          batch.go:727                             (*Batch).Commit(...)
    storage         pebble_batch.go:342                      (*pebbleBatch).Commit(*, #766, 0, *)
    kvserver        replica_raft.go:705                      (*Replica).handleRaftReadyRaftMuLocked(*, #2, *, 0, 0, 0, 0, 0, 0, 0, ...)
    kvserver        replica_raft.go:425                      (*Replica).handleRaftReady(*, #2, *, 0, 0, 0, 0, 0, 0, 0, ...)
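
The commit pipeline publishes batches in commit order, so once one batch's WAL sync stalls, every batch queued behind it has to wait. A heavily simplified sketch of that wait pattern (not Pebble's actual implementation, just the shape of it):

```go
// Heavily simplified sketch of an in-order commit/publish pipeline. Not
// Pebble's actual code; it only illustrates why every committer blocks
// once the writer ahead of it is stuck in an fdatasync.
package commitsketch

import "sync"

type batch struct {
	synced    bool           // WAL write (and sync) for this batch has completed
	published sync.WaitGroup // signalled once the batch is visible, in order
}

type pipeline struct {
	mu    sync.Mutex
	queue []*batch
}

func (p *pipeline) commit(b *batch, walWrite func()) {
	b.published.Add(1)
	p.mu.Lock()
	p.queue = append(p.queue, b)
	p.mu.Unlock()

	walWrite() // if the fdatasync behind this stalls, so does everything below

	p.mu.Lock()
	b.synced = true
	// Publish, strictly in queue order, every batch at the head that is synced.
	// A batch still waiting on its WAL sync blocks everything queued after it.
	for len(p.queue) > 0 && p.queue[0].synced {
		p.queue[0].published.Done()
		p.queue = p.queue[1:]
	}
	p.mu.Unlock()

	// This Wait is the semacquire all 128 raft scheduler goroutines sat in:
	// a batch behind a stalled sync cannot be published until it completes.
	b.published.Wait()
}
```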

The liveness heartbeat goroutine is stuck in the same place, which explains the loss of liveness on this node (by design):

1: semacquire [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
    sync            sema.go:56                               runtime_Semacquire(*uint32(#18894))
    sync            waitgroup.go:130                         (*WaitGroup).Wait(*WaitGroup(#18893))
    pebble          commit.go:393                            (*commitPipeline).publish(*commitPipeline(#829), *Batch(#18892))
    pebble          commit.go:259                            (*commitPipeline).Commit(*commitPipeline(#829), *Batch(#18892), bool(#50053), #18892)
    pebble          db.go:547                                (*DB).Apply(*DB(#1325), *Batch(#18892), *WriteOptions(#255), 0x0, 0x0)
    pebble          batch.go:727                             (*Batch).Commit(...)
    storage         pebble_batch.go:342                      (*pebbleBatch).Commit(*pebbleBatch(#1360), EngineKey(0x1), 0x0)
    storage         engine.go:643                            WriteSyncNoop(#195, #46473, #232, #1407, 0x0, 0x0)
    kvserver        node_liveness.go:1124                    (*NodeLiveness).updateLiveness(#1642, #195, #46473, 2, 1, #68577, 0, 0, 2, 1, ...)
    kvserver        node_liveness.go:878                     (*NodeLiveness).heartbeatInternal(#1642, #195, #46473, 2, 1, #68575, 0, 0, 0, 0, ...)
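
The `WriteSyncNoop` frame is the notable part: the heartbeat path does a small synced write against the local engine, so a node whose disk has stalled cannot keep extending its liveness. A minimal sketch of that pattern (illustrative names and timeout, not the actual node_liveness code):

```go
// Sketch of gating a liveness heartbeat on a synced no-op disk write.
// Names and the timeout value are illustrative, not CRDB's actual code.
package livenesssketch

import (
	"context"
	"fmt"
	"time"
)

// Engine is whatever storage handle the node heartbeats against.
type Engine interface {
	// WriteSyncNoop performs a tiny durable write and returns once it is synced.
	WriteSyncNoop(ctx context.Context) error
}

// heartbeat refuses to extend this node's liveness record unless the local
// disk can complete a synced write promptly. With a stalled disk (as in the
// goroutine above) the heartbeat blocks or times out and liveness lapses.
func heartbeat(ctx context.Context, eng Engine, extend func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 4*time.Second) // assumed timeout
	defer cancel()
	if err := eng.WriteSyncNoop(ctx); err != nil {
		return fmt.Errorf("disk not responsive, skipping liveness heartbeat: %w", err)
	}
	return extend(ctx)
}
```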

Interestingly, the dump also shows other goroutines stuck in syscalls that touch the filesystem:

1: syscall [Created by pebble.(*DB).maybeScheduleCompactionPicker @ compaction.go:1139]
    syscall         asm_linux_amd64.s:18                     Syscall(0x4b, 0x5a, 0, 0, 0, #55, #50003)
    syscall         zsyscall_linux_amd64.go:429              Fdatasync(90, []byte(#19 len=32 cap=0))
    vfs             syncing_file_linux.go:62                 (*syncingFile).syncFdatasync.func1()
    vfs             syncing_file.go:65                       NewSyncingFile.func1(File(#28848))
    vfs             syncing_file_linux.go:61                 (*syncingFile).syncFdatasync(*syncingFile(#50003), #35, #28847)
    vfs             syncing_file.go:128                      (*syncingFile).Sync(*syncingFile(#50003), 0x2cfb, 0x0)
    sstable         writer.go:672                            (*Writer).Close(*Writer(#40336), 0x0, 0x0)
    pebble          compaction.go:1652                       (*DB).runCompaction.func4(*deleteCompactionHint(#11087), Compare(0x13), *fileMetadata(#11087), []uint64(0x13 len=0 cap=0))
    pebble          compaction.go:2040                       (*DB).runCompaction(*DB(#1325), 13477, *compaction(#45594), pacer(#179), #47271, #3574, 0x2, 0x2, 0x0, ...)

1: syscall [Created by record.NewLogWriter @ log_writer.go:313]
    syscall         asm_linux_amd64.s:18                     Syscall(0x4b, 0x12, 0, 0, 0, #2242, #2243)
    syscall         zsyscall_linux_amd64.go:429              Fdatasync(18, []byte(#19 len=32 cap=0))
    vfs             syncing_file_linux.go:62                 (*syncingFile).syncFdatasync.func1()
    vfs             syncing_file.go:65                       NewSyncingFile.func1(File(#27024))
    vfs             syncing_file_linux.go:61                 (*syncingFile).syncFdatasync(*syncingFile(#3965), 0x0, 0x1d9)
    vfs             syncing_file.go:128                      (*syncingFile).Sync(*syncingFile(#3965), 0x0, 0x0)
    record          log_writer.go:477                        (*LogWriter).flushPending(*LogWriter(#65614), []byte(#7451 len=473 cap=10106), []*block(#42926 len=0 cap=16), 8594229561298, 79260161, 0x0, ...)
    record          log_writer.go:426                        (*LogWriter).flushLoop(*LogWriter(#65614), Context(#2))
    pprof           runtime.go:37                            Do(Context(#2), LabelSet(#878), func(0x1))

1: syscall [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
    syscall         asm_linux_amd64.s:18                     Syscall(0xd9, 0x3a4, #12289, 0x2000, 0x700, 0x640, #21)
    syscall         zsyscall_linux_amd64.go:465              Getdents(0x3a4, #12289, 0x2000, 0x2000, #67899, 0x0, #59485)
    syscall         syscall_linux.go:863                     ReadDirent(...)
    poll            fd_unix.go:416                           (*FD).ReadDirent(*FD(#38756), #12289, 0x2000, 0x2000, 0x0, 0x0, 0x0)
    os              dir_unix.go:50                           (*File).readdirnames(*File(#28339), 18446744073709551615, readdirMode(0x0), #28339, 0x0, 0x0)
    os              dir.go:48                                (*File).Readdirnames(...)
    filepath        path.go:419                              readDirNames(string(#67903, len=20), FileInfo(0x14), WalkFunc(0x9), #32)
    filepath        path.go:361                              walk(string(#67903, len=20), string(#216, len=825122799328), #59487, 0x0, 0x0)
    filepath        path.go:404                              Walk(string(#67903, len=20), DirEntry(#59487), WalkDirFunc(0x0))
    storage         mvcc.go:3717                             computeCapacity(SimpleMVCCIterator(#67903), Key(0x0), Key(0x0), 0, <unknown>(0x0), <unknown>(0x0), ...)

So I'm willing to chalk this up to something external to CRDB itself: unrelated fdatasync calls and even a plain directory scan were all blocked at the same time, which points at a stall in the underlying disk or filesystem rather than anything in CRDB's own code.
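
If it was the disk, the quickest way to confirm on a live node is to time a small synced write against the store directory; an ad-hoc probe like the following (not part of the repo) would have shown the same multi-second stall:

```go
// Ad-hoc probe: time a small synced write in the given directory. On a
// healthy disk this takes single-digit milliseconds; during a stall like
// the one above it would take seconds. Illustrative only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := os.Args[1] // e.g. the node's store directory
	path := filepath.Join(dir, ".disk-stall-probe")

	start := time.Now()
	f, err := os.Create(path)
	if err != nil {
		panic(err)
	}
	if _, err := f.Write(make([]byte, 4096)); err != nil {
		panic(err)
	}
	if err := f.Sync(); err != nil { // same class of syscall the dump shows stuck
		panic(err)
	}
	f.Close()
	os.Remove(path)

	fmt.Printf("synced 4 KiB write took %s\n", time.Since(start))
}
```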