We suddenly saw n2 become extremely slow, which caused a performance blip and allowed an error to make it back to the client:
W210227 14:04:58.739006 370 kv/kvserver/store_raft.go:493 ⋮ [n2,s2,r105/3:‹/Table/5{5-6}›] handle raft ready: 11.6s [applied=1, batches=1, state_assertions=0]
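For context, this warning is emitted when a single Raft Ready cycle takes longer than a warning threshold. A minimal sketch of that timing-and-warn pattern, with an assumed 500ms threshold and illustrative helper names rather than the actual store_raft.go code:

```go
// Sketch of the slow-Ready warning pattern; the helper names and the
// 500ms threshold are illustrative, not the actual CRDB constants.
package main

import (
	"log"
	"time"
)

const slowReadyWarnThreshold = 500 * time.Millisecond // assumed value

func handleRaftReady(process func()) {
	start := time.Now()
	process() // append/apply entries, commit the write batch to the engine, etc.
	if elapsed := time.Since(start); elapsed >= slowReadyWarnThreshold {
		// In the log line above this fired with elapsed ≈ 11.6s.
		log.Printf("handle raft ready: %.1fs", elapsed.Seconds())
	}
}

func main() {
	handleRaftReady(func() { time.Sleep(time.Second) })
}
```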
There's no real indication of what caused this slowdown in the CRDB or Pebble logs:
I210227 14:04:46.958073 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1371 ⋮ [n2] [JOB 13477] compacting L0 [016910] (2.0 M) + L3 [016912 016913 016892 016893 016894 016896] (21 M)
I210227 14:04:46.958415 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016914
I210227 14:04:47.061771 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016915
I210227 14:04:58.740740 13977441 vendor/github.com/cockroachdb/pebble/compaction.go:1561 ⋮ [n2] [JOB 13477] compacting: sstable created 016916
We do see a buildup in the goroutine count during the blip:
# before
I210227 14:04:44.990102 420 server/status/runtime.go:525 ⋮ [n2] runtime stats: 5.9 GiB RSS, 1318 goroutines, 509 MiB/422 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 16109.9 CGO/sec, 405.9/75.2 %(u/s)time, 0.0 %gc (6x), 60 MiB/259 MiB (r/w)net
# during
I210227 14:04:54.990857 420 server/status/runtime.go:525 ⋮ [n2] runtime stats: 5.9 GiB RSS, 2858 goroutines, 582 MiB/324 MiB/1.2 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 5064.1 CGO/sec, 140.0/28.1 %(u/s)time, 0.0 %gc (2x), 22 MiB/73 MiB (r/w)net
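For reference, a goroutine dump like the ones below can be pulled from any running Go process via the standard runtime/pprof goroutine profile (CRDB serves the same profile on its /debug/pprof endpoints); a minimal sketch:

```go
// Minimal sketch: write a full goroutine dump to stderr. debug=2
// prints every goroutine with its state ("semacquire", "syscall", ...)
// and stack, in the same format as an unrecovered panic.
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 2); err != nil {
		panic(err)
	}
}
```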
Because of this, we captured a goroutine dump. The goroutines all appear to be stuck behind the same Raft-level slowdown. All 128 Raft scheduler goroutines are blocked writing to Pebble; a sketch of this blocking pattern follows the stack:
128: semacquire [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
sync sema.go:56 runtime_Semacquire(*)
sync waitgroup.go:130 (*WaitGroup).Wait(*)
pebble commit.go:393 (*commitPipeline).publish(#829, *)
pebble commit.go:259 (*commitPipeline).Commit(#829, *, 1, *, *)
pebble db.go:547 (*DB).Apply(#1325, *, #255, #767, 0)
pebble batch.go:727 (*Batch).Commit(...)
storage pebble_batch.go:342 (*pebbleBatch).Commit(*, #766, 0, *)
kvserver replica_raft.go:705 (*Replica).handleRaftReadyRaftMuLocked(*, #2, *, 0, 0, 0, 0, 0, 0, 0, ...)
kvserver replica_raft.go:425 (*Replica).handleRaftReady(*, #2, *, 0, 0, 0, 0, 0, 0, 0, ...)
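Roughly, Pebble's commitPipeline.publish makes batches visible in sequence-number order: each committer waits on a sync.WaitGroup until its batch has been published and, for synced commits, until the WAL fdatasync has completed. A single slow fsync therefore backs up every later committer. Below is a toy sketch of that blocking shape only; it is not Pebble's actual implementation:

```go
// Toy sketch of a commit pipeline stall: committers block on a
// WaitGroup that is only released after a (simulated) slow fsync.
package main

import (
	"fmt"
	"sync"
	"time"
)

type batch struct {
	id        int
	published sync.WaitGroup // released once the batch is durable/visible
}

func main() {
	batches := make(chan *batch, 128)

	// Single "sync" goroutine, analogous to the WAL flush loop: it makes
	// batches durable in order and then publishes them.
	go func() {
		for b := range batches {
			time.Sleep(2 * time.Second) // simulate an fdatasync that has become slow
			b.published.Done()
		}
	}()

	var committers sync.WaitGroup
	for i := 0; i < 4; i++ {
		committers.Add(1)
		go func(id int) {
			defer committers.Done()
			b := &batch{id: id}
			b.published.Add(1)
			batches <- b
			start := time.Now()
			b.published.Wait() // where the Raft scheduler goroutines above are parked
			fmt.Printf("batch %d committed after %s\n", id, time.Since(start))
		}(i)
	}
	committers.Wait()
}
```

Running it, the four committers report roughly 2s, 4s, 6s, and 8s: one slow sync serializes everything queued behind it, which is why all 128 scheduler goroutines pile up at once.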
As is the liveness heartbeat goroutine (which explains the loss of liveness on this node, by design):
1: semacquire [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
sync sema.go:56 runtime_Semacquire(*uint32(#18894))
sync waitgroup.go:130 (*WaitGroup).Wait(*WaitGroup(#18893))
pebble commit.go:393 (*commitPipeline).publish(*commitPipeline(#829), *Batch(#18892))
pebble commit.go:259 (*commitPipeline).Commit(*commitPipeline(#829), *Batch(#18892), bool(#50053), #18892)
pebble db.go:547 (*DB).Apply(*DB(#1325), *Batch(#18892), *WriteOptions(#255), 0x0, 0x0)
pebble batch.go:727 (*Batch).Commit(...)
storage pebble_batch.go:342 (*pebbleBatch).Commit(*pebbleBatch(#1360), EngineKey(0x1), 0x0)
storage engine.go:643 WriteSyncNoop(#195, #46473, #232, #1407, 0x0, 0x0)
kvserver node_liveness.go:1124 (*NodeLiveness).updateLiveness(#1642, #195, #46473, 2, 1, #68577, 0, 0, 2, 1, ...)
kvserver node_liveness.go:878 (*NodeLiveness).heartbeatInternal(#1642, #195, #46473, 2, 1, #68575, 0, 0, 0, 0, ...)
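The WriteSyncNoop frame in that stack is the disk-health gate: the heartbeat issues a synced write that mutates nothing, so a node whose disk has stalled cannot keep its liveness record fresh. A minimal sketch of the same idea written directly against the public Pebble API (this is not CRDB's actual WriteSyncNoop implementation):

```go
// Sketch: a "sync no-op" write that forces a WAL fsync without
// mutating any keys, usable as a disk-health probe.
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func writeSyncNoop(db *pebble.DB) error {
	b := db.NewBatch()
	defer b.Close()
	// LogData adds a WAL-only record: nothing reaches the memtable or
	// sstables, but committing with Sync still waits for the fdatasync.
	if err := b.LogData([]byte("noop"), nil); err != nil {
		return err
	}
	return b.Commit(pebble.Sync)
}

func main() {
	db, err := pebble.Open("/tmp/pebble-noop-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := writeSyncNoop(db); err != nil {
		log.Fatal(err)
	}
}
```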
Interestingly, the goroutine dump also shows other goroutines blocked inside filesystem syscalls (see the sketch after these stacks):
1: syscall [Created by pebble.(*DB).maybeScheduleCompactionPicker @ compaction.go:1139]
syscall asm_linux_amd64.s:18 Syscall(0x4b, 0x5a, 0, 0, 0, #55, #50003)
syscall zsyscall_linux_amd64.go:429 Fdatasync(90, []byte(#19 len=32 cap=0))
vfs syncing_file_linux.go:62 (*syncingFile).syncFdatasync.func1()
vfs syncing_file.go:65 NewSyncingFile.func1(File(#28848))
vfs syncing_file_linux.go:61 (*syncingFile).syncFdatasync(*syncingFile(#50003), #35, #28847)
vfs syncing_file.go:128 (*syncingFile).Sync(*syncingFile(#50003), 0x2cfb, 0x0)
sstable writer.go:672 (*Writer).Close(*Writer(#40336), 0x0, 0x0)
pebble compaction.go:1652 (*DB).runCompaction.func4(*deleteCompactionHint(#11087), Compare(0x13), *fileMetadata(#11087), []uint64(0x13 len=0 cap=0))
pebble compaction.go:2040 (*DB).runCompaction(*DB(#1325), 13477, *compaction(#45594), pacer(#179), #47271, #3574, 0x2, 0x2, 0x0, ...)
1: syscall [Created by record.NewLogWriter @ log_writer.go:313]
syscall asm_linux_amd64.s:18 Syscall(0x4b, 0x12, 0, 0, 0, #2242, #2243)
syscall zsyscall_linux_amd64.go:429 Fdatasync(18, []byte(#19 len=32 cap=0))
vfs syncing_file_linux.go:62 (*syncingFile).syncFdatasync.func1()
vfs syncing_file.go:65 NewSyncingFile.func1(File(#27024))
vfs syncing_file_linux.go:61 (*syncingFile).syncFdatasync(*syncingFile(#3965), 0x0, 0x1d9)
vfs syncing_file.go:128 (*syncingFile).Sync(*syncingFile(#3965), 0x0, 0x0)
record log_writer.go:477 (*LogWriter).flushPending(*LogWriter(#65614), []byte(#7451 len=473 cap=10106), []*block(#42926 len=0 cap=16), 8594229561298, 79260161, 0x0, ...)
record log_writer.go:426 (*LogWriter).flushLoop(*LogWriter(#65614), Context(#2))
pprof runtime.go:37 Do(Context(#2), LabelSet(#878), func(0x1))
1: syscall [Created by stop.(*Stopper).RunWorker @ stopper.go:215]
syscall asm_linux_amd64.s:18 Syscall(0xd9, 0x3a4, #12289, 0x2000, 0x700, 0x640, #21)
syscall zsyscall_linux_amd64.go:465 Getdents(0x3a4, #12289, 0x2000, 0x2000, #67899, 0x0, #59485)
syscall syscall_linux.go:863 ReadDirent(...)
poll fd_unix.go:416 (*FD).ReadDirent(*FD(#38756), #12289, 0x2000, 0x2000, 0x0, 0x0, 0x0)
os dir_unix.go:50 (*File).readdirnames(*File(#28339), 18446744073709551615, readdirMode(0x0), #28339, 0x0, 0x0)
os dir.go:48 (*File).Readdirnames(...)
filepath path.go:419 readDirNames(string(#67903, len=20), FileInfo(0x14), WalkFunc(0x9), #32)
filepath path.go:361 walk(string(#67903, len=20), string(#216, len=825122799328), #59487, 0x0, 0x0)
filepath path.go:404 Walk(string(#67903, len=20), DirEntry(#59487), WalkDirFunc(0x0))
storage mvcc.go:3717 computeCapacity(SimpleMVCCIterator(#67903), Key(0x0), Key(0x0), 0, <unknown>(0x0), <unknown>(0x0), ...)
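All three of those goroutines are parked inside blocking filesystem syscalls: fdatasync on the sstable being written and on the WAL, and getdents from the store-capacity walker. For illustration, a sketch of the kind of call they are sitting in, using golang.org/x/sys/unix on Linux (not CRDB code):

```go
// Sketch: time a blocking fdatasync on a file, the same syscall the
// Pebble compaction and WAL goroutines above are stuck in.
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.CreateTemp("", "fdatasync-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		panic(err)
	}

	start := time.Now()
	// On a healthy disk this returns in milliseconds; during the stall
	// above, the equivalent calls were taking many seconds.
	if err := unix.Fdatasync(int(f.Fd())); err != nil {
		panic(err)
	}
	fmt.Printf("fdatasync took %s\n", time.Since(start))
}
```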
Since everything that is blocked here bottoms out in filesystem syscalls, I'm willing to chalk this up to something external to CRDB itself, most likely a transient slowdown of the disk.
(roachtest).tpcc/mixed-headroom/n5cpu16 failed on release-20.2@c82e01111406152c655777087a5bc783dc7becf6:
Artifacts: [/tpcc/mixed-headroom/n5cpu16](https://teamcity.cockroachdb.com/viewLog.html?buildId=2725148&tab=artifacts#/tpcc/mixed-headroom/n5cpu16)

Related:
- #56419 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [T-bulkio](https://api.github.com/repos/cockroachdb/cockroach/labels/T-bulkio) [branch-master](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-master)

[See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpcc%2Fmixed-headroom%2Fn5cpu16.%2A&sort=title&restgroup=false&display=lastcommented+project)

powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)