Open cockroach-teamcity opened 11 months ago
looking into this
Job was 56% done:
905777184646561793 RESTORE RESTORE FROM 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48/2022/12/20-225543.90?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' root {115,132,120,146,123,113,112,143,140,119,133,109,105,116,157,141,138,121,106,129,136,142,156,144,147,118,127,150,139,107,111,114,128,154,152,151,126,135,145,122,131,110,124,137,117,108,125,155,130,153,149,148,134} running NULL 2023-10-05 07:40:59.305935+00 2023-10-05 07:41:00.002923+00 NULL 2023-10-05 08:42:09.565124+00 0.5623053312301636 NULL 1 414291085975248250 2023-10-05 07:41:00.002924+00 2023-10-05 07:41:30.002924+00 1 NULL NULL
The job starts at 07:40 and a couple of minutes later until the test times out we see slow range RPCs
for AddSSTables. Towards the end of the test we see as high as 400s slow range RPCs.
❯ rg 'slow range RPC.*AddSSTable*' only-jobs | wc -l
16
teamcity-12059861-1696483956-28-n8cpu8-0005> W231005 08:37:53.519417 51649 kv/kvclient/kvcoord/dist_sender.go:1882 ⋮ [T1,Vsystem,n5,f‹981c994f›,job=905777184646561793,distsql.gateway=1,dist
sql.appname=‹$ internal-resume-job-905777184646561793›] 5810 slow range RPC: have been waiting 473.09s (1 attempts) for RPC AddSSTable [/Table/139/1/‹200000172921039›/‹0›,/Table/139/1/‹200
000173048214›/‹2›/‹1›/‹NULL›) to r813:‹/Table/139/1/20000017{1649339-4998955}› [(n5,s5):4, (n7,s7):2, (n4,s4):5, next=6, gen=101, sticky=1696495568.402488951,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›
Similarly theres 19 instances of slow AdminSplits and 3 of AdminScatter all in the 90-100s range.
After the job started at 07:40 we saw an inbox communication error and a retry at 07:45:
teamcity-12059861-1696483956-28-n8cpu8-0001> W231005 07:45:56.669497 12809 ccl/backupccl/restore_job.go:210 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=905777184646561793›] 1465 encountered retryable error: importing 938 ranges: ‹inbox communication error›: grpc: ‹context canceled› [code 1/Canceled]
We do now have much better checkpointing to avoid redoing done work and this shouldn't have a major impact on performance but its worth noting anyways. We have a very high number of connection reset IO retries when reading the backup from the bucket:
❯ rg 'connection reset' only-jobs | wc -l
1626
This would definitely cause the restore to run for longer since we're retrying on a large number of the reads from s3. We also generally see raft struggling to keep up pretty much immediately after the restore starts. Messages like acquired raft proposal quota after 15.404145179s
go up to almost 30s and are being logged constantly.
Overall, looks like the cluster was pretty overwhelmed and unable to serve KV requests. I will ping KV in case they're interested in giving this cluster a look.
Is this any related to #111160? Particularly, do we see one node spinning at 100% CPU?
Upd: Yes, looks a lot like it. We should checkout CPU profiles for the overloaded n4
here.
The node spends 50% time in raft processing (and much of that is loading/storing sideloaded entries). Not that this looks particularly surprising though, for this test.
It would be nice to compare with a healthy node profile, but we don't have it here. Can run the test manually and collect some.
What surprises me though is how nicely n4
stays just around 2 GiB of memory usage. Is it possible that we just somehow get a particularly under-provisioned node? We would OOM though, instead of backpressuring like this.
cc @cockroachdb/replication
Writes started at 07:41, and CPU almost immediately spiraled up to 100%.
Counter-intuitively, n4
peaks by all means, but processes the least amount of SSTs.
The first CPU profile (when usage is above 80%) was captured at 07:41:26. I don't see anything unordinary here, everything takes a fair share of CPU.
One thing stood out in n4
stacks:
8915 goroutine 19884 [select, 59 minutes]:
8916 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0xc00234b500)
8917 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:205 +0x7a7
8918 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0x66a9158)
8919 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:511 +0x3c3
8920 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc005960580, {0x79221b0?, 0xc009318730?}, 0xc004d08360)
8921 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:189 +0x751
8922 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc004d08360?, {0x79221b0?, 0xc009318730?}, 0x8000101?)
8923 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:121 +0x25
8924 github.com/cockroachdb/cockroach/pkg/kv/kvserver.sendProbe({0x79221b0, 0xc009318730}, {0x7955830, 0xc005960580})
8925 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:226 +0x14b
8926 github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1({0x7922258?, 0xc0095e4ff0?})
8927 github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:200 +0xf7
8928 github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
8929 github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
8930 created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
8931 github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b
8932
@erikgrinaker are we ever allowed to run this method that long? It looks like we evaluated the thing but never got the result or cancel. It's a probe though, and also doesn't hold any locks.
We discussed this separately, but this does seem suspect -- we're basically waiting for the reproposal machinery to get this proposal through, and it isn't happening for whatever reason.
In this restore run, there is this periodic spike of proposals (from other node, n1
, not the slow n4
; n1
is probably the gateway). Every 2.5 minutes there is a spike to 35/s (I wonder why).
Those are not AddSST
s though. AddSST
s are sent at a rate <2/s.
Not reproposals either, we only see reproposals on n4
:
There are also only a couple of reproposals with a new LAI (NB: I made this graph show the absolute counter, not rate), from n4
:
Running a "happy" run to compare the behaviours. I do see the same periodic spikes from one node, so that's probably just the shape of the workload. There aren't many reproposals though, but notably it's not zero and constantly trickles at a rate of a few reproposals/minute.
In the happy run, the Raft portion of the CPU profile is hardly noticeable (looks similar on multiple nodes I've looked at):
The profile we've seen above is almost double of this one, and noticeably has a lot more raft processing.
@sumeerbhola Could you help reading this graph (TLDR from above is that n4
is using 100% CPU and super slow)? Why are we seeing that the number of slots on n4
is significantly higher than on other nodes? What does it mean? We also see that the used slots on this node spiked 2 times. The CPU was 100% at all times though.
The p50 log commit latency here is 10 seconds. Pretty sure this is just an overloaded disk or something.
Unfortunately, we don't have CPU iowait metrics. However, most of the CPU usage appears to be user, not sys, and I think iowait would show up as sys.
n3 and n5 are linearly falling behind on their Raft logs for some reason:
Can CPU starvation cause this commit latency jump (i.e. it's just the measurement error rather than disk slowness)? It appears that both CPU and disk are slow, but I'm not sure which is the cause. Good point re iowait time and sys time. Maybe we should consider improving this observability aspect.
Conjecture re nodes 3 and 5: maybe there is some range that node 4 can't deliver to them. It still holds a small fraction of leases.
Many of the slow Raft readys are simply slow log appends, although there are a few during apply and other too.
teamcity-12059861-1696483956-28-n8cpu8-0004> I231005 07:41:18.626887 311 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n4,s4,r200/4:‹/Table/134/1/20000008{05…-54…}›,raft] 387 raft ready handling: 2.12s [append=2.12s, apply=0.00s, non-blocking-sync=0.00s, other=0.00s], wrote [append-batch=264 B, append-sst=8.6 MiB (1), ]; node might be overloaded
We see replication AC firing a lot too (note to self: fix the log context and formatting here).
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055274 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4462 stream t1/s1 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055297 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4463 stream t1/s3 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055310 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4464 stream t1/s4 was blocked: durations: elastic 4.470939686stokens deducted: regular 0 B elastic 40 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055319 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4465 stream t1/s6 was blocked: durations: elastic 108.264µstokens deducted: regular 0 B elastic 74 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055328 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4466 stream t1/s8 was blocked: durations: elastic 88.412183mstokens deducted: regular 0 B elastic 52 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055334 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4467 stream t1/s7 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055342 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4468 stream t1/s2 was blocked: durations: elastic 238.371768mstokens deducted: regular 0 B elastic 22 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055348 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4469 stream t1/s5 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
Can CPU starvation cause this commit latency jump (i.e. it's just the measurement error rather than disk slowness)?
I'd be very surprised if it stalled for 10 seconds.
The goroutine scheduling latency is pretty poor though, at 100ms.
Conjecture re nodes 3 and 5: maybe there is some range that node 4 can't deliver to them. It still holds a small fraction of leases.
It's probably the other way around. The metric (on a leader) measures the number of entries that followers are behind on. In this case we probably have ranges on node 3 and 5 for which node 4 is a follower. Since 4 is slow, it falls behind them.
Interesting though, why only 3 and 5?
Had a look at one of the tripped circuit breakers on n4, for r201. n4 was originally the leaseholder, but then transferred the lease to n3:
07:43:30.197271 33732 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n3,replicate,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 702 transferring lease to s4
W
For some time, n3 was struggling to keep up with proposals:
W231005 07:43:34.982627 307 go.etcd.io/raft/v3/raft.go:799 ⋮ [T1,Vsystem,n3,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 682 2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal
n4's circuit breaker then trips on an AddSSTable that was submitted before the lease transfer:
E231005 07:43:54.894326 19930 kv/kvserver/replica_circuit_breaker.go:167 ⋮ [T1,Vsystem,n4,s4,r201/4:‹/Table/134/1/2000000{89…-94…}›] 1738 breaker: tripped with error: replica unavailable: (n4,s4):4 unable to serve request to r201:‹/Table/134/1/2000000{89890890-94780543}› [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=29, sticky=1696495264.371972605,0]: closed timestamp: 1696491763.182001913,0 (2023-10-05 07:42:43); raft status: {"id":"4","term":8,"vote":"0","commit":38,"lead":"2","raftState":"StateFollower","applied":38,"progress":{},"leadtransferee":"0"}: have been waiting 61.50s for slow proposal AddSSTable [/Table/134/1/‹200000090276075›/‹0›,/Table/134/1/‹200000090468864›/‹0›/‹NULL›)
W231005 07:44:31.776452 2037 kv/kvclient/kvcoord/dist_sender.go:1882 ⋮ [T1,Vsystem,n5,f‹eabe4ccd›,job=905777184646561793,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-905777184646561793›] 881 slow range RPC: have been waiting 108.56s (1 attempts) for RPC AddSSTable [/Table/134/1/‹200000090276075›/‹0›,/Table/134/1/‹200000090468864›/‹0›/‹NULL›) to r201:‹/Table/134/1/2000000{89890890-94780543}› [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=29, sticky=1696495264.371972605,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›
n3 then truncates its log, cutting n4 off:
I231005 07:44:56.197919 41026 kv/kvserver/raft_log_queue.go:725 ⋮ [T1,Vsystem,n3,raftlog,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 774 should truncate: true [truncate 30 entries to first index 73 (chosen via: last index); log too large (146 MiB > 16 MiB); implies 1 Raft snapshot]
And it keeps trying to send snapshots to n4, which aren't going through because the node is overloaded:
W231005 07:47:19.560255 53123 13@kv/kvserver/replica_command.go:2908 ⋮ [T1,Vsystem,n3,raftsnapshot,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 1058 attempt 1: delegate snapshot ‹range_id:201 coordinator_replica:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > recipient_replica:<node_id:4 store_id:4 replica_id:4 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > deprecated_priority:RECOVERY term:8 first_index:114 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:29 queue_on_delegate_len:-1 snap_id:82bee3ac-4f19-45cb-a384-02aafd829485 › request failed error sending couldn't accept ‹range_id:201 coordinator_replica:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > recipient_replica:<node_id:4 store_id:4 replica_id:4 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > deprecated_priority:RECOVERY term:8 first_index:114 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:29 queue_on_delegate_len:-1 snap_id:82bee3ac-4f19-45cb-a384-02aafd829485 ›: recv msg error: grpc: ‹giving up during snapshot reservation due to cluster setting "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded› [code 4/DeadlineExceeded]
This is all consistent with n4 being overloaded and unable to accept replication traffic. It's unclear if the reproposal behavior here is expected (i.e. that n4 should keep reproposing the AddSSTable long after it lost the lease and leadership), so that's worth doublechecking, but beyond that this seems expected.
I'll also note that follower pausing triggered for n4 as well, both on r201 and r813.
The inbound Raft traffic on n4 is basically a flat line, at ~10 MB/s, which seems very suspect compared to the other nodes that are non-flat. Possibly because it's bottlenecked on disk or some other resource. However, the disk was pushing more throughput earlier in the test.
Raft scheduler latency is pretty atrocious, which I think also points towards slow IO:
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 45f6344a2fc7d2d52f410e1ecad98f8c928cfeea:
(monitor.go:153).Wait: monitor failure: pq: backup from version 22.2 is older than the minimum restorable version 23.1
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1
Parameters: ROACHTEST_arch=amd64
, ROACHTEST_cloud=aws
, ROACHTEST_cpu=8
, ROACHTEST_encrypted=false
, ROACHTEST_fs=ext4
, ROACHTEST_localSSD=false
, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) _Grafana is not yet available for aws clusters_
The new failure is unrelated to this issue, and is a dup of #112381. Adding x-noreuse
label here to avoid the spam.
To reiterate on IO once again, here we have metrics reported by the OS. These correspond to diskstats that we collect (ultimately calling this) from the host node.
We are seeing that write.time
as well as weightedio.time
grow at a rate of 1s/s, i.e. the host is doing IO all the time. We also see that the rate at which it does IO is constant, and sticks around 30 MB/s.
Something happens at 7:54, after which the IO rate drops to about half, 15 MB/s; as well as the time being in IO.
Both this test failure and #111160 have the following line in logs only on one node, and that happens to be the node with the slow IO (n4
in this run, and n5
in #111160). Other nodes print a different line.
$ rg nvme
...
4.unredacted/cockroach.log
48:I231005 07:40:42.517775 79 server/config.go:868 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=ext4, blkdev=‹/dev/nvme1n1›, mnt=‹/mnt/data1› opts=‹rw,relatime›}}
Other nodes print this:
48:I231005 07:40:45.889008 80 server/config.go:868 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1611012k,mode=700,uid=1000,gid=1000,inode64›}}
This points pretty conclusively that it's disk configuration related. Not sure how to interpret this. The printed struct is populated in this func, which has a large TODO saying that the result is not reliable. But anyway, this all likely means that nodes are not configured symmetrically.
Worth noting: I ran a happy run of this test, and collected configurations from logs, they look like this too - there is one (sometimes multiple) node configured differently from others:
$ grep "cockroach.log.*store.*FileStoreProp" confs.txt
logs/cockroach.log:I231017 10:29:12.018113 95 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=ext4, blkdev=‹/dev/nvme1n1›, mnt=‹/mnt/data1› opts=‹rw,relatime›}}
logs/cockroach.log:I231017 10:29:18.500440 89 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:21.689279 77 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:24.914879 80 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:28.273792 78 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609332k,nr_inodes=402333,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:31.620834 86 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:34.890229 76 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609332k,nr_inodes=402333,mode=700,uid=1000,gid=1000,inode64›}}
logs/cockroach.log:I231017 10:29:38.074884 81 server/config.go:874 ⋮ [T1,Vsystem,n?] 15 store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
So it's not necessarily causing IO slowness unconditionally, but there is something in here.
Also:
$ rg "(mounted filesystem|user-1000)" *.journal* | less
// All nodes print something like:
6.journalctl.txt:Oct 05 07:32:34 ubuntu kernel: EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
6.journalctl.txt:Oct 05 07:33:03 ip-10-12-4-138 kernel: EXT4-fs (nvme1n1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
6.journalctl.txt:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
6.journalctl.txt:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
// Except this one:
4.journalctl.txt:Oct 05 07:32:48 ubuntu kernel: EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
4.journalctl.txt:Oct 05 07:36:18 ip-10-12-9-200 kernel: EXT4-fs (nvme1n1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
@srosenberg Is there some kind of race or metamorphism here in the environment initialization path?
I ran an experiment in which I set a limit of 15 MB/s on the IO throughput, using cgroups v2
(similarly to #112628).
The disk write graphs look very similar to what we've seen before:
The slow node n4
displays a flat line at nearly constant rate. Notably, there is also a period of time at the start of the test when the limit is exceeded (for 5 minutes the IO rate is 45-50 MB/s, and only later stabilizes at 15 MB/s).
One notable difference though is that the test still succeeds within 20 minutes or so, even though this one node is behind. I.e. we are successfully dealing with the slow node. After the test finishes (around 21:00), the cluster continues catching up this node.
Another difference is that the CPU on the slow node is the lowest, and doesn't spin up to 100% like in the failed runs:
Given that the IO graph looks very similar as if it was limited by cgroups
, I'm thinking that CPU might be limited as well. I'm going to run another experiment where I artificially lower both the IO and CPU capacity.
Side note: it would be worth adding some logging / introspection into the cockroach process cgroups
configuration. Printed at start and/or finish of the roachtest, and/or added to the debug.zip
.
Our CPU load measurement is relative to the getCPUCapacity which accounts for things like cgroups
. It would be reeeally useful here to know what this value is. My suspicion is that the 100% we're seeing could be relative to a lower limit. The host CPU is at 100% too though, which makes this clue weaker.
https://github.com/cockroachdb/cockroach/issues/111825#issuecomment-1766098790 is a nice find, something seems messed up about the node config here. Using tmpfs seems pretty wrong, and would explain a lot of this (although I'd expect a large memory blowup on those nodes, considering we're restoring a large backup).
I'm dropping the DR labels since we are not actively looking into this and have just been following along for updates from y'all. Feel free to re-add them if you need to rope us in!
@cockroachdb/test-eng Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests? There is a suspicion that this run inherited some leftover cgroups
limits.
Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests?
VM (or rather, cluster) reuse is indeed possible. If a test runs on a cluster and later we need to run a test that requires a cluster compatible with that existing cluster, it might be reused (after Wipe
).
This is not the case in this failure, however. If we check the test runner logs, we see that the cluster was created for this test, and was not reused after:
07:39:17 test_runner.go:687: [w4] Created new cluster for test restore/tpce/400GB/aws/nodes=8/cpus=8: teamcity-12059861-1696483956-28-n8cpu8 (arch="amd64")
08:43:32 test_runner.go:830: [w4] test failed: restore/tpce/400GB/aws/nodes=8/cpus=8 (run 1)
08:43:32 test_runner.go:851: [w4] destroying cluster teamcity-12059861-1696483956-28-n8cpu8 [tag:] (8 nodes) because: restore/tpce/400GB/aws/nodes=8/cpus=8 (1) - (test_runner.go:1150).runTest: test timed out (1h0m0s)
cc @cockroachdb/replication
Passing this issue to @cockroachdb/test-eng team.
We've done extensive investigation, and the root cause still seems to be an IO (possibly CPU too) constrained node. It would be nice to get a level below and understand the VM initialization sequence: how does it happen that it's different for different nodes, is it normal, etc?
We could also benefit from some observability improvements here:
node_exporter
metrics)I still have artifacts for this issue and #111160, LMK if you need them.
cc @cockroachdb/test-eng
Specifically, we would like to understand the correlation with tmpfs
and run-user-1000.mount
in https://github.com/cockroachdb/cockroach/issues/111825#issuecomment-1766098790. In both failures, the slow nodes (and only the slow nodes) reported ext4
instead of tmpfs
, and did not appear to execute run-user-1000.mount
. Is this just an unlikely coincidence, or is it related?
Specifically, we would like to understand the correlation with
tmpfs
andrun-user-1000.mount
in #111825 (comment). In both failures, the slow nodes (and only the slow nodes) reportedext4
instead oftmpfs
, and did not appear to executerun-user-1000.mount
. Is this just an unlikely coincidence, or is it related?
There is no correlation between tmpfs
and run-user-1000.mount
messages seen in the logs. The latter is mounted/unmounted by sshd
in response to remote ssh sessions; discussion here explains why it might be missing [1]. The occurrence of tmpfs
in cockroach.log
is bogus. It's a bug in getFileSystemProperties
[2], which should be fixed to stop misleading messages.
[1] https://unix.stackexchange.com/a/685632 [2] https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1697589992920519?thread_ts=1697024580.180369&cid=C023S0V4YEB
The correlation is with slow IO: the node which reported ext4 instead of tmpfs and which did not appear to run run-user-1000.mount
was also the node which had slow IO (and only that node). We have seen this in 2 out of 2 failures (admittedly a small sample size, but also suspicious enough to have a closer look).
A bit more precisely, it's not that run-user-1000.mount
wasn't run at all. But in both cases:
user-1000
thing did not run before the process was started.It looks as though either the node is initially slow, or maybe it wasn't entirely configured before the process started.
Another evidence that the nodes are not symmetric:
$ rg "Startup finished in"
6.journalctl.txt
1217:Oct 05 07:32:43 ip-10-12-4-138 systemd[838]: Startup finished in 42ms.
2113:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: Startup finished in 2.059s (kernel) + 48.820s (userspace) = 50.879s.
2181:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6500]: Startup finished in 25ms.
2287:Oct 05 07:40:44 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6900]: Startup finished in 25ms.
1.journalctl.txt
1209:Oct 05 07:32:42 ip-10-12-3-80 systemd[836]: Startup finished in 39ms.
2105:Oct 05 07:33:20 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: Startup finished in 1.996s (kernel) + 47.348s (userspace) = 49.345s.
2174:Oct 05 07:38:52 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[6500]: Startup finished in 24ms.
2360:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[7495]: Startup finished in 24ms.
8.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-13-119 systemd[840]: Startup finished in 41ms.
2095:Oct 05 07:33:21 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: Startup finished in 1.960s (kernel) + 46.254s (userspace) = 48.215s.
2164:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6416]: Startup finished in 24ms.
2270:Oct 05 07:40:47 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6815]: Startup finished in 24ms.
7.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-8-116 systemd[840]: Startup finished in 43ms.
2103:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: Startup finished in 1.963s (kernel) + 49.298s (userspace) = 51.261s.
2168:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6411]: Startup finished in 26ms.
2274:Oct 05 07:40:46 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6812]: Startup finished in 25ms.
5.journalctl.txt
1216:Oct 05 07:32:44 ip-10-12-1-206 systemd[834]: Startup finished in 41ms.
2114:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: Startup finished in 2.005s (kernel) + 49.150s (userspace) = 51.155s.
2183:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6494]: Startup finished in 24ms.
2289:Oct 05 07:40:43 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6896]: Startup finished in 24ms.
2.journalctl.txt
1224:Oct 05 07:32:44 ip-10-12-8-218 systemd[829]: Startup finished in 41ms.
2137:Oct 05 07:33:25 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: Startup finished in 2.348s (kernel) + 51.366s (userspace) = 53.715s.
2202:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[6569]: Startup finished in 26ms.
2338:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[7146]: Startup finished in 26ms.
3.journalctl.txt
1214:Oct 05 07:32:42 ip-10-12-2-85 systemd[840]: Startup finished in 44ms.
2122:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: Startup finished in 1.960s (kernel) + 49.613s (userspace) = 51.573s.
2187:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6489]: Startup finished in 24ms.
2293:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6890]: Startup finished in 24ms.
2377:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[7076]: Startup finished in 24ms.
4.journalctl.txt
1239:Oct 05 07:33:32 ip-10-12-9-200 systemd[871]: Startup finished in 286ms.
2271:Oct 05 07:38:55 teamcity-12059861-1696483956-28-n8cpu8-0004 systemd[1]: Startup finished in 7.008s (kernel) + 6min 8.411s (userspace) = 6min 15.419s.
All nodes report startup times as ~50s, and ~25ms. Node n4
does it significantly slower.
Same thing happens in the other failed run. All nodes completed the startup in ~50s, whereas the slow n5
has this:
5.journalctl.txt
1248:Sep 23 08:54:09 ip-10-12-10-114 systemd[847]: Startup finished in 247ms.
2341:Sep 23 09:03:24 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[1]: Startup finished in 12.254s (kernel) + 10min 23.007s (userspace) = 10min 35.262s.
2483:Sep 23 09:06:12 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[8518]: Startup finished in 520ms.
The above happens before the cockroach binary is started, or even copied to the nodes.
@srosenberg Do we have some way of doing a mega-grep on all the recent artifacts (the journalctl
files) for AWS tests to see how frequently such a thing happens? Whether it happens for other tests too (but maybe this data-intensive test is just particularly sensitive)?
In artifacts for a couple of AWS roachtests that I found in teamcity, the startup times are (like for this test) under a minute. I think this is what we should expect pretty consistently. The 6 and 10 min are surprisingly high, and likely indicate some problem with this VM.
Maybe we could add some extra diagnostics with systemd-analyze
in our roachtest routines, to help narrowing this down next time such a failure occurs. This tool can even produce SVG/dot images breaking down the whole startup sequence on a timeline. We could then have some artifacts to present to AWS to look at.
I'm removing the release blocker label as this has become an infrastructure investigation. We haven't had the time to investigate yet, but it doesn't seem like it should block the release.
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 6b08842e45668287861af596c28dce58c352d77e:
Parameters:
ROACHTEST_arch=amd64
,ROACHTEST_cloud=aws
,ROACHTEST_cpu=8
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=false
,ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) _Grafana is not yet available for aws clusters_
/cc @cockroachdb/disaster-recovery
This test on roachdash | Improve this report!
Jira issue: CRDB-32080