cockroachdb / cockroach


roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [one node IO/CPU starved] #111825

Open cockroach-teamcity opened 11 months ago

cockroach-teamcity commented 11 months ago

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 6b08842e45668287861af596c28dce58c352d77e:

(test_runner.go:1150).runTest: test timed out (1h0m0s)
(monitor.go:153).Wait: monitor failure: dial tcp 18.227.107.138:26257: connect: connection refused
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) _Grafana is not yet available for aws clusters_

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-32080

adityamaru commented 11 months ago

looking into this

adityamaru commented 11 months ago

Job was 56% done:

905777184646561793      RESTORE RESTORE FROM 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48/2022/12/20-225543.90?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z'              root    {115,132,120,146,123,113,112,143,140,119,133,109,105,116,157,141,138,121,106,129,136,142,156,144,147,118,127,150,139,107,111,114,128,154,152,151,126,135,145,122,131,110,124,137,117,108,125,155,130,153,149,148,134}        running NULL    2023-10-05 07:40:59.305935+00   2023-10-05 07:41:00.002923+00   NULL    2023-10-05 08:42:09.565124+00        0.5623053312301636      NULL            1       414291085975248250      2023-10-05 07:41:00.002924+00   2023-10-05 07:41:30.002924+00   1       NULL    NULL
adityamaru commented 11 months ago

The job starts at 07:40, and from a couple of minutes in until the test times out we see slow range RPCs for AddSSTables. Towards the end of the test the slow range RPCs reach as high as 400s.

❯ rg 'slow range RPC.*AddSSTable*' only-jobs | wc -l
      16

teamcity-12059861-1696483956-28-n8cpu8-0005> W231005 08:37:53.519417 51649 kv/kvclient/kvcoord/dist_sender.go:1882 ⋮ [T1,Vsystem,n5,f‹981c994f›,job=905777184646561793,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-905777184646561793›] 5810  slow range RPC: have been waiting 473.09s (1 attempts) for RPC AddSSTable [/Table/139/1/‹200000172921039›/‹0›,/Table/139/1/‹200000173048214›/‹2›/‹1›/‹NULL›) to r813:‹/Table/139/1/20000017{1649339-4998955}› [(n5,s5):4, (n7,s7):2, (n4,s4):5, next=6, gen=101, sticky=1696495568.402488951,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›

Similarly, there are 19 instances of slow AdminSplits and 3 of AdminScatter, all in the 90-100s range.

After the job started at 07:40 we saw an inbox communication error and a retry at 07:45:

teamcity-12059861-1696483956-28-n8cpu8-0001> W231005 07:45:56.669497 12809 ccl/backupccl/restore_job.go:210 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=905777184646561793›] 1465  encountered retryable error: importing 938 ranges: ‹inbox communication error›: grpc: ‹context canceled› [code 1/Canceled]

We now have much better checkpointing to avoid redoing completed work, so this shouldn't have a major impact on performance, but it's worth noting anyway. We have a very high number of connection-reset IO retries when reading the backup from the bucket:

❯ rg 'connection reset' only-jobs | wc -l
    1626

This would definitely cause the restore to run for longer since we're retrying on a large number of the reads from s3. We also generally see raft struggling to keep up pretty much immediately after the restore starts. Messages like acquired raft proposal quota after 15.404145179s go up to almost 30s and are being logged constantly.

Overall, looks like the cluster was pretty overwhelmed and unable to serve KV requests. I will ping KV in case they're interested in giving this cluster a look.

pav-kv commented 11 months ago

Is this related to #111160? In particular, do we see one node spinning at 100% CPU?

Upd: Yes, looks a lot like it. We should check out CPU profiles for the overloaded n4 here.

Screenshot 2023-10-11 at 10 05 51
pav-kv commented 11 months ago

The node spends 50% of its time in raft processing (and much of that is loading/storing sideloaded entries). Not that this looks particularly surprising for this test, though.

Screenshot 2023-10-11 at 10 15 00

It would be nice to compare with a healthy node profile, but we don't have it here. Can run the test manually and collect some.

pav-kv commented 11 months ago

What surprises me though is how nicely n4 stays just around 2 GiB of memory usage. Is it possible that we just somehow get a particularly under-provisioned node? We would OOM though, instead of backpressuring like this.

Screenshot 2023-10-11 at 10 24 22
blathers-crl[bot] commented 11 months ago

cc @cockroachdb/replication

pav-kv commented 11 months ago

Writes started at 07:41, and CPU almost immediately spiraled up to 100%.

Screenshot 2023-10-11 at 12 16 32 Screenshot 2023-10-11 at 12 16 41 Screenshot 2023-10-11 at 12 17 02 Screenshot 2023-10-11 at 12 18 41

Counter-intuitively, n4 peaks on every metric, yet processes the fewest SSTs.

The first CPU profile (when usage is above 80%) was captured at 07:41:26. I don't see anything out of the ordinary here; everything takes a fair share of CPU.

Screenshot 2023-10-11 at 12 21 36
pav-kv commented 11 months ago

One thing stood out in n4 stacks:

goroutine 19884 [select, 59 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0xc00234b500)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:205 +0x7a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc005960580, {0x79221b0, 0xc009318730}, 0xc004d08360, 0x66a9158)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:511 +0x3c3
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc005960580, {0x79221b0?, 0xc009318730?}, 0xc004d08360)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:189 +0x751
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc004d08360?, {0x79221b0?, 0xc009318730?}, 0x8000101?)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:121 +0x25
github.com/cockroachdb/cockroach/pkg/kv/kvserver.sendProbe({0x79221b0, 0xc009318730}, {0x7955830, 0xc005960580})
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:226 +0x14b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1({0x7922258?, 0xc0095e4ff0?})
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:200 +0xf7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
    github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
    github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

@erikgrinaker are we ever allowed to run this method that long? It looks like we evaluated the thing but never got the result or cancel. It's a probe though, and also doesn't hold any locks.

erikgrinaker commented 11 months ago

We discussed this separately, but this does seem suspect -- we're basically waiting for the reproposal machinery to get this proposal through, and it isn't happening for whatever reason.

pav-kv commented 11 months ago

In this restore run, there is a periodic spike of proposals (from another node, n1, not the slow n4; n1 is probably the gateway). Every 2.5 minutes there is a spike to 35/s (I wonder why).

Screenshot 2023-10-12 at 09 44 09

Those are not AddSSTs though. AddSSTs are sent at a rate <2/s.

Screenshot 2023-10-12 at 09 46 07

Not reproposals either, we only see reproposals on n4:

Screenshot 2023-10-12 at 09 46 55

There are also only a couple of reproposals with a new LAI (NB: I made this graph show the absolute counter, not rate), from n4:

Screenshot 2023-10-12 at 09 51 47
pav-kv commented 11 months ago

Running a "happy" run to compare the behaviours. I do see the same periodic spikes from one node, so that's probably just the shape of the workload. There aren't many reproposals though, but notably it's not zero and constantly trickles at a rate of a few reproposals/minute.

pav-kv commented 11 months ago

In the happy run, the Raft portion of the CPU profile is hardly noticeable (looks similar on multiple nodes I've looked at):

Screenshot 2023-10-12 at 10 30 49

The profile we've seen above is almost double this one, and noticeably has a lot more raft processing.

pav-kv commented 11 months ago

@sumeerbhola Could you help read this graph (TLDR from above is that n4 is using 100% CPU and is super slow)? Why is the number of slots on n4 significantly higher than on other nodes? What does it mean? We also see that the used slots on this node spiked twice. The CPU was at 100% the whole time, though.

Screenshot 2023-10-12 at 15 23 41
erikgrinaker commented 11 months ago

The p50 log commit latency here is 10 seconds. Pretty sure this is just an overloaded disk or something.

Screenshot 2023-10-13 at 12 37 50
erikgrinaker commented 11 months ago

Unfortunately, we don't have CPU iowait metrics. However, most of the CPU usage appears to be user, not sys, and I think iowait would show up as sys.

Screenshot 2023-10-13 at 12 43 04
erikgrinaker commented 11 months ago

n3 and n5 are linearly falling behind on their Raft logs for some reason:

Screenshot 2023-10-13 at 12 44 55
pav-kv commented 11 months ago

Can CPU starvation cause this commit latency jump (i.e. it's just measurement error rather than disk slowness)? It appears that both CPU and disk are slow, but I'm not sure which is the cause. Good point re iowait time and sys time. Maybe we should consider improving this observability aspect.
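
For a quick manual check on the host, standard tools would show iowait directly; a minimal sketch, assuming vmstat and the sysstat package happen to be available on the VM image:

$ vmstat 1 5       # the "wa" column is the percentage of CPU time spent waiting on IO
$ iostat -x 1 5    # per-device utilization and request latencies (requires sysstat)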

Conjecture re nodes 3 and 5: maybe there is some range that node 4 can't deliver to them. It still holds a small fraction of leases.

erikgrinaker commented 11 months ago

Many of the slow Raft readys are simply slow log appends, although there are a few during apply and other phases too.

teamcity-12059861-1696483956-28-n8cpu8-0004> I231005 07:41:18.626887 311 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n4,s4,r200/4:‹/Table/134/1/20000008{05…-54…}›,raft] 387  raft ready handling: 2.12s [append=2.12s, apply=0.00s, non-blocking-sync=0.00s, other=0.00s], wrote [append-batch=264 B, append-sst=8.6 MiB (1), ]; node might be overloaded

We see replication AC firing a lot too (note to self: fix the log context and formatting here).

teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055274 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4462  stream t1/s1 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055297 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4463  stream t1/s3 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055310 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4464  stream t1/s4 was blocked: durations: elastic 4.470939686stokens deducted: regular 0 B elastic 40 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055319 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4465  stream t1/s6 was blocked: durations: elastic 108.264µstokens deducted: regular 0 B elastic 74 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055328 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4466  stream t1/s8 was blocked: durations: elastic 88.412183mstokens deducted: regular 0 B elastic 52 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055334 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4467  stream t1/s7 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055342 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4468  stream t1/s2 was blocked: durations: elastic 238.371768mstokens deducted: regular 0 B elastic 22 MB
teamcity-12059861-1696483956-28-n8cpu8-0006> I231005 08:28:36.055348 407 kv/kvserver/kvflowcontrol/kvflowcontroller/kvflowcontroller_metrics.go:277 ⋮ [T1,-] 4469  stream t1/s5 was blocked: durations: tokens deducted: regular 0 B elastic 0 B
erikgrinaker commented 11 months ago

Can CPU starvation cause this commit latency jump (i.e. it's just the measurement error rather than disk slowness)?

I'd be very surprised if it stalled for 10 seconds.

The goroutine scheduling latency is pretty poor though, at 100ms.

Screenshot 2023-10-13 at 12 55 10
pav-kv commented 11 months ago

Conjecture re nodes 3 and 5: maybe there is some range that node 4 can't deliver to them. It still holds a small fraction of leases.

It's probably the other way around. The metric (on a leader) measures the number of entries that followers are behind on. In this case we probably have ranges on node 3 and 5 for which node 4 is a follower. Since 4 is slow, it falls behind them.

Interesting though, why only 3 and 5?

erikgrinaker commented 11 months ago

Had a look at one of the tripped circuit breakers on n4, for r201. n4 was originally the leaseholder, but then transferred the lease to n3:

07:43:30.197271 33732 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n3,replicate,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 702  transferring lease to s4

For some time, n3 was struggling to keep up with proposals:

W231005 07:43:34.982627 307 go.etcd.io/raft/v3/raft.go:799 ⋮ [T1,Vsystem,n3,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 682  2 appending new entries to log would exceed uncommitted entry size limit; dropping proposal

n4's circuit breaker then trips on an AddSSTable that was submitted before the lease transfer:

E231005 07:43:54.894326 19930 kv/kvserver/replica_circuit_breaker.go:167 ⋮ [T1,Vsystem,n4,s4,r201/4:‹/Table/134/1/2000000{89…-94…}›] 1738  breaker: tripped with error: replica unavailable: (n4,s4):4 unable to serve request to r201:‹/Table/134/1/2000000{89890890-94780543}› [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=29, sticky=1696495264.371972605,0]: closed timestamp: 1696491763.182001913,0 (2023-10-05 07:42:43); raft status: {"id":"4","term":8,"vote":"0","commit":38,"lead":"2","raftState":"StateFollower","applied":38,"progress":{},"leadtransferee":"0"}: have been waiting 61.50s for slow proposal AddSSTable [/Table/134/1/‹200000090276075›/‹0›,/Table/134/1/‹200000090468864›/‹0›/‹NULL›)
W231005 07:44:31.776452 2037 kv/kvclient/kvcoord/dist_sender.go:1882 ⋮ [T1,Vsystem,n5,f‹eabe4ccd›,job=905777184646561793,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-905777184646561793›] 881  slow range RPC: have been waiting 108.56s (1 attempts) for RPC AddSSTable [/Table/134/1/‹200000090276075›/‹0›,/Table/134/1/‹200000090468864›/‹0›/‹NULL›) to r201:‹/Table/134/1/2000000{89890890-94780543}› [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=29, sticky=1696495264.371972605,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›

n3 then truncates its log, cutting n4 off:

I231005 07:44:56.197919 41026 kv/kvserver/raft_log_queue.go:725 ⋮ [T1,Vsystem,n3,raftlog,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 774  should truncate: true [truncate 30 entries to first index 73 (chosen via: last index); log too large (146 MiB > 16 MiB); implies 1 Raft snapshot]

And it keeps trying to send snapshots to n4, which aren't going through because the node is overloaded:

W231005 07:47:19.560255 53123 13@kv/kvserver/replica_command.go:2908 ⋮ [T1,Vsystem,n3,raftsnapshot,s3,r201/2:‹/Table/134/1/2000000{89…-94…}›] 1058  attempt 1: delegate snapshot ‹range_id:201 coordinator_replica:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > recipient_replica:<node_id:4 store_id:4 replica_id:4 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > deprecated_priority:RECOVERY term:8 first_index:114 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:29 queue_on_delegate_len:-1 snap_id:82bee3ac-4f19-45cb-a384-02aafd829485 › request failed error sending couldn't accept ‹range_id:201 coordinator_replica:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > recipient_replica:<node_id:4 store_id:4 replica_id:4 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:3 replica_id:2 type:VOTER_FULL > deprecated_priority:RECOVERY term:8 first_index:114 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:29 queue_on_delegate_len:-1 snap_id:82bee3ac-4f19-45cb-a384-02aafd829485 ›: recv msg error: grpc: ‹giving up during snapshot reservation due to cluster setting "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded› [code 4/DeadlineExceeded]

This is all consistent with n4 being overloaded and unable to accept replication traffic. It's unclear if the reproposal behavior here is expected (i.e. that n4 should keep reproposing the AddSSTable long after it lost the lease and leadership), so that's worth double-checking, but beyond that this all looks like expected overload behavior.

erikgrinaker commented 11 months ago

I'll also note that follower pausing triggered for n4 as well, both on r201 and r813.

erikgrinaker commented 11 months ago

The inbound Raft traffic on n4 is basically a flat line at ~10 MB/s, which seems very suspect compared to the other nodes, which are non-flat. Possibly it's bottlenecked on disk or some other resource. However, the disk was pushing more throughput earlier in the test.

Screenshot 2023-10-13 at 14 46 30 Screenshot 2023-10-13 at 15 05 49
erikgrinaker commented 11 months ago

Raft scheduler latency is pretty atrocious, which I think also points towards slow IO:

Screenshot 2023-10-13 at 15 21 43
cockroach-teamcity commented 11 months ago

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 45f6344a2fc7d2d52f410e1ecad98f8c928cfeea:

(monitor.go:153).Wait: monitor failure: pq: backup from version 22.2 is older than the minimum restorable version 23.1
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) _Grafana is not yet available for aws clusters_

This test on roachdash | Improve this report!

pav-kv commented 11 months ago

The new failure is unrelated to this issue, and is a dup of #112381. Adding x-noreuse label here to avoid the spam.

pav-kv commented 11 months ago

To reiterate on IO: here we have metrics reported by the OS. These correspond to the diskstats that we collect (ultimately calling this) from the host node.

We are seeing that write.time as well as weightedio.time grow at a rate of 1s/s, i.e. the host is doing IO all the time. We also see that the rate at which it does IO is constant, and sticks around 30 MB/s.

Something happens at 7:54, after which the IO rate drops to about half, 15 MB/s, as does the time spent in IO.
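
For reference, the raw counters behind these graphs can be read directly on the host; a minimal sketch, assuming the data disk is nvme1n1 (as the FileStoreProperties logs below suggest), with field positions per the kernel's iostats documentation and values in milliseconds:

$ awk '$3 == "nvme1n1" {print "write_ms=" $11, "io_ms=" $13, "weighted_io_ms=" $14}' /proc/diskstats

A counter that grows by ~1000 ms per wall-clock second corresponds to the 1s/s rate described above.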

Screenshot 2023-10-17 at 09 19 20 Screenshot 2023-10-17 at 09 21 55
Same graphs, but showing the approx rate Screenshot 2023-10-17 at 09 09 19 Screenshot 2023-10-17 at 09 09 58
pav-kv commented 11 months ago

Both this test failure and #111160 have the following line in logs only on one node, and that happens to be the node with the slow IO (n4 in this run, and n5 in #111160). Other nodes print a different line.

$ rg nvme
...
4.unredacted/cockroach.log
48:I231005 07:40:42.517775 79 server/config.go:868 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=ext4, blkdev=‹/dev/nvme1n1›, mnt=‹/mnt/data1› opts=‹rw,relatime›}}

Other nodes print this:

48:I231005 07:40:45.889008 80 server/config.go:868 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1611012k,mode=700,uid=1000,gid=1000,inode64›}}

This points pretty conclusively at something disk-configuration related. Not sure how to interpret this. The printed struct is populated in this func, which has a large TODO saying that the result is not reliable. But anyway, this all likely means that the nodes are not configured symmetrically.

Worth noting: I ran a happy run of this test and collected the configurations from the logs. They look similar, with one node (sometimes multiple) configured differently from the others:

$ grep "cockroach.log.*store.*FileStoreProp" confs.txt
        logs/cockroach.log:I231017 10:29:12.018113 95 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=ext4, blkdev=‹/dev/nvme1n1›, mnt=‹/mnt/data1› opts=‹rw,relatime›}}
        logs/cockroach.log:I231017 10:29:18.500440 89 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:21.689279 77 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:24.914879 80 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:28.273792 78 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609332k,nr_inodes=402333,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:31.620834 86 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:34.890229 76 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609332k,nr_inodes=402333,mode=700,uid=1000,gid=1000,inode64›}}
        logs/cockroach.log:I231017 10:29:38.074884 81 server/config.go:874 ⋮ [T1,Vsystem,n?] 15  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/mnt/data1/cockroach›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/run/user/1000› opts=‹rw,nosuid,nodev,relatime,size=1609336k,nr_inodes=402334,mode=700,uid=1000,gid=1000,inode64›}}

So this difference doesn't necessarily cause IO slowness by itself, but there is something here.

Also:

$ rg "(mounted filesystem|user-1000)" *.journal* | less

// All nodes print something like:
6.journalctl.txt:Oct 05 07:32:34 ubuntu kernel: EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
6.journalctl.txt:Oct 05 07:33:03 ip-10-12-4-138 kernel: EXT4-fs (nvme1n1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
6.journalctl.txt:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
6.journalctl.txt:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.

// Except this one:
4.journalctl.txt:Oct 05 07:32:48 ubuntu kernel: EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
4.journalctl.txt:Oct 05 07:36:18 ip-10-12-9-200 kernel: EXT4-fs (nvme1n1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.

@srosenberg Is there some kind of race or metamorphism here in the environment initialization path?

pav-kv commented 11 months ago

I ran an experiment in which I set a limit of 15 MB/s on the IO throughput, using cgroups v2 (similarly to #112628).
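
For reference, a minimal sketch of how such a cap can be applied with cgroup v2 (assuming cgroup v2 with the io controller enabled; the cgroup name here is made up, and the device and 15 MB/s value mirror this experiment):

$ MAJMIN=$(lsblk -ndo MAJ:MIN /dev/nvme1n1)
$ sudo mkdir -p /sys/fs/cgroup/crdb-io-limit
$ echo "$MAJMIN wbps=15728640" | sudo tee /sys/fs/cgroup/crdb-io-limit/io.max    # 15 MiB/s write cap
$ pgrep -x cockroach | sudo tee /sys/fs/cgroup/crdb-io-limit/cgroup.procs        # move the (single) cockroach process into the cgroup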

The disk write graphs look very similar to what we've seen before:

Graphs Screenshot 2023-10-18 at 22 35 01 Screenshot 2023-10-18 at 22 35 28

The slow node n4 shows a flat line at a nearly constant rate. Notably, there is also a period at the start of the test when the limit is exceeded (for 5 minutes the IO rate is 45-50 MB/s, and it only later stabilizes at 15 MB/s).

One notable difference though is that the test still succeeds within 20 minutes or so, even though this one node is behind. I.e. we are successfully dealing with the slow node. After the test finishes (around 21:00), the cluster continues catching up this node.

Another difference is that the CPU on the slow node is the lowest, and doesn't spin up to 100% like in the failed runs:

Graph Screenshot 2023-10-18 at 22 41 40

Given that the IO graph looks very much as if it were limited by cgroups, I'm thinking that CPU might be limited as well. I'm going to run another experiment where I artificially lower both the IO and CPU capacity.
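
A CPU cap can be layered onto the same hypothetical cgroup as the IO cap sketched above; quota and period are in microseconds:

$ echo "400000 100000" | sudo tee /sys/fs/cgroup/crdb-io-limit/cpu.max    # allow 4 CPUs' worth of time per 100ms period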

pav-kv commented 11 months ago

Side note: it would be worth adding some logging / introspection into the cockroach process cgroups configuration. Printed at start and/or finish of the roachtest, and/or added to the debug.zip.

Our CPU load measurement is relative to the getCPUCapacity which accounts for things like cgroups. It would be reeeally useful here to know what this value is. My suspicion is that the 100% we're seeing could be relative to a lower limit. The host CPU is at 100% too though, which makes this clue weaker.
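
As a stopgap for that kind of introspection, the limits the process actually runs under can be dumped from the host; a sketch, assuming a cgroup-v2-only host and a single cockroach process:

$ CG=$(awk -F'::' '{print $2}' /proc/$(pgrep -x cockroach)/cgroup)
$ cat /sys/fs/cgroup$CG/cpu.max /sys/fs/cgroup$CG/io.max /sys/fs/cgroup$CG/memory.max    # "max" means unlimited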

erikgrinaker commented 11 months ago

https://github.com/cockroachdb/cockroach/issues/111825#issuecomment-1766098790 is a nice find, something seems messed up about the node config here. Using tmpfs seems pretty wrong, and would explain a lot of this (although I'd expect a large memory blowup on those nodes, considering we're restoring a large backup).

adityamaru commented 11 months ago

I'm dropping the DR labels since we are not actively looking into this and have just been following along for updates from y'all. Feel free to re-add them if you need to rope us in!

pav-kv commented 11 months ago

@cockroachdb/test-eng Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests? There is a suspicion that this run inherited some leftover cgroups limits.

renatolabs commented 11 months ago

Is it possible to check whether VMs from this roachtest run were previously used by other roachtests, and potentially identify those tests?

VM (or rather, cluster) reuse is indeed possible. If a test runs on a cluster and later we need to run a test that requires a cluster compatible with that existing cluster, it might be reused (after Wipe).

This is not the case in this failure, however. If we check the test runner logs, we see that the cluster was created for this test, and was not reused after:

07:39:17 test_runner.go:687: [w4] Created new cluster for test restore/tpce/400GB/aws/nodes=8/cpus=8: teamcity-12059861-1696483956-28-n8cpu8 (arch="amd64")

08:43:32 test_runner.go:830: [w4] test failed: restore/tpce/400GB/aws/nodes=8/cpus=8 (run 1)
08:43:32 test_runner.go:851: [w4] destroying cluster teamcity-12059861-1696483956-28-n8cpu8 [tag:] (8 nodes) because: restore/tpce/400GB/aws/nodes=8/cpus=8 (1) - (test_runner.go:1150).runTest: test timed out (1h0m0s)

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/replication

pav-kv commented 11 months ago

Passing this issue to @cockroachdb/test-eng team.

We've done extensive investigation, and the root cause still seems to be an IO- (possibly also CPU-) constrained node. It would be nice to go a level below and understand the VM initialization sequence: how does it end up different for different nodes, is this normal, etc.?

We could also benefit from some observability improvements here:

I still have artifacts for this issue and #111160, LMK if you need them.

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/test-eng

erikgrinaker commented 11 months ago

Specifically, we would like to understand the correlation with tmpfs and run-user-1000.mount in https://github.com/cockroachdb/cockroach/issues/111825#issuecomment-1766098790. In both failures, the slow nodes (and only the slow nodes) reported ext4 instead of tmpfs, and did not appear to execute run-user-1000.mount. Is this just an unlikely coincidence, or is it related?

srosenberg commented 11 months ago

Specifically, we would like to understand the correlation with tmpfs and run-user-1000.mount in #111825 (comment). In both failures, the slow nodes (and only the slow nodes) reported ext4 instead of tmpfs, and did not appear to execute run-user-1000.mount. Is this just an unlikely coincidence, or is it related?

There is no correlation between tmpfs and the run-user-1000.mount messages seen in the logs. The latter is mounted/unmounted by sshd in response to remote ssh sessions; the discussion here explains why it might be missing [1]. The occurrence of tmpfs in cockroach.log is bogus. It's a bug in getFileSystemProperties [2], which should be fixed to stop emitting misleading messages.

[1] https://unix.stackexchange.com/a/685632 [2] https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1697589992920519?thread_ts=1697024580.180369&cid=C023S0V4YEB

erikgrinaker commented 11 months ago

The correlation is with slow IO: the node which reported ext4 instead of tmpfs and which did not appear to run run-user-1000.mount was also the node which had slow IO (and only that node). We have seen this in 2 out of 2 failures (admittedly a small sample size, but also suspicious enough to have a closer look).

pav-kv commented 11 months ago

A bit more precisely, it's not that run-user-1000.mount wasn't run at all. But in both cases:

  1. The user-1000 thing did not run before the process was started.
  2. The "slow" node has a noticeable gap in when it was started.
For completeness, here is the log from run #111160:

```
$ rg user-1000
8.journalctl.txt
2135:Sep 23 08:53:40 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2239:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2303:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
10.journalctl.txt
2152:Sep 23 08:53:44 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.
2258:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.
2322:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0010 systemd[1]: run-user-1000.mount: Succeeded.
6.journalctl.txt
2122:Sep 23 08:53:38 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2224:Sep 23 09:04:43 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2288:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
1.journalctl.txt
2157:Sep 23 08:53:40 teamcity-11896183-1695446936-33-n10cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
2353:Sep 23 09:05:38 teamcity-11896183-1695446936-33-n10cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
9.journalctl.txt
2144:Sep 23 08:53:41 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.
2246:Sep 23 09:04:42 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.
2310:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0009 systemd[1]: run-user-1000.mount: Succeeded.
7.journalctl.txt
2136:Sep 23 08:53:41 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2238:Sep 23 09:04:41 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2303:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
5.journalctl.txt
2463:Sep 23 09:06:10 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.
2.journalctl.txt
2133:Sep 23 08:53:37 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2235:Sep 23 09:04:42 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2299:Sep 23 09:05:24 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2383:Sep 23 09:05:39 teamcity-11896183-1695446936-33-n10cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
3.journalctl.txt
2144:Sep 23 08:53:39 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2259:Sep 23 09:05:24 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2343:Sep 23 09:05:41 teamcity-11896183-1695446936-33-n10cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
4.journalctl.txt
2145:Sep 23 08:53:39 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.
2257:Sep 23 09:05:25 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.
2341:Sep 23 09:05:42 teamcity-11896183-1695446936-33-n10cpu8-0004 systemd[1]: run-user-1000.mount: Succeeded.
```

and the corresponding `test.log`:

```
09:05:24 cluster.go:709: test status:
09:05:24 cluster.go:709: test status: starting cluster
09:05:24 cockroach.go:312: teamcity-11896183-1695446936-33-n10cpu8: starting nodes
09:05:25 cockroach.go:812: teamcity-11896183-1695446936-33-n10cpu8: initializing cluster
09:05:26 cockroach.go:822: warning: --url specifies user/password, but command "init" does not accept user/password details - details ignored
Cluster successfully initialized
09:05:26 cockroach.go:831: teamcity-11896183-1695446936-33-n10cpu8: setting cluster settings
09:05:28 cockroach.go:841: SET CLUSTER SETTING SET CLUSTER SETTING
09:06:09 restore.go:450: test status: running setup statements
09:06:09 restore.go:462: test status: running restore
09:06:10 monitor.go:177: Monitor event: n8: cockroach process is running (PID: 6994)
09:06:10 monitor.go:177: Monitor event: n9: cockroach process is running (PID: 6983)
09:06:10 monitor.go:177: Monitor event: n10: cockroach process is running (PID: 6994)
09:06:10 monitor.go:177: Monitor event: n6: cockroach process is running (PID: 6913)
09:06:10 monitor.go:177: Monitor event: n7: cockroach process is running (PID: 6986)
09:06:10 monitor.go:177: Monitor event: n2: cockroach process is running (PID: 6922)
09:06:10 monitor.go:177: Monitor event: n1: cockroach process is running (PID: 7217)
09:06:10 monitor.go:177: Monitor event: n3: cockroach process is running (PID: 6998)
09:06:10 monitor.go:177: Monitor event: n4: cockroach process is running (PID: 6982)
09:06:23 monitor.go:177: Monitor event: n5: cockroach process is running (PID: 8487)
```

NB: node 5 is the only one that didn't see `mount-1000` events before the node was started (see timestamps). It was also started noticeably later than all the other nodes (the gap is 13 seconds).

We observe a similar effect in this issue. The slow node `n4` doesn't show `mount-1000` at all (others do):

```
artifacts/2.journalctl.txt
2173:Oct 05 07:33:35 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
2312:Oct 05 07:40:35 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/5.journalctl.txt
2154:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.
2263:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/6.journalctl.txt
2149:Oct 05 07:33:33 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
2261:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/1.journalctl.txt
2145:Oct 05 07:33:30 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
2334:Oct 05 07:40:34 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/8.journalctl.txt
2135:Oct 05 07:33:32 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
2244:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/3.journalctl.txt
2158:Oct 05 07:33:32 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2267:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
2351:Oct 05 07:40:37 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: run-user-1000.mount: Succeeded.
artifacts/7.journalctl.txt
2139:Oct 05 07:33:34 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
2248:Oct 05 07:40:26 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: run-user-1000.mount: Succeeded.
```

Slow node `n4` has a start gap from all the other nodes:

```
07:40:20 cluster.go:713: test status:
07:40:20 cluster.go:713: test status: starting cluster
07:40:20 cockroach.go:312: teamcity-12059861-1696483956-28-n8cpu8: starting nodes
07:40:21 cockroach.go:814: teamcity-12059861-1696483956-28-n8cpu8: initializing cluster
07:40:22 cockroach.go:824: warning: --url specifies user/password, but command "init" does not accept user/password details - details ignored
Cluster successfully initialized
07:40:22 cockroach.go:833: teamcity-12059861-1696483956-28-n8cpu8: setting cluster settings
07:40:23 cockroach.go:843: SET CLUSTER SETTING SET CLUSTER SETTING
07:40:49 restore.go:450: test status: running setup statements
07:40:49 restore.go:469: test status: running restore
07:40:50 monitor.go:177: Monitor event: n8: cockroach process is running (PID: 6979)
07:40:50 monitor.go:177: Monitor event: n6: cockroach process is running (PID: 7063)
07:40:50 monitor.go:177: Monitor event: n7: cockroach process is running (PID: 6975)
07:40:50 monitor.go:177: Monitor event: n5: cockroach process is running (PID: 7059)
07:40:50 monitor.go:177: Monitor event: n1: cockroach process is running (PID: 7287)
07:40:50 monitor.go:177: Monitor event: n3: cockroach process is running (PID: 7053)
07:40:50 monitor.go:177: Monitor event: n2: cockroach process is running (PID: 7122)
07:40:55 monitor.go:177: Monitor event: n4: cockroach process is running (PID: 8002)
```

It looks as though either the node is initially slow, or maybe it wasn't entirely configured before the process started.

pav-kv commented 11 months ago

Another evidence that the nodes are not symmetric:

$ rg "Startup finished in"
6.journalctl.txt
1217:Oct 05 07:32:43 ip-10-12-4-138 systemd[838]: Startup finished in 42ms.
2113:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[1]: Startup finished in 2.059s (kernel) + 48.820s (userspace) = 50.879s.
2181:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6500]: Startup finished in 25ms.
2287:Oct 05 07:40:44 teamcity-12059861-1696483956-28-n8cpu8-0006 systemd[6900]: Startup finished in 25ms.

1.journalctl.txt
1209:Oct 05 07:32:42 ip-10-12-3-80 systemd[836]: Startup finished in 39ms.
2105:Oct 05 07:33:20 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[1]: Startup finished in 1.996s (kernel) + 47.348s (userspace) = 49.345s.
2174:Oct 05 07:38:52 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[6500]: Startup finished in 24ms.
2360:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0001 systemd[7495]: Startup finished in 24ms.

8.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-13-119 systemd[840]: Startup finished in 41ms.
2095:Oct 05 07:33:21 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[1]: Startup finished in 1.960s (kernel) + 46.254s (userspace) = 48.215s.
2164:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6416]: Startup finished in 24ms.
2270:Oct 05 07:40:47 teamcity-12059861-1696483956-28-n8cpu8-0008 systemd[6815]: Startup finished in 24ms.

7.journalctl.txt
1215:Oct 05 07:32:44 ip-10-12-8-116 systemd[840]: Startup finished in 43ms.
2103:Oct 05 07:33:23 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[1]: Startup finished in 1.963s (kernel) + 49.298s (userspace) = 51.261s.
2168:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6411]: Startup finished in 26ms.
2274:Oct 05 07:40:46 teamcity-12059861-1696483956-28-n8cpu8-0007 systemd[6812]: Startup finished in 25ms.

5.journalctl.txt
1216:Oct 05 07:32:44 ip-10-12-1-206 systemd[834]: Startup finished in 41ms.
2114:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[1]: Startup finished in 2.005s (kernel) + 49.150s (userspace) = 51.155s.
2183:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6494]: Startup finished in 24ms.
2289:Oct 05 07:40:43 teamcity-12059861-1696483956-28-n8cpu8-0005 systemd[6896]: Startup finished in 24ms.

2.journalctl.txt
1224:Oct 05 07:32:44 ip-10-12-8-218 systemd[829]: Startup finished in 41ms.
2137:Oct 05 07:33:25 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[1]: Startup finished in 2.348s (kernel) + 51.366s (userspace) = 53.715s.
2202:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[6569]: Startup finished in 26ms.
2338:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0002 systemd[7146]: Startup finished in 26ms.

3.journalctl.txt
1214:Oct 05 07:32:42 ip-10-12-2-85 systemd[840]: Startup finished in 44ms.
2122:Oct 05 07:33:22 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[1]: Startup finished in 1.960s (kernel) + 49.613s (userspace) = 51.573s.
2187:Oct 05 07:38:53 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6489]: Startup finished in 24ms.
2293:Oct 05 07:40:25 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[6890]: Startup finished in 24ms.
2377:Oct 05 07:40:49 teamcity-12059861-1696483956-28-n8cpu8-0003 systemd[7076]: Startup finished in 24ms.

4.journalctl.txt
1239:Oct 05 07:33:32 ip-10-12-9-200 systemd[871]: Startup finished in 286ms.
2271:Oct 05 07:38:55 teamcity-12059861-1696483956-28-n8cpu8-0004 systemd[1]: Startup finished in 7.008s (kernel) + 6min 8.411s (userspace) = 6min 15.419s.

All nodes report startup times of ~50s (for the system manager) and ~25ms (for the per-user manager). Node n4 is significantly slower on both.

The same thing happens in the other failed run. All nodes completed startup in ~50s, whereas the slow n5 has this:

5.journalctl.txt
1248:Sep 23 08:54:09 ip-10-12-10-114 systemd[847]: Startup finished in 247ms.
2341:Sep 23 09:03:24 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[1]: Startup finished in 12.254s (kernel) + 10min 23.007s (userspace) = 10min 35.262s.
2483:Sep 23 09:06:12 teamcity-11896183-1695446936-33-n10cpu8-0005 systemd[8518]: Startup finished in 520ms.
pav-kv commented 11 months ago

The above happens before the cockroach binary is started, or even copied to the nodes.

pav-kv commented 11 months ago

@srosenberg Do we have some way of doing a mega-grep on all the recent artifacts (the journalctl files) for AWS tests to see how frequently such a thing happens? Whether it happens for other tests too (but maybe this data-intensive test is just particularly sensitive)?

pav-kv commented 11 months ago

In the artifacts for a couple of AWS roachtests that I found in TeamCity, the startup times are (as in this test) under a minute. I think this is what we should expect pretty consistently. The 6- and 10-minute startups are surprisingly high, and likely indicate some problem with the VM.

Maybe we could add some extra diagnostics with systemd-analyze to our roachtest routines, to help narrow this down the next time such a failure occurs. This tool can even produce SVG/dot images breaking down the whole startup sequence on a timeline. We would then have some artifacts to present to AWS.
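
For reference, a minimal sketch of the kind of systemd-analyze invocations that could be run on each VM and collected into the artifacts:

$ systemd-analyze time                  # total kernel + userspace startup time, as in the journal lines above
$ systemd-analyze blame | head -20      # slowest units first
$ systemd-analyze critical-chain        # the dependency chain that gated startup
$ systemd-analyze plot > startup.svg    # timeline rendering of the whole startup sequence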

renatolabs commented 10 months ago

I'm removing the release blocker label as this has become an infrastructure investigation. We haven't had the time to investigate yet, but it doesn't seem like it should block the release.