cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: kv/splits/nodes=3/quiesce=false failed kv/splits/nodes=3/quiesce=true failed [known cause] #30832

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/4e80b4a591764fa75c43b3cb29f384e1c9842a49

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=kv/splits/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=934753&tab=buildLog

The test failed on master:
    test.go:500,cluster.go:890,kv.go:185,cluster.go:1199,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-934753-kv-splits-nodes-3:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        5:36.202633 438 workload/workload.go:456  finished 360000 of 500000 splits
        I181001 12:15:49.395890 438 workload/workload.go:456  finished 361000 of 500000 splits
        I181001 12:15:58.407011 438 workload/workload.go:456  finished 362000 of 500000 splits
        I181001 12:16:07.264775 438 workload/workload.go:456  finished 363000 of 500000 splits
        I181001 12:16:18.555317 438 workload/workload.go:456  finished 364000 of 500000 splits
        I181001 12:16:28.434512 438 workload/workload.go:456  finished 365000 of 500000 splits
        I181001 12:16:35.897020 438 workload/workload.go:456  finished 366000 of 500000 splits
        I181001 12:16:47.130857 438 workload/workload.go:456  finished 367000 of 500000 splits
        I181001 12:16:56.619298 438 workload/workload.go:456  finished 368000 of 500000 splits
        I181001 12:17:04.713195 438 workload/workload.go:456  finished 369000 of 500000 splits
        Error: ALTER TABLE kv SPLIT AT VALUES (5153954646052935680): pq: split at key /Table/53/1/5153954646052935680 failed: context deadline exceeded
        Error:  exit status 1
        : exit status 1
    test.go:500,cluster.go:1220,kv.go:188: Goexit() was called
tbg commented 6 years ago

Hmm. This is new, right @nvanbenschoten? Last I saw this we didn't have nobarrier set (?).

nvanbenschoten commented 6 years ago

I think we've seen this a few times recently. For instance, https://github.com/cockroachdb/cockroach/issues/30745.

It's pretty worrying, so I'm prioritizing it.

nvanbenschoten commented 6 years ago

The logs don't show a lot. Everything was moving along fine until around 12:17:13, when all three nodes started experiencing node liveness failures. These are the logs around that time on node1:

... lots of new range lease logs
I181001 12:17:12.446855 175 storage/replica_proposal.go:211  [n1,s1,r369521/1:/Table/53/1/6688{4993…-5362…}] new range lease repl=(n1,s1):1 seq=9 start=1538396226.439144511,0 epo=1 pro=1538396226.439548049,0 following repl=(n3,s3):3 seq=8 start=1538393729.489096306,0 epo=1 pro=1538393729.489097748,0
I181001 12:17:12.655041 170 storage/replica_proposal.go:211  [n1,s1,r369527/1:/Table/53/1/54386{2423…-6112…}] new range lease repl=(n1,s1):1 seq=10 start=1538396226.720070150,0 epo=1 pro=1538396226.720072218,0 following repl=(n3,s3):3 seq=9 start=1538393842.949011334,0 epo=1 pro=1538393842.949013381,0
E181001 12:17:13.015685 6712456 storage/replica_range_lease.go:301  [n1,s1,r279515/1:/Table/53/1/671{89363…-90101…}] mismatch incrementing epoch for {NodeID:3 Epoch:1 Expiration:1538396227.881482031,0 Draining:false Decommissioning:false}; actual is {NodeID:3 Epoch:1 Expiration:1538396232.380737449,0 Draining:false Decommissioning:false}
I181001 12:17:14.630589 6715838 storage/node_liveness.go:769  [n1,s1,r279515/1:/Table/53/1/671{89363…-90101…}] incremented n3 liveness epoch to 2
E181001 12:17:14.630774 6715936 storage/replica_range_lease.go:301  [n1,s1,r279580/1:/Table/53/1/5404{5716…-6453…}] mismatch incrementing epoch for {NodeID:2 Epoch:1 Expiration:1538396232.384500884,0 Draining:false Decommissioning:false}; actual is {NodeID:2 Epoch:1 Expiration:1538396241.384544493,0 Draining:false Decommissioning:false}
E181001 12:17:14.634945 6715945 storage/replica_range_lease.go:301  [n1,s1,r279574/1:/Table/53/1/6020{8760…-9498…}] mismatch incrementing epoch for {NodeID:2 Epoch:1 Expiration:1538396232.384500884,0 Draining:false Decommissioning:false}; actual is {NodeID:2 Epoch:1 Expiration:1538396241.384544493,0 Draining:false Decommissioning:false}
I181001 12:17:15.970915 226 server/status/runtime.go:465  [n1] runtime stats: 9.0 GiB RSS, 849 goroutines, 8.0 GiB/38 MiB/8.7 GiB GO alloc/idle/total, 253 MiB/310 MiB CGO alloc/total, 16810.3 CGO/sec, 296.1/4.2 %(u/s)time, 0.0 %gc (1x), 5.3 MiB/7.7 MiB (r/w)net
I181001 12:17:16.340638 112 storage/replica_proposal.go:211  [n1,s1,r369616/1:/Table/53/1/6568{1898…-2267…}] new range lease repl=(n1,s1):1 seq=9 start=1538396229.352316664,0 epo=1 pro=1538396229.352322551,0 following repl=(n3,s3):3 seq=8 start=1538394902.041204727,0 epo=1 pro=1538394902.041206466,0
W181001 12:17:17.316790 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 2.1s
I181001 12:17:20.759340 171 storage/replica_proposal.go:211  [n1,s1,r279515/1:/Table/53/1/671{89363…-90101…}] new range lease repl=(n1,s1):1 seq=11 start=1538396233.018741631,0 epo=1 pro=1538396233.018747811,0 following repl=(n3,s3):3 seq=10 start=1538395358.690700226,0 epo=1 pro=1538395358.690702811,0
I181001 12:17:24.257944 233 storage/node_liveness.go:830  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context deadline exceeded)
W181001 12:17:24.257967 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:24.257998 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
W181001 12:17:24.597082 155 storage/closedts/provider/provider.go:144  [ct-closer] unable to move closed timestamp forward: not live
I181001 12:17:25.975911 226 server/status/runtime.go:465  [n1] runtime stats: 9.0 GiB RSS, 786 goroutines, 5.6 GiB/1.7 GiB/8.7 GiB GO alloc/idle/total, 261 MiB/319 MiB CGO alloc/total, 3553.7 CGO/sec, 205.8/5.3 %(u/s)time, 0.0 %gc (0x), 4.3 MiB/5.0 MiB (r/w)net
W181001 12:17:26.292736 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 2.0s
W181001 12:17:26.292844 6717238 storage/node_liveness.go:558  [n1,s1,r279654/1:/Table/53/1/6568{0792…-1530…}] slow heartbeat took 1.5s
W181001 12:17:26.293065 6717239 storage/node_liveness.go:558  [n1,s1,r279354/1:/Table/53/1/6893{7373…-8111…}] slow heartbeat took 1.5s
W181001 12:17:27.189458 104 gossip/gossip.go:1501  [n1] first range unavailable; trying remaining resolvers
I181001 12:17:27.196781 6717299 gossip/client.go:129  [n1] started gossip client to teamcity-934753-kv-splits-nodes-3-0003:26257
I181001 12:17:28.731630 6717364 gossip/client.go:129  [n1] started gossip client to teamcity-934753-kv-splits-nodes-3-0002:26257
W181001 12:17:33.259255 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:33.259304 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
I181001 12:17:35.979246 226 server/status/runtime.go:465  [n1] runtime stats: 9.0 GiB RSS, 776 goroutines, 6.6 GiB/894 MiB/8.7 GiB GO alloc/idle/total, 261 MiB/319 MiB CGO alloc/total, 3676.4 CGO/sec, 201.3/6.1 %(u/s)time, 0.0 %gc (0x), 4.2 MiB/4.4 MiB (r/w)net
W181001 12:17:37.534545 6712147 storage/intent_resolver.go:745  [n1,s1] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
W181001 12:17:37.760144 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:37.760202 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
W181001 12:17:37.946429 6717525 storage/replica.go:3582  [n1,s1,r344398/1:/Table/1{3/1/2018…-4}] context canceled before proposing: 1 RequestLease
W181001 12:17:42.275915 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:42.275966 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
I181001 12:17:44.339205 162 storage/replica_proposal.go:211  [n1,s1,r369638/1:/Table/53/1/47923{6229…-9918…}] new range lease repl=(n1,s1):1 seq=8 start=1538396231.705554528,0 epo=1 pro=1538396231.705558107,0 following repl=(n2,s2):2 seq=7 start=1538394902.205830683,0 epo=1 pro=1538394902.205832255,0
W181001 12:17:45.172442 195 storage/store.go:1651  [n1,s1,r1/1:/{Min-System/}] could not gossip first range descriptor: [NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; lease holder unknown
I181001 12:17:45.982060 226 server/status/runtime.go:465  [n1] runtime stats: 9.3 GiB RSS, 749 goroutines, 7.7 GiB/22 MiB/8.9 GiB GO alloc/idle/total, 262 MiB/319 MiB CGO alloc/total, 5135.8 CGO/sec, 284.1/9.7 %(u/s)time, 0.0 %gc (1x), 4.1 MiB/4.2 MiB (r/w)net
W181001 12:17:46.776957 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:46.776993 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
W181001 12:17:51.277402 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W181001 12:17:51.277434 233 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded
W181001 12:17:54.255791 202 storage/store_rebalancer.go:212  [n1,s1,store-rebalancer] StorePool missing descriptor for local store
I181001 12:17:54.257530 62 gossip/gossip.go:555  [n1] gossip status (stalled, 3 nodes)
gossip client (2/3 cur/max conns)
  2: teamcity-934753-kv-splits-nodes-3-0003:26257 (27s: infos 66/81 sent/received, bytes 23548B/12564B sent/received)
  3: teamcity-934753-kv-splits-nodes-3-0002:26257 (26s: infos 90/82 sent/received, bytes 26286B/16260B sent/received)
gossip server (0/3 cur/max conns, infos 388835/215832 sent/received, bytes 183546738B/110078836B sent/received)
gossip connectivity
  n1 -> n2; n1 -> n3; n2 -> n3;
W181001 12:17:55.420554 195 storage/store.go:1651  [n1,s1,r1/1:/{Min-System/}] could not gossip first range descriptor: [NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; lease holder unknown
W181001 12:17:55.778463 233 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s

After that point, all three nodes began having serious trouble with their liveness updates, which persisted for the rest of the test:

W181001 12:18:07.495467 6717420 storage/node_liveness.go:558  [n1,s1,r279203/1:/Table/53/1/6421{1696…-2434…}] slow heartbeat took 34.7s
E181001 12:18:07.495479 6717420 storage/replica_range_lease.go:287  [n1,s1,r279203/1:/Table/53/1/6421{1696…-2434…}] heartbeat failed on epoch increment
W181001 12:18:07.495616 6717421 storage/node_liveness.go:558  [n1,s1,r279134/1:/Table/53/1/64220{1819…-9197…}] slow heartbeat took 34.6s
E181001 12:18:07.495624 6717421 storage/replica_range_lease.go:287  [n1,s1,r279134/1:/Table/53/1/64220{1819…-9197…}] heartbeat failed on epoch increment
W181001 12:18:07.495754 6717383 storage/node_liveness.go:558  [n1,s1,r279532/1:/Table/53/1/629{19320…-20057…}] slow heartbeat took 34.6s
E181001 12:18:07.495765 6717383 storage/replica_range_lease.go:287  [n1,s1,r279532/1:/Table/53/1/629{19320…-20057…}] heartbeat failed on epoch increment

The interesting thing we see in those logs is that gossip appears to have issues at the same time that node liveness grinds to a halt.

One thing that catches my eye is that all three nodes log first range unavailable; trying remaining resolvers at the same time and then get stuck logging first range unavailable; resolvers exhausted for the rest of the test. @petermattis do you know anything about this? What does it mean for gossip to be stalled, and what effects do we expect this to have on a cluster?

petermattis commented 6 years ago

Gossip is stalled (for logging purposes) if a node does not have the sentinel key. The sentinel key now expires quite rapidly so that a gossip partition will be repaired quickly. I'm not familiar with that first range unavailable message. Looking right now, it seems to be indicating that the sentinel key is not present and has nothing to do with the first range. The sentinel key is gossiped by the leaseholder of the first range, so the log message might be a historic artifact. If there is no leaseholder for the first range, I can imagine all of the nodes simultaneously deciding that gossip is stalled.
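
To make that concrete, here is a minimal Go sketch of the stall check described above; the names (infoStore, sentinelKey, stalled) are illustrative stand-ins, not the actual gossip package API:

package main

import (
    "fmt"
    "time"
)

// infoStore is a toy stand-in for the gossip info store: a map from key to
// the expiration of the corresponding info.
type infoStore struct {
    infos map[string]time.Time
}

const sentinelKey = "sentinel"

// stalled reports whether this node would consider gossip stalled: the
// sentinel key is absent or has expired. Since the sentinel is re-gossiped by
// the leaseholder of the first range, losing that leaseholder can make every
// node reach this state at roughly the same time.
func (is *infoStore) stalled(now time.Time) bool {
    exp, ok := is.infos[sentinelKey]
    return !ok || now.After(exp)
}

func main() {
    is := &infoStore{infos: map[string]time.Time{}}
    if is.stalled(time.Now()) {
        // Roughly where "first range unavailable; trying remaining
        // resolvers" would be logged and fresh resolver connections attempted.
        fmt.Println("gossip stalled; trying remaining resolvers")
    }
}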

nvanbenschoten commented 6 years ago

Got it, thanks for the explanation.

I was just able to reproduce this after stressing it on three clusters for a few hours, so it's not terribly rare.

nvanbenschoten commented 6 years ago

The split attempts are all stuck on:

github.com/cockroachdb/cockroach/pkg/storage/idalloc.(*Allocator).Allocate(0xc4204b84d0, 0x3094bc0, 0xc4febbbbc0, 0x3094bc0, 0xc4febbbbc0, 0x100)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/idalloc/id_alloc.go:87 +0x129
github.com/cockroachdb/cockroach/pkg/storage.(*Store).NewRangeDescriptor(0xc4202a5b80, 0x3094bc0, 0xc4febbbbc0, 0xc5373bd6a0, 0xb, 0x10, 0xc514c5d050, 0xb, 0x10, 0xc534f6d410, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2318 +0x4a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).adminSplitWithDescriptor(0xc49ba95500, 0x3094bc0, 0xc4febbbbc0, 0xc5373bd680, 0xb, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
nvanbenschoten commented 6 years ago

Which is waiting on idalloc/id_alloc.go:112.
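
For context on why every split parks there: the allocator hands out range IDs from a locally cached block and refills that block with a single Increment on a central counter key, so if that refill can't make progress (say, because the range holding the counter key can't get a lease), every caller of Allocate blocks with it. A rough Go sketch of that shape, with hypothetical names rather than the actual idalloc code:

package main

import (
    "context"
    "fmt"
    "time"
)

// allocator hands out IDs from a locally cached block. In the real code the
// block is refilled by an Increment on a central counter key; here the refill
// simply never happens, standing in for a counter range without a lease.
type allocator struct {
    ids chan int64
}

func (a *allocator) Allocate(ctx context.Context) (int64, error) {
    select {
    case id := <-a.ids:
        return id, nil
    case <-ctx.Done():
        // This is where the stuck split attempts in the stack trace above
        // are waiting: the cached block is empty and the refill cannot make
        // progress, so every split waits here until its context expires.
        return 0, ctx.Err()
    }
}

func main() {
    a := &allocator{ids: make(chan int64)} // empty block, no refill in flight
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    if _, err := a.Allocate(ctx); err != nil {
        fmt.Println(err) // context deadline exceeded, as in the test failure
    }
}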

nvanbenschoten commented 6 years ago

The idalloc.Allocator's block allocation is stuck in the command queue on r4. I can see logs like:

I181002 02:21:41.329185 2669993 storage/replica.go:2370  [n2,s2,r4/3:/System/{NodeLive…-tsd}] waiting for 1 overlapping requests: {write/global: [1 Inc]}

so one of the increments must be stuck beneath the command queue.

r4 contains the key range [/System/NodeLivenessMax, /System/tsd) and currently has an invalid lease. The range starts after the expiration-based lease cut-off because its start key is not less than /System/NodeLivenessMax, so it is presumably trying to acquire an epoch-based lease, which requires a healthy liveness range.
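
The cut-off itself is just a key comparison; a hedged sketch of the decision (the helper name and key literal are illustrative, the real logic lives in the replica lease code):

package main

import (
    "bytes"
    "fmt"
)

// Ranges whose start key sorts below NodeLivenessMax use expiration-based
// leases, so that the liveness data itself never depends on liveness. Every
// other range, including r4 here, uses an epoch-based lease and therefore
// needs a healthy liveness range to acquire or extend it.
var nodeLivenessKeyMax = []byte("/System/NodeLivenessMax")

func usesExpirationLease(rangeStartKey []byte) bool {
    return bytes.Compare(rangeStartKey, nodeLivenessKeyMax) < 0
}

func main() {
    fmt.Println(usesExpirationLease([]byte("/System/NodeLiveness")))    // true: r3, the liveness range
    fmt.Println(usesExpirationLease([]byte("/System/NodeLivenessMax"))) // false: r4, epoch-based
}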

nvanbenschoten commented 6 years ago

n2 (the leaseholder of r4) just crashed with an OOM error after I poked it a little too aggressively with logspy. Hopefully there's some more we can learn from its heap dumps.

nvanbenschoten commented 6 years ago

As suspected, the lease request to acquire the lease for r4 was stuck at https://github.com/cockroachdb/cockroach/blob/0f5a289d49105cd317a2713b71546d84c25547f6/pkg/storage/replica_range_lease.go#L290

nvanbenschoten commented 6 years ago

The memory blowup was in the statusServer; it had some call to JSONPb.Marshal. I'm going to assume that I caused it by poking at the admin UI.

nvanbenschoten commented 6 years ago

I traced requestLeaseAsync's call into (*NodeLiveness).Heartbeat and up to beginCmds. Luckily, it looks like the NodeLiveness range is also on n2. This gave me the following stack frame:

github.com/cockroachdb/cockroach/pkg/storage.(*Replica).beginCmds(0xc42022e300, 0x3094bc0, 0xc551eed350, 0xc5ec037b80, 0xc4bee6aba0, 0xb, 0x10, 0xc425355320)

The first argument here is the NodeLiveness Replica's memory address.

Searching for other stack frames with this argument reveals that NodeLiveness itself is struggling to acquire a lease (this time, expiration-based). At least one tryExecuteWriteBatch call on the range is blocked in redirectOnOrAcquireLease.

I also see a call to getLeaseForGossip, which is stuck in the same place. That call is starting from https://github.com/cockroachdb/cockroach/blob/0f5a289d49105cd317a2713b71546d84c25547f6/pkg/storage/store.go#L1624-L1629

nvanbenschoten commented 6 years ago

Aha! I also see the stuck NodeLiveness lease request, which has been waiting on Raft for 6 minutes:

goroutine 2672554 [select, 6 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc42022e300, 0x3094b00, 0xc76b15a600, 0x1559a98647992a2f, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3258 +0x961
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc42022e300, 0x3094b00, 0xc76b15a600, 0x1559a98647992a2f, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3075 +0xab
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc42022e300, 0x3094b00, 0xc76b15a600, 0x3, 0x1559a98647992a2f, 0x0, 0x0, 0x0, 0x3, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2024 +0x6f3
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc42022e300, 0x3094b00, 0xc76b15a600, 0x1559a98647992a2f, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1972 +0x90
github.com/cockroachdb/cockroach/pkg/storage.(*pendingLeaseRequest).requestLeaseAsync.func2(0x3094b00, 0xc76b15a600)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_range_lease.go:321 +0x5c1

So why is this stuck?

nvanbenschoten commented 6 years ago

I think I'm going to need another live reproduction for this. I did find some interesting logs on n2 though:

W181002 02:25:38.275112 177 storage/replica.go:1553  [n2,s2,r3/2:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
W181002 02:25:38.275167 2671869 storage/replica.go:3282  [n2,s2,r3/2:/System/NodeLiveness{-Max}] have been waiting 1m0s for proposing command RequestLease [/System/NodeLiveness,/Min)
I181002 02:25:41.846915 2671869 storage/replica.go:3291  [n2,s2,r3/2:/System/NodeLiveness{-Max}] slow command RequestLease [/System/NodeLiveness,/Min) finished after 1m3.604019832s
I181002 02:25:41.847052 177 storage/replica.go:1558  [n2,s2,r3/2:/System/NodeLiveness{-Max}] slow lease acquisition finished after 1m3.604430509s with error [NotLeaseHolderError] r3: replica (n2,s2):2 not lease holder; lease holder unknown after 1 attempts
W181002 02:25:41.847083 177 storage/store.go:1651  [n2,s2,r3/2:/System/NodeLiveness{-Max}] could not gossip node liveness: [NotLeaseHolderError] r3: replica (n2,s2):2 not lease holder; lease holder unknown
W181002 02:28:19.422663 2672554 storage/replica.go:3282  [n2,s2,r3/2:/System/NodeLiveness{-Max}] have been waiting 1m0s for proposing command RequestLease [/System/NodeLiveness,/Min)
W181002 02:28:47.277546 177 storage/replica.go:1553  [n2,s2,r3/2:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
nvanbenschoten commented 6 years ago

I was able to reproduce this again and tracked the issue up to problems with lease acquisition in r1.

All three replicas have pending commands and keep logging I181002 05:37:52.204846 151 storage/replica.go:4628 [n2,s2,r1/3:/M{in-eta2/Table/…}] not quiescing: 1 pending commands. All three replicas also keep proposing a command with 1 RequestLease.

The interesting thing is that I see leases continually getting rejected with errors like:

I181002 05:37:52.216711 151 storage/replica.go:5205  [n2,s2,r1/3:/M{in-eta2/Table/…}] command proposed from replica (n1,s1):1 with lease #3 incompatible to repl=(n2,s2):3 seq=3 start=1538448219.495057052,0 exp=1538458659.106336434,0 pro=1538458650.106341768,0
I181002 05:37:52.216740 151 storage/replica.go:5345  [n2,s2,r1/3:/M{in-eta2/Table/…}] applying command with forced error: cannot replace lease repl=(n2,s2):3 seq=3 start=1538448219.495057052,0 exp=1538458659.106336434,0 pro=1538458650.106341768,0 with repl=(n1,s1):1 seq=4 start=1538458625.064248657,1 exp=1538458647.320924838,0 pro=1538458638.320939183,0: proposed under invalid lease

and

I181002 05:38:01.506964 49 storage/replica.go:5205  [n2,s2,r1/3:/M{in-eta2/Table/…}] command proposed from replica (n3,s3):2 with lease #3 incompatible to repl=(n2,s2):3 seq=3 start=1538448219.495057052,0 exp=1538458659.106336434,0 pro=1538458650.106341768,0
I181002 05:38:01.506993 49 storage/replica.go:5345  [n2,s2,r1/3:/M{in-eta2/Table/…}] applying command with forced error: cannot replace lease repl=(n2,s2):3 seq=3 start=1538448219.495057052,0 exp=1538458659.106336434,0 pro=1538458650.106341768,0 with repl=(n3,s3):2 seq=4 start=1538458640.984601682,1 exp=1538458664.287062629,0 pro=1538458655.287069407,0: proposed under invalid lease

I'm starting to see a lot of similarities between this issue and #30064. I think there's actually a good chance they're the same. What do you think @tschottdorf?

From what I can tell, this must be running into the check added in 5c7b7705f90a17469c71dada2d2dc155bac92978. That's the only way for leaseMismatch to be true when the proposer lease and the current lease have seq num 3 and the proposed lease has seq num 4. That check should only protect against concurrent lease extensions; it shouldn't fire indefinitely, so something seems off here.
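
To spell out the suspected mechanism, here is a stripped-down sketch of a check of that shape; the type and function names are illustrative, not the actual replica code. A RequestLease command carries the lease it was proposed under, and if that lease has been extended by the time the command applies, the proposed timestamps no longer match and the command is rejected with the "proposed under invalid lease" forced error:

package main

import "fmt"

// lease is a stripped-down stand-in: a sequence number plus the timestamp at
// which this particular lease (or lease extension) was proposed.
type lease struct {
    Sequence   int64
    ProposedTS int64
}

// leaseMismatch sketches the stricter check for lease requests: even when the
// proposer's lease has the right sequence number, a differing proposed
// timestamp means the lease was extended after the command was proposed, and
// the command gets a forced error at apply time.
func leaseMismatch(isLeaseRequest bool, proposerLease, currentLease lease) bool {
    if proposerLease.Sequence != currentLease.Sequence {
        return true
    }
    if isLeaseRequest && proposerLease.ProposedTS != currentLease.ProposedTS {
        return true // "proposed under invalid lease"
    }
    return false
}

func main() {
    // n2 extended its seq=3 lease (new ProposedTS) while n1's RequestLease
    // proposal was in flight; the proposal arrives carrying the older
    // extension and is rejected. If Raft cycles are slow enough, every retry
    // is late again and the rejections repeat.
    current := lease{Sequence: 3, ProposedTS: 1538458650}
    stale := lease{Sequence: 3, ProposedTS: 1538458600}
    fmt.Println(leaseMismatch(true, stale, current)) // true
}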

I'm also confused about where the RequestLease proposal from n2 is going. It certainly shouldn't have the wrong previous lease proposal ts, since n2 was the owner of the previous lease. Perhaps two different things are going wrong and only their combination causes this full stall.

nvanbenschoten commented 6 years ago

Never mind, I actually do see n2's lease extension get applied:

exp=1538460769.440352196,0 pro=1538460760.440376406,0 following repl=(n2,s2):3 seq=3 start=1538448219.495057052,0 exp=1538460762.688870621,0 pro=1538460753.688877340,0
I181002 06:13:11.121996 158 storage/replica.go:6623  [n2,s2,r1/3:/M{in-eta2/Table/…}] not gossiping system config because the replica doesn't contain the system config's start key

But new lease requests are proposed on all three replicas immediately afterwards.

nvanbenschoten commented 6 years ago

Perhaps the proposal was so slow that the lease had already expired by the time it was applied.

tbg commented 6 years ago

I'm looking at logspy to confirm this right now, but I get the impression that you're right and that these lease errors are a red herring. n2 holds the lease steadily, and n1's and n3's proposals always seem to be proposed under a past incarnation of n2's lease. The real question is why it takes so long to get through a Raft cycle.

tbg commented 6 years ago

For example, logspying via

http://n2:26258/debug/logspy?count=1000&duration=600s&grep=r1/.*evaluated%20RequestLease|r1/.*forced%20error

we can see the following two lease extensions (note how the proposed_ts of the first matches the proposed ts of the prev_lease of the second line):

I181002 11:58:34.182316 evaluated RequestLease command header:<> lease:<start: expiration: replica: proposed_ts: > prev_lease:<start: expiration: replica: deprecated_start_stasis: proposed_ts: sequence:3 > min_proposed_ts:<> : header:<> , err=
I181002 11:58:49.947290 3845851 evaluated RequestLease command header:<> lease:<start: expiration: replica: proposed_ts: > prev_lease:<start: expiration: replica: deprecated_start_stasis: proposed_ts: sequence:3 > min_proposed_ts:<> : header:<> , err=

There are 15 seconds between these events. We expect expiration-based leases to be renewed more frequently, but here in these logs I see much larger gaps. For example, here are two adjacent denied lease requests from n3:

I181002 11:59:16.376303 156 storage/replica.go:5281 [n2,s2,r1/3:/M{in-eta2/Table/…}] applying command with forced error: cannot replace lease repl=(n2,s2):3 seq=3 start=1538441345.643284120,0 exp=1538481545.972658550,0 pro=1538481536.972699798,0 with repl=(n3,s3):2 seq=4 start=1538481523.181941950,1 exp=1538481541.262538763,0 pro=1538481532.262558812,0: proposed under invalid lease
I181002 11:59:52.427325 110 storage/replica.go:5281 [n2,s2,r1/3:/M{in-eta2/Table/…}] applying command with forced error: cannot replace lease repl=(n2,s2):3 seq=3 start=1538441345.643284120,0 exp=1538481570.394841780,0 pro=1538481561.394865011,0 with repl=(n3,s3):2 seq=4 start=1538481553.663359371,1 exp=1538481573.730669100,0 pro=1538481564.730699065,0: proposed under invalid lease

That's a 46s gap. So what's making Raft so slow? You could guess that something's busy-looping and really overloading Raft itself, but at least on r1 that's not the case. The Readys emitted for r1 are sometimes 15s apart! It's very rare to see two close to each other. On the other hand, we're not seeing any of the signs of slow disk I/O. The CPU profile looks as expected (lots of ticking and stats looping, and when GC runs, it's expensive). The process eats ~300% CPU (on a 4-vCPU node).

Could it be as simple as

tbg commented 6 years ago

I'm not a blocking-profile-expert, but is this interesting?

[image: blocking profile]

If the Raft scheduler is contending on its mutex, that could explain why Raft runs so slowly.
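
For reference, a scheduler of that shape is essentially a fixed pool of workers pulling range IDs off one shared queue guarded by a single mutex and sync.Cond. A minimal sketch (not the CockroachDB implementation) of why it can serialize badly once hundreds of thousands of replicas enqueue at the same time:

package main

import "sync"

// scheduler is a toy Raft scheduler: one queue of range IDs, one mutex, one
// cond. Every enqueue and every dequeue takes the same lock, so when ~500k
// unquiesced replicas tick and propose, the lock and the cond's wait/signal
// traffic become the bottleneck and each individual range, including r1, is
// serviced only rarely.
type scheduler struct {
    mu      sync.Mutex
    cond    *sync.Cond
    queue   []int64
    queued  map[int64]bool // a range is enqueued at most once
    process func(rangeID int64)
}

func newScheduler(process func(int64)) *scheduler {
    s := &scheduler{queued: map[int64]bool{}, process: process}
    s.cond = sync.NewCond(&s.mu)
    return s
}

func (s *scheduler) enqueue(rangeID int64) {
    s.mu.Lock()
    if !s.queued[rangeID] {
        s.queued[rangeID] = true
        s.queue = append(s.queue, rangeID)
        s.cond.Signal()
    }
    s.mu.Unlock()
}

func (s *scheduler) worker() {
    for {
        s.mu.Lock()
        for len(s.queue) == 0 {
            s.cond.Wait()
        }
        id := s.queue[0]
        s.queue = s.queue[1:]
        delete(s.queued, id)
        s.mu.Unlock()
        s.process(id) // Raft ready handling happens outside the lock
    }
}

func main() {
    s := newScheduler(func(int64) {})
    for i := 0; i < 4; i++ { // worker count on the order of the 4 vCPUs here
        go s.worker()
    }
    for id := int64(1); id <= 500000; id++ {
        s.enqueue(id)
    }
}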

tbg commented 6 years ago

Here's a flame graph (not from the same profile):

[image: flame graph of the blocking profile]

tbg commented 6 years ago

maybe there's some pathological interaction where sometimes all ranges unquiesce (where normally almost all of them are quiesced) and once that happens, the scheduler gets horked up and r1 shoots itself (and so nothing ever quiesces again due to outstanding proposals)

tbg commented 6 years ago

Here's more evidence (?), the runtime scheduler profile:

[image: runtime scheduler profile]

PS I'm starting to think that we should automatically download a runtime trace on certain events.

Close-up of the runtime trace view:

[image: close-up of the runtime trace view]

TL;DR (you can't see this from the pic, but the trace is attached): all I see is the Raft scheduler.

trace.raw.txt

tbg commented 6 years ago

(I'm not sure that I'm reading all of this correctly because that scheduler mutex is used for a sync.Cond).

tbg commented 6 years ago

Just a heads up that the repro just got destroyed (timed out) @nvanbenschoten

tbg commented 6 years ago

Do you still think this is https://github.com/cockroachdb/cockroach/issues/30064? Because it doesn't seem like it to me (or at least nothing I suspect above would apply to that issue).

nvanbenschoten commented 6 years ago

maybe there's some pathological interaction where sometimes all ranges unquiesce (where normally almost all of them are quiesced) and once that happens, the scheduler gets horked up and r1 shoots itself (and so nothing ever quiesces again due to outstanding proposals)

That was my conclusion as well. I saw stack traces, logs, and mutex profiles showing that a large number of ranges were suddenly unquiescing and attempting to quiesce again. I think it's possible that on a 4-core machine, 500,000 unquiescing ranges overload everything to the point where r1 and node liveness fall over. Once these two ranges get stuck, nothing can quiesce and the cluster grinds to a halt.
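
A back-of-envelope calculation of the load, assuming the default 200ms Raft tick interval and an even spread of replicas across the three nodes (both assumptions, not measurements from this run):

package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        ranges           = 500000
        replicasPerRange = 3
        nodes            = 3
        tickInterval     = 200 * time.Millisecond // assumed default
    )
    // With everything unquiesced, each node owns ~500k replicas and each of
    // them must be ticked every tick interval: ~2.5 million ticks per second
    // per 4-core node, before any actual proposals or Ready processing.
    replicasPerNode := ranges * replicasPerRange / nodes
    ticksPerSecond := float64(replicasPerNode) / tickInterval.Seconds()
    fmt.Printf("replicas per node: %d\n", replicasPerNode)
    fmt.Printf("raft ticks per second per node: %.0f\n", ticksPerSecond)
}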

We should try running this test with quiescence disabled from the beginning and see how far we get.

nvanbenschoten commented 6 years ago

Just found a comment that I meant to post last night after https://github.com/cockroachdb/cockroach/issues/30832#issuecomment-426162422.

That seems likely. The period from [n2] submitting proposal <leaseReqID> to [n2] processing command <leaseReqID> is hovering around 19 seconds. This lines up with the duration of liveness heartbeats we see in the logs.

tbg commented 6 years ago

(Renaming this issue in an attempt to lure the issue poster into putting both the quiesce={true,false} failures here.)

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
    test.go:639,cluster.go:1118,kv.go:192,cluster.go:1440,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-995412-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        239 640 workload/workload.go:463  finished 441000 of 500000 splits
        I181031 15:10:35.914756 640 workload/workload.go:463  finished 442000 of 500000 splits
        I181031 15:10:47.660889 640 workload/workload.go:463  finished 443000 of 500000 splits
        I181031 15:10:58.764507 640 workload/workload.go:463  finished 444000 of 500000 splits
        I181031 15:11:12.917784 640 workload/workload.go:463  finished 445000 of 500000 splits
        I181031 15:11:26.294120 640 workload/workload.go:463  finished 446000 of 500000 splits
        I181031 15:11:37.421608 640 workload/workload.go:463  finished 447000 of 500000 splits
        I181031 15:11:52.721721 640 workload/workload.go:463  finished 448000 of 500000 splits
        I181031 15:12:04.125430 640 workload/workload.go:463  finished 449000 of 500000 splits
        I181031 15:12:15.119790 640 workload/workload.go:463  finished 450000 of 500000 splits
        I181031 15:12:32.175328 640 workload/workload.go:463  finished 451000 of 500000 splits
        I181031 15:12:45.589561 640 workload/workload.go:463  finished 452000 of 500000 splits
        : signal: interrupt
    test.go:639,cluster.go:1461,kv.go:195: signal: interrupt
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/b3ba03fdb86cbefceb42f46069ffd685749aa7b0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1008034&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (2h0m0s)
    test.go:639,cluster.go:1118,kv.go:201,cluster.go:1440,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1008034-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        04:28:29.137290 28 workload/workload.go:463  finished 468000 of 500000 splits
        I181113 04:28:44.157813 28 workload/workload.go:463  finished 469000 of 500000 splits
        I181113 04:28:57.848084 28 workload/workload.go:463  finished 470000 of 500000 splits
        I181113 04:29:16.958062 28 workload/workload.go:463  finished 471000 of 500000 splits
        I181113 04:29:25.121726 28 workload/workload.go:463  finished 472000 of 500000 splits
        I181113 04:29:37.339797 28 workload/workload.go:463  finished 473000 of 500000 splits
        I181113 04:30:01.301964 28 workload/workload.go:463  finished 474000 of 500000 splits
        I181113 04:30:08.987929 28 workload/workload.go:463  finished 475000 of 500000 splits
        I181113 04:30:21.277586 28 workload/workload.go:463  finished 476000 of 500000 splits
        I181113 04:30:39.053483 28 workload/workload.go:463  finished 477000 of 500000 splits
        I181113 04:30:47.307021 28 workload/workload.go:463  finished 478000 of 500000 splits
        I181113 04:30:57.995728 28 workload/workload.go:463  finished 479000 of 500000 splits
        : signal: killed
    test.go:639,cluster.go:1461,kv.go:204: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/b01468f592f6b12ba147e1c19ea6d8ad7484ff6e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1007980&tab=buildLog

The test failed on release-2.1:
    test.go:1037: test timed out (2h0m0s)
    test.go:639,cluster.go:1118,kv.go:201,cluster.go:1440,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1007980-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        900 629 workload/workload.go:463  finished 206000 of 500000 splits
        I181113 04:05:33.955608 629 workload/workload.go:463  finished 207000 of 500000 splits
        I181113 04:12:27.416414 629 workload/workload.go:463  finished 208000 of 500000 splits
        I181113 04:19:15.083778 629 workload/workload.go:463  finished 209000 of 500000 splits
        I181113 04:26:22.968465 629 workload/workload.go:463  finished 210000 of 500000 splits
        I181113 04:33:31.326854 629 workload/workload.go:463  finished 211000 of 500000 splits
        I181113 04:41:40.248233 629 workload/workload.go:463  finished 212000 of 500000 splits
        I181113 04:51:57.001044 629 workload/workload.go:463  finished 213000 of 500000 splits
        I181113 05:02:17.993776 629 workload/workload.go:463  finished 214000 of 500000 splits
        I181113 05:09:56.116334 629 workload/workload.go:463  finished 215000 of 500000 splits
        I181113 05:19:28.592110 629 workload/workload.go:463  finished 216000 of 500000 splits
        I181113 05:27:56.093064 629 workload/workload.go:463  finished 217000 of 500000 splits
        : signal: killed
    test.go:639,cluster.go:1461,kv.go:204: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/704e5b8e9619a5015fa4ccbd13de2e5353221ca5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1019281&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1019281-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        132 627 workload/workload.go:463  finished 431000 of 500000 splits
        I181120 13:36:44.895418 627 workload/workload.go:463  finished 432000 of 500000 splits
        I181120 13:36:59.016866 627 workload/workload.go:463  finished 433000 of 500000 splits
        I181120 13:37:10.646633 627 workload/workload.go:463  finished 434000 of 500000 splits
        I181120 13:37:27.148906 627 workload/workload.go:463  finished 435000 of 500000 splits
        I181120 13:37:33.490494 627 workload/workload.go:463  finished 436000 of 500000 splits
        I181120 13:37:50.159046 627 workload/workload.go:463  finished 437000 of 500000 splits
        I181120 13:38:02.879321 627 workload/workload.go:463  finished 438000 of 500000 splits
        I181120 13:38:19.253057 627 workload/workload.go:463  finished 439000 of 500000 splits
        I181120 13:38:27.973268 627 workload/workload.go:463  finished 440000 of 500000 splits
        I181120 13:38:46.073119 627 workload/workload.go:463  finished 441000 of 500000 splits
        I181120 13:38:59.876803 627 workload/workload.go:463  finished 442000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/cfd53f62f038a06e34570f11ec02bc74dabd2aab

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1021046&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1021046-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        733 636 workload/workload.go:463  finished 488000 of 500000 splits
        I181121 14:14:41.344236 636 workload/workload.go:463  finished 489000 of 500000 splits
        I181121 14:14:53.093100 636 workload/workload.go:463  finished 490000 of 500000 splits
        I181121 14:15:21.530784 636 workload/workload.go:463  finished 491000 of 500000 splits
        I181121 14:15:32.824132 636 workload/workload.go:463  finished 492000 of 500000 splits
        I181121 14:15:48.204274 636 workload/workload.go:463  finished 493000 of 500000 splits
        I181121 14:16:00.406519 636 workload/workload.go:463  finished 494000 of 500000 splits
        I181121 14:16:21.136459 636 workload/workload.go:463  finished 495000 of 500000 splits
        I181121 14:16:39.268036 636 workload/workload.go:463  finished 496000 of 500000 splits
        I181121 14:16:54.834918 636 workload/workload.go:463  finished 497000 of 500000 splits
        I181121 14:17:19.009323 636 workload/workload.go:463  finished 498000 of 500000 splits
        I181121 14:17:37.081012 636 workload/workload.go:463  finished 499000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/96f03c95d5078ebad7167c5cdb145e365978a008

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1022695&tab=buildLog

The test failed on master:
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1022695-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        dial tcp 10.128.0.2:26257: connect: connection refused
        W181122 14:49:36.543751 295 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-4417119374443224064) TO (-4417119374443224064): dial tcp 10.128.0.2:26257: connect: connection refused
        W181122 14:49:36.543805 470 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (800642431746353152) TO (800642431746353152): dial tcp 10.128.0.2:26257: connect: connection refused
        W181122 14:49:36.543872 542 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (3615609947459397632) TO (3615609947459397632): dial tcp 10.128.0.2:26257: connect: connection refused
        W181122 14:49:36.543888 571 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-7022126469030150144) TO (-7022126469030150144): read tcp 10.128.0.3:37898->10.128.0.2:26257: read: connection reset by peer
        W181122 14:49:36.543920 447 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (1556219557851242496) TO (1556219557851242496): read tcp 10.128.0.3:37896->10.128.0.2:26257: read: connection reset by peer
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: unexpected node event: 1: dead
tbg commented 6 years ago

^- OOM


fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x2d17a55, 0x16)
    /usr/local/go/src/runtime/panic.go:616 +0x81
runtime.sysMap(0xc772360000, 0x2ab0000, 0x0, 0x4ab8c18)
    /usr/local/go/src/runtime/mem_linux.go:216 +0x20a
runtime.(*mheap).sysAlloc(0x4a76ba0, 0x2ab0000, 0x179d6ab)
    /usr/local/go/src/runtime/malloc.go:470 +0xd4
runtime.(*mheap).grow(0x4a76ba0, 0x1552, 0x0)

proffie.961.inuse_objects.pdf proffie.961.inuse_space.pdf proffie.961.alloc_objects.pdf proffie.961.alloc_space.pdf

inuse_objects looks commandqueue-heavy. Is this new? cc @nvanbenschoten

[image: inuse_objects heap profile]

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/93632938e9ea7d86a56ec8423f2eff66c30965d5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1023971&tab=buildLog

The test failed on master:
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1023971-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        019203969024) TO (6143436019203969024): dial tcp 10.128.0.4:26257: connect: connection refused
        W181123 14:50:34.618801 597 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-2148580218824882176) TO (-2148580218824882176): dial tcp 10.128.0.4:26257: connect: connection refused
        W181123 14:50:34.618820 384 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (-2136184031599723520) TO (-2136184031599723520): dial tcp 10.128.0.4:26257: connect: connection refused
        W181123 14:50:34.618876 522 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (6143067085060364288) TO (6143067085060364288): dial tcp 10.128.0.4:26257: connect: connection refused
        W181123 14:50:35.618441 473 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (6133585477569691648) TO (6133585477569691648): dial tcp 10.128.0.4:26257: connect: connection refused
        W181123 14:50:35.618472 402 workload/workload.go:434  ALTER TABLE kv SCATTER FROM (4776387443486654464) TO (4776387443486654464): dial tcp 10.128.0.4:26257: connect: connection refused
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: unexpected node event: 2: dead
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/1bcb46e7bf7c03e95128f44de8f24dbdc6a231f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1024654&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-1024654-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        895 625 workload/workload.go:463  finished 483000 of 500000 splits
        I181124 14:14:25.495096 625 workload/workload.go:463  finished 484000 of 500000 splits
        I181124 14:14:40.446652 625 workload/workload.go:463  finished 485000 of 500000 splits
        I181124 14:14:50.594225 625 workload/workload.go:463  finished 486000 of 500000 splits
        I181124 14:15:14.654771 625 workload/workload.go:463  finished 487000 of 500000 splits
        I181124 14:15:28.910771 625 workload/workload.go:463  finished 488000 of 500000 splits
        I181124 14:15:46.816859 625 workload/workload.go:463  finished 489000 of 500000 splits
        I181124 14:16:00.372353 625 workload/workload.go:463  finished 490000 of 500000 splits
        I181124 14:16:17.759874 625 workload/workload.go:463  finished 491000 of 500000 splits
        I181124 14:16:28.655378 625 workload/workload.go:463  finished 492000 of 500000 splits
        I181124 14:16:45.667370 625 workload/workload.go:463  finished 493000 of 500000 splits
        I181124 14:17:02.666340 625 workload/workload.go:463  finished 494000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: Goexit() was called
nvanbenschoten commented 6 years ago

inuse_objects looks commandqueue-heavy. Is this new? cc @nvanbenschoten

Yes, that is new. It's almost certainly because of https://github.com/cockroachdb/cockroach/commit/5b1ad6864b61707a7f12f317b12a93d589dbbf4d#diff-c3bd9e9a650f348ddf8612d19debfe87R239, which wasn't intended to go into that change. I don't know of any issues with the btree implementation, but it clearly has a larger memory footprint than the llrb one. It should be resolved with #32584.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/07edde23460e8ffe0ec40f89975c3b95fc28343e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1030507&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1030507-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        537 621 workload/workload.go:463  finished 382000 of 500000 splits
        I181129 13:40:50.245870 621 workload/workload.go:463  finished 383000 of 500000 splits
        I181129 13:41:00.836395 621 workload/workload.go:463  finished 384000 of 500000 splits
        I181129 13:41:09.417963 621 workload/workload.go:463  finished 385000 of 500000 splits
        I181129 13:41:18.468417 621 workload/workload.go:463  finished 386000 of 500000 splits
        I181129 13:41:30.203088 621 workload/workload.go:463  finished 387000 of 500000 splits
        I181129 13:41:40.284280 621 workload/workload.go:463  finished 388000 of 500000 splits
        I181129 13:41:46.228165 621 workload/workload.go:463  finished 389000 of 500000 splits
        I181129 13:41:58.007790 621 workload/workload.go:463  finished 390000 of 500000 splits
        I181129 13:42:11.187337 621 workload/workload.go:463  finished 391000 of 500000 splits
        I181129 13:42:21.038254 621 workload/workload.go:463  finished 392000 of 500000 splits
        I181129 13:42:30.125442 621 workload/workload.go:463  finished 393000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/6fb1b0037ead768572b49646e0ffbe73013a8f9a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1032776&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1463,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1032776-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        614 636 workload/workload.go:463  finished 487000 of 500000 splits
        I181130 13:43:03.287880 636 workload/workload.go:463  finished 488000 of 500000 splits
        I181130 13:43:16.647099 636 workload/workload.go:463  finished 489000 of 500000 splits
        I181130 13:43:31.210421 636 workload/workload.go:463  finished 490000 of 500000 splits
        I181130 13:43:40.785362 636 workload/workload.go:463  finished 491000 of 500000 splits
        I181130 13:43:56.083281 636 workload/workload.go:463  finished 492000 of 500000 splits
        I181130 13:44:10.080936 636 workload/workload.go:463  finished 493000 of 500000 splits
        I181130 13:44:19.808364 636 workload/workload.go:463  finished 494000 of 500000 splits
        I181130 13:44:32.896721 636 workload/workload.go:463  finished 495000 of 500000 splits
        I181130 13:44:47.734149 636 workload/workload.go:463  finished 496000 of 500000 splits
        I181130 13:45:00.992474 636 workload/workload.go:463  finished 497000 of 500000 splits
        I181130 13:45:07.990652 636 workload/workload.go:463  finished 498000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1484,kv.go:201: Goexit() was called
tbg commented 6 years ago

I was planning to change the tests to run less aggressive range counts, but I'll hold off for a few days to see if @ajwerner's change #30152 gives us enough of an edge to reliably pass the current configuration. Who knows!

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/b775ca0318125c996305e4ab560e75d2d3471547

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1036381&tab=buildLog

The test failed on master:
    test.go:1047: test timed out (2h0m0s)
    test.go:645,cluster.go:1141,kv.go:198,cluster.go:1467,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1036381-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        633 639 workload/workload.go:463  finished 471000 of 500000 splits
        I181203 15:29:51.595556 639 workload/workload.go:463  finished 472000 of 500000 splits
        I181203 15:30:03.886793 639 workload/workload.go:463  finished 473000 of 500000 splits
        I181203 15:30:18.855222 639 workload/workload.go:463  finished 474000 of 500000 splits
        I181203 15:30:32.532718 639 workload/workload.go:463  finished 475000 of 500000 splits
        I181203 15:30:43.161972 639 workload/workload.go:463  finished 476000 of 500000 splits
        I181203 15:30:59.892013 639 workload/workload.go:463  finished 477000 of 500000 splits
        I181203 15:31:10.946592 639 workload/workload.go:463  finished 478000 of 500000 splits
        I181203 15:31:24.650485 639 workload/workload.go:463  finished 479000 of 500000 splits
        I181203 15:31:44.063639 639 workload/workload.go:463  finished 480000 of 500000 splits
        I181203 15:31:53.899373 639 workload/workload.go:463  finished 481000 of 500000 splits
        I181203 15:32:06.534286 639 workload/workload.go:463  finished 482000 of 500000 splits
        : signal: killed
    test.go:645,cluster.go:1488,kv.go:201: Goexit() was called
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/e6cb0c5c329617b560eee37527248171b5e06382

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1038478&tab=buildLog

The test failed on master:
    test.go:630,cluster.go:1141,kv.go:195,cluster.go:1467,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1038478-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        go:463  finished 231000 of 500000 splits
        I181204 13:52:44.148191 628 workload/workload.go:463  finished 232000 of 500000 splits
        I181204 13:52:52.271181 628 workload/workload.go:463  finished 233000 of 500000 splits
        I181204 13:52:57.714371 628 workload/workload.go:463  finished 234000 of 500000 splits
        I181204 13:53:06.658652 628 workload/workload.go:463  finished 235000 of 500000 splits
        I181204 13:53:15.013070 628 workload/workload.go:463  finished 236000 of 500000 splits
        I181204 13:53:21.813337 628 workload/workload.go:463  finished 237000 of 500000 splits
        I181204 13:53:30.632485 628 workload/workload.go:463  finished 238000 of 500000 splits
        I181204 13:53:38.783388 628 workload/workload.go:463  finished 239000 of 500000 splits
        I181204 13:53:45.627310 628 workload/workload.go:463  finished 240000 of 500000 splits
        Error: ALTER TABLE kv SPLIT AT VALUES (-8061045017424375808): pq: split at key /Table/53/1/-8061045017424375808 failed: aborted while in command queue: context deadline exceeded
        Error:  exit status 1
        : exit status 1
    test.go:630,cluster.go:1488,kv.go:198: Goexit() was called
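
Note: for anyone poking at this by hand, the statement the workload was issuing when it hit the deadline is an ordinary split, so it can be replayed directly against a running cluster. This is just a sketch; the value below is the one from the log above and is otherwise arbitrary.

-- Replay of the failing split from the output above; any split key works.
ALTER TABLE kv SPLIT AT VALUES (-8061045017424375808);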
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/06d2222fd9010f01a8cdf6a6c24597bbed181f36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1044826&tab=buildLog

The test failed on master:
    test.go:1025: test timed out (2h0m0s)
    test.go:630,cluster.go:1139,kv.go:195,cluster.go:1465,errgroup.go:58: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1044826-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        180 636 workload/workload.go:463  finished 372000 of 500000 splits
        I181207 14:23:49.965782 636 workload/workload.go:463  finished 373000 of 500000 splits
        I181207 14:23:59.201443 636 workload/workload.go:463  finished 374000 of 500000 splits
        I181207 14:24:06.735991 636 workload/workload.go:463  finished 375000 of 500000 splits
        I181207 14:24:17.769891 636 workload/workload.go:463  finished 376000 of 500000 splits
        I181207 14:24:32.065198 636 workload/workload.go:463  finished 377000 of 500000 splits
        I181207 14:24:38.139086 636 workload/workload.go:463  finished 378000 of 500000 splits
        I181207 14:24:51.812791 636 workload/workload.go:463  finished 379000 of 500000 splits
        I181207 14:25:00.071184 636 workload/workload.go:463  finished 380000 of 500000 splits
        I181207 14:25:07.532477 636 workload/workload.go:463  finished 381000 of 500000 splits
        I181207 14:25:17.680893 636 workload/workload.go:463  finished 382000 of 500000 splits
        I181207 15:31:46.327559 636 workload/workload.go:463  finished 383000 of 500000 splits
        : signal: killed
    test.go:630,cluster.go:1486,kv.go:198: Goexit() was called
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/153fcf347cd6a53517f607ec1e6c7c14193640dd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog

The test failed on release-2.1:
    test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052961-kv-splits-nodes-3-quiesce-false -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
        stderr:

        stdout:
        e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script617345990 --project cockroach-ephemeral]
        Output: WARNING: Some requests generated warnings:
         - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.

        ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.

        : exit status 1
        Cleaning up...
        : exit status 1
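
Note: this particular failure is cluster creation running out of GCE capacity, not the test itself. One possible manual workaround (an assumption, not something the harness does automatically) is to recreate the cluster with a different first zone, reusing the flags from the failing invocation above; us-east1-b here is only an illustrative substitute.

# Hypothetical manual retry with the exhausted us-central1-b swapped out.
roachprod create teamcity-1052961-kv-splits-nodes-3-quiesce-false -n 4 \
  --gce-machine-type=n1-standard-4 \
  --gce-zones=us-east1-b,us-west1-b,europe-west2-b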
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/153fcf347cd6a53517f607ec1e6c7c14193640dd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog

The test failed on release-2.1:
    test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052961-kv-splits-nodes-3-quiesce-true -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
        stderr:

        stdout:
        ockroach-ephemeral]
        Output: Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1052961-kv-splits-nodes-3-quiesce-true-0002].
        Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1052961-kv-splits-nodes-3-quiesce-true-0004].
        Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1052961-kv-splits-nodes-3-quiesce-true-0001].
        WARNING: Some requests generated warnings:
         - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.

        ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.

        : exit status 1
        Cleaning up...
        : exit status 1
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/e7dc507fa0ecc7dc5ed597ca5c6cdeb48086428c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1057350&tab=buildLog

The test failed on master:
    test.go:1023: test timed out (2h0m0s)
    test.go:628,cluster.go:1139,kv.go:195,cluster.go:1465,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1057350-kv-splits-nodes-3-quiesce-true:4 -- ./workload run kv --init --max-ops=1 --concurrency=192 --splits=500000 {pgurl:1-3} returned:
        stderr:

        stdout:
        234 631 workload/workload.go:463  finished 464000 of 500000 splits
        I181216 14:52:55.219772 631 workload/workload.go:463  finished 465000 of 500000 splits
        I181216 14:53:04.394082 631 workload/workload.go:463  finished 466000 of 500000 splits
        I181216 14:53:17.281560 631 workload/workload.go:463  finished 467000 of 500000 splits
        I181216 14:53:34.485000 631 workload/workload.go:463  finished 468000 of 500000 splits
        I181216 14:53:44.689738 631 workload/workload.go:463  finished 469000 of 500000 splits
        I181216 14:53:55.897354 631 workload/workload.go:463  finished 470000 of 500000 splits
        I181216 14:54:08.928215 631 workload/workload.go:463  finished 471000 of 500000 splits
        I181216 14:54:21.741356 631 workload/workload.go:463  finished 472000 of 500000 splits
        I181216 14:54:32.213064 631 workload/workload.go:463  finished 473000 of 500000 splits
        I181216 14:54:47.988957 631 workload/workload.go:463  finished 474000 of 500000 splits
        I181216 14:55:11.154051 631 workload/workload.go:463  finished 475000 of 500000 splits
        : signal: killed
    test.go:628,cluster.go:1486,kv.go:198: Goexit() was called
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/ea4c00f5d7c33ece947a080cf56e63a33826565b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1064508&tab=buildLog

The test failed on master:
    test.go:703,test.go:715: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1064508-kv-splits-nodes-3-quiesce-false -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
        stderr:

        stdout:
        64508-kv-splits-nodes-3-quiesce-false-0004.us-central1-b.cockroach-ephemeral    10.128.0.25 35.194.21.54
        Syncing...
        teamcity-1064508-kv-splits-nodes-3-quiesce-false: waiting for nodes to start.........................................................................................................................................
        generating ssh key.
        distributing ssh key..................................................................................................................................
        2: exit status 255
        ~ tar xf -

        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).SetupSSH.func2
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:521
        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1155
        runtime.goexit
            /usr/local/go/src/runtime/asm_amd64.s:2361: 
        2018/12/20 13:06:27 command failed
        : exit status 1
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/ea4c00f5d7c33ece947a080cf56e63a33826565b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/splits/nodes=3/quiesce=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1064508&tab=buildLog

The test failed on master:
    test.go:703,test.go:715: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1064508-kv-splits-nodes-3-quiesce-true -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
        stderr:

        stdout:
        ......................................................................................................................................................
        0: exit status 255
        ~ 
        for i in {1..20}; do
          ssh-keyscan -T 60 -t rsa 10.128.0.28 10.128.0.29 10.128.0.26 10.128.0.27 35.192.116.206 35.188.143.1 35.192.29.58 35.193.111.207 > .ssh/known_hosts.tmp
          if [ "$(cat .ssh/known_hosts.tmp | wc -l)" -eq "8" ]; then
            cat .ssh/known_hosts.tmp >> .ssh/known_hosts
            rm -f .ssh/known_hosts.tmp
            exit 0
          fi
          sleep 1
        done
        exit 1

        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).SetupSSH.func4
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:574
        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1155
        runtime.goexit
            /usr/local/go/src/runtime/asm_amd64.s:2361: 
        2018/12/20 13:15:05 command failed
        : exit status 1
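
Note: this last one is another infra flake, with SetupSSH dying in its keyscan retry loop. A quick manual sanity check (just a sketch, using the same keyscan flags as the script in the output) is to run it by hand against one of the printed addresses and see whether the host answers at all; the address below is taken from this failure and is otherwise arbitrary.

# Manual spot check of a single host from the list above.
ssh-keyscan -T 60 -t rsa 35.192.116.206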