
storage: TestRaftRemoveRace failed under stress #15687

Closed by cockroach-teamcity 7 years ago

cockroach-teamcity commented 7 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/e41b5f3c13f492369d48d09c921b3732209f11e2

Parameters:

TAGS=
GOFLAGS=-race

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=242304&tab=buildLog

I170504 07:23:54.937595 38361 storage/store.go:1250  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I170504 07:23:54.937855 38361 gossip/gossip.go:297  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:44469" > attrs:<> locality:<> 
W170504 07:23:54.958221 38361 gossip/gossip.go:1189  [n2] no incoming or outgoing connections
I170504 07:23:54.960425 38330 gossip/client.go:131  [n2] started gossip client to 127.0.0.1:44469
I170504 07:23:54.967481 38361 storage/store.go:1250  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I170504 07:23:54.967750 38361 gossip/gossip.go:297  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:49614" > attrs:<> locality:<> 
W170504 07:23:54.987297 38361 gossip/gossip.go:1189  [n3] no incoming or outgoing connections
I170504 07:23:54.988720 38334 gossip/client.go:131  [n3] started gossip client to 127.0.0.1:44469
I170504 07:23:55.004060 38361 gossip/gossip.go:297  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:45960" > attrs:<> locality:<> 
W170504 07:23:55.030485 38361 gossip/gossip.go:1189  [n4] no incoming or outgoing connections
I170504 07:23:55.032618 38464 gossip/client.go:131  [n4] started gossip client to 127.0.0.1:44469
I170504 07:23:55.041623 38361 storage/store.go:1250  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I170504 07:23:55.042869 38361 gossip/gossip.go:297  [n4] NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:50907" > attrs:<> locality:<> 
W170504 07:23:55.058939 38361 gossip/gossip.go:1189  [n5] no incoming or outgoing connections
I170504 07:23:55.062977 38687 gossip/client.go:131  [n5] started gossip client to 127.0.0.1:44469
I170504 07:23:55.072121 38797 gossip/server.go:285  [n1] refusing gossip from node 5 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:49614})
I170504 07:23:55.072991 38361 storage/store.go:1250  [n5,s5]: failed initial metrics computation: [n5,s5]: system config not yet available
I170504 07:23:55.074588 38361 gossip/gossip.go:297  [n5] NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:34089" > attrs:<> locality:<> 
I170504 07:23:55.078425 38687 gossip/client.go:136  [n5] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 2 (127.0.0.1:49614)
I170504 07:23:55.080109 38845 gossip/gossip.go:1203  [n5] node has connected to cluster via gossip
I170504 07:23:55.080813 38736 gossip/client.go:131  [n5] started gossip client to 127.0.0.1:49614
W170504 07:23:55.099176 38361 gossip/gossip.go:1189  [n6] no incoming or outgoing connections
I170504 07:23:55.100783 38997 gossip/client.go:131  [n6] started gossip client to 127.0.0.1:44469
I170504 07:23:55.102194 39000 gossip/server.go:285  [n1] refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:45960})
I170504 07:23:55.106861 38997 gossip/client.go:136  [n6] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 3 (127.0.0.1:45960)
I170504 07:23:55.107398 38960 gossip/gossip.go:1203  [n6] node has connected to cluster via gossip
I170504 07:23:55.108186 39053 gossip/client.go:131  [n6] started gossip client to 127.0.0.1:45960
I170504 07:23:55.111786 38361 gossip/gossip.go:297  [n6] NodeDescriptor set to node_id:6 address:<network_field:"tcp" address_field:"127.0.0.1:44189" > attrs:<> locality:<> 
W170504 07:23:55.131453 38361 gossip/gossip.go:1189  [n7] no incoming or outgoing connections
I170504 07:23:55.133928 39102 gossip/server.go:285  [n1] refusing gossip from node 7 (max 3 conns); forwarding to 4 ({tcp 127.0.0.1:50907})
I170504 07:23:55.134397 39154 gossip/client.go:131  [n7] started gossip client to 127.0.0.1:44469
I170504 07:23:55.146237 38361 storage/store.go:1250  [n7,s7]: failed initial metrics computation: [n7,s7]: system config not yet available
I170504 07:23:55.146514 38361 gossip/gossip.go:297  [n7] NodeDescriptor set to node_id:7 address:<network_field:"tcp" address_field:"127.0.0.1:59291" > attrs:<> locality:<> 
I170504 07:23:55.150129 39154 gossip/client.go:136  [n7] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 4 (127.0.0.1:50907)
I170504 07:23:55.150617 39142 gossip/gossip.go:1203  [n7] node has connected to cluster via gossip
I170504 07:23:55.154536 39253 gossip/client.go:131  [n7] started gossip client to 127.0.0.1:50907
W170504 07:23:55.187159 38361 gossip/gossip.go:1189  [n8] no incoming or outgoing connections
I170504 07:23:55.190306 39262 gossip/client.go:131  [n8] started gossip client to 127.0.0.1:44469
I170504 07:23:55.191238 38967 gossip/server.go:285  [n1] refusing gossip from node 8 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:49614})
I170504 07:23:55.202937 39262 gossip/client.go:136  [n8] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 2 (127.0.0.1:49614)
I170504 07:23:55.203602 39306 gossip/gossip.go:1203  [n8] node has connected to cluster via gossip
I170504 07:23:55.204540 39226 gossip/client.go:131  [n8] started gossip client to 127.0.0.1:49614
I170504 07:23:55.211857 38361 gossip/gossip.go:297  [n8] NodeDescriptor set to node_id:8 address:<network_field:"tcp" address_field:"127.0.0.1:51208" > attrs:<> locality:<> 
W170504 07:23:55.234115 38361 gossip/gossip.go:1189  [n9] no incoming or outgoing connections
I170504 07:23:55.240831 39480 gossip/client.go:131  [n9] started gossip client to 127.0.0.1:44469
I170504 07:23:55.242095 39451 gossip/server.go:285  [n1] refusing gossip from node 9 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:45960})
I170504 07:23:55.246354 38361 storage/store.go:1250  [n9,s9]: failed initial metrics computation: [n9,s9]: system config not yet available
I170504 07:23:55.246665 38361 gossip/gossip.go:297  [n9] NodeDescriptor set to node_id:9 address:<network_field:"tcp" address_field:"127.0.0.1:58810" > attrs:<> locality:<> 
I170504 07:23:55.260871 39480 gossip/client.go:136  [n9] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 3 (127.0.0.1:45960)
I170504 07:23:55.261467 39497 gossip/gossip.go:1203  [n9] node has connected to cluster via gossip
I170504 07:23:55.262221 39483 gossip/client.go:131  [n9] started gossip client to 127.0.0.1:45960
W170504 07:23:55.276586 38361 gossip/gossip.go:1189  [n10] no incoming or outgoing connections
I170504 07:23:55.283615 39682 gossip/client.go:131  [n10] started gossip client to 127.0.0.1:44469
I170504 07:23:55.284846 39714 gossip/server.go:285  [n1] refusing gossip from node 10 (max 3 conns); forwarding to 3 ({tcp 127.0.0.1:45960})
I170504 07:23:55.290780 38361 storage/store.go:1250  [n10,s10]: failed initial metrics computation: [n10,s10]: system config not yet available
I170504 07:23:55.291059 38361 gossip/gossip.go:297  [n10] NodeDescriptor set to node_id:10 address:<network_field:"tcp" address_field:"127.0.0.1:53797" > attrs:<> locality:<> 
I170504 07:23:55.299642 39682 gossip/client.go:136  [n10] closing client to node 1 (127.0.0.1:44469): received forward from node 1 to 3 (127.0.0.1:45960)
I170504 07:23:55.301447 39329 gossip/gossip.go:1203  [n10] node has connected to cluster via gossip
I170504 07:23:55.302745 39584 gossip/client.go:131  [n10] started gossip client to 127.0.0.1:45960
I170504 07:23:55.322774 39797 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot c048866b at index 22
I170504 07:23:55.328551 39797 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n2,s2):?: kv pairs: 40, log entries: 12, rate-limit: 2.0 MiB/sec, 5ms
I170504 07:23:55.329077 39776 storage/replica_raftstorage.go:595  [s2,r1/?:{-}] applying preemptive snapshot at index 22 (id=c048866b, encoded size=8987, 1 rocksdb batches, 12 log entries)
I170504 07:23:55.333536 39776 storage/replica_raftstorage.go:603  [s2,r1/?:/M{in-ax}] applied preemptive snapshot in 4ms [clear=0ms batch=0ms entries=3ms commit=0ms]
I170504 07:23:55.336317 39797 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r1:/M{in-ax} [(n1,s1):1, next=2]
I170504 07:23:55.342635 39847 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n2,s2):2: [(n1,s1):1 (n2,s2):2]
I170504 07:23:55.354519 39615 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot b28371e1 at index 24
I170504 07:23:55.357427 39615 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 43, log entries: 14, rate-limit: 2.0 MiB/sec, 2ms
I170504 07:23:55.358404 39799 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 24 (id=b28371e1, encoded size=10547, 1 rocksdb batches, 14 log entries)
I170504 07:23:55.360690 39799 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=1ms commit=0ms]
I170504 07:23:55.362920 39615 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):3): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, next=3]
I170504 07:23:55.369054 39727 storage/raft_transport.go:436  raft transport stream to node 1 established
I170504 07:23:55.376869 39804 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):3: [(n1,s1):1 (n2,s2):2 (n3,s3):3]
I170504 07:23:55.386969 39641 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 497b69a2 at index 26
I170504 07:23:55.390649 39641 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n4,s4):?: kv pairs: 46, log entries: 16, rate-limit: 2.0 MiB/sec, 2ms
I170504 07:23:55.391701 39838 storage/replica_raftstorage.go:595  [s4,r1/?:{-}] applying preemptive snapshot at index 26 (id=497b69a2, encoded size=12171, 1 rocksdb batches, 16 log entries)
I170504 07:23:55.393925 39838 storage/replica_raftstorage.go:603  [s4,r1/?:/M{in-ax}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=1ms commit=0ms]
I170504 07:23:55.396258 39641 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n4,s4):4): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4]
I170504 07:23:55.404403 39922 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n4,s4):4: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4]
I170504 07:23:55.417521 39872 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 981bc02d at index 28
I170504 07:23:55.420360 39872 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n5,s5):?: kv pairs: 49, log entries: 18, rate-limit: 2.0 MiB/sec, 2ms
I170504 07:23:55.432876 39938 storage/replica_raftstorage.go:595  [s5,r1/?:{-}] applying preemptive snapshot at index 28 (id=981bc02d, encoded size=13859, 1 rocksdb batches, 18 log entries)
I170504 07:23:55.435252 39938 storage/replica_raftstorage.go:603  [s5,r1/?:/M{in-ax}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=1ms commit=0ms]
I170504 07:23:55.438679 39872 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n5,s5):5): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5]
I170504 07:23:55.451187 39911 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n5,s5):5: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5]
I170504 07:23:55.463323 39916 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot df0f35a4 at index 30
I170504 07:23:55.467124 39916 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n6,s6):?: kv pairs: 52, log entries: 20, rate-limit: 2.0 MiB/sec, 3ms
I170504 07:23:55.469002 39924 storage/replica_raftstorage.go:595  [s6,r1/?:{-}] applying preemptive snapshot at index 30 (id=df0f35a4, encoded size=15611, 1 rocksdb batches, 20 log entries)
I170504 07:23:55.471963 39924 storage/replica_raftstorage.go:603  [s6,r1/?:/M{in-ax}] applied preemptive snapshot in 3ms [clear=0ms batch=0ms entries=2ms commit=0ms]
I170504 07:23:55.474666 39916 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n6,s6):6): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, next=6]
I170504 07:23:55.488004 39905 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n6,s6):6: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5 (n6,s6):6]
I170504 07:23:55.507521 39990 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 0e0e2949 at index 32
I170504 07:23:55.512414 39990 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n7,s7):?: kv pairs: 55, log entries: 22, rate-limit: 2.0 MiB/sec, 4ms
I170504 07:23:55.513501 39946 storage/replica_raftstorage.go:595  [s7,r1/?:{-}] applying preemptive snapshot at index 32 (id=0e0e2949, encoded size=17427, 1 rocksdb batches, 22 log entries)
I170504 07:23:55.516058 39946 storage/replica_raftstorage.go:603  [s7,r1/?:/M{in-ax}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=1ms commit=0ms]
I170504 07:23:55.518834 39990 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n7,s7):7): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, (n6,s6):6, next=7]
I170504 07:23:55.529380 40035 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n7,s7):7: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7]
I170504 07:23:55.545146 40010 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 01588ee0 at index 34
I170504 07:23:55.548941 40010 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n8,s8):?: kv pairs: 58, log entries: 24, rate-limit: 2.0 MiB/sec, 3ms
I170504 07:23:55.550175 40024 storage/replica_raftstorage.go:595  [s8,r1/?:{-}] applying preemptive snapshot at index 34 (id=01588ee0, encoded size=19307, 1 rocksdb batches, 24 log entries)
I170504 07:23:55.552912 40024 storage/replica_raftstorage.go:603  [s8,r1/?:/M{in-ax}] applied preemptive snapshot in 3ms [clear=0ms batch=0ms entries=2ms commit=0ms]
I170504 07:23:55.555209 40010 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n8,s8):8): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, next=8]
I170504 07:23:55.565170 39975 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n8,s8):8: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8]
I170504 07:23:55.589083 40016 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 6bdabacb at index 36
I170504 07:23:55.596990 40016 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n9,s9):?: kv pairs: 61, log entries: 26, rate-limit: 2.0 MiB/sec, 7ms
I170504 07:23:55.598400 39999 storage/replica_raftstorage.go:595  [s9,r1/?:{-}] applying preemptive snapshot at index 36 (id=6bdabacb, encoded size=21247, 1 rocksdb batches, 26 log entries)
I170504 07:23:55.601745 39999 storage/replica_raftstorage.go:603  [s9,r1/?:/M{in-ax}] applied preemptive snapshot in 3ms [clear=0ms batch=0ms entries=2ms commit=0ms]
I170504 07:23:55.604569 40016 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n9,s9):9): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, next=9]
I170504 07:23:55.634484 39981 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n9,s9):9: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:55.661627 40030 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot f68581e1 at index 38
I170504 07:23:55.667724 40030 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n10,s10):?: kv pairs: 64, log entries: 28, rate-limit: 2.0 MiB/sec, 5ms
I170504 07:23:55.711751 40046 storage/replica_raftstorage.go:595  [s10,r1/?:{-}] applying preemptive snapshot at index 38 (id=f68581e1, encoded size=23255, 1 rocksdb batches, 28 log entries)
I170504 07:23:55.728247 40046 storage/replica_raftstorage.go:603  [s10,r1/?:/M{in-ax}] applied preemptive snapshot in 16ms [clear=0ms batch=0ms entries=15ms commit=0ms]
I170504 07:23:55.732416 40030 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n10,s10):10): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=10]
I170504 07:23:55.767686 40121 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n10,s10):10: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n10,s10):10]
I170504 07:23:55.898208 40160 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):3): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n10,s10):10, next=11]
I170504 07:23:55.990190 40291 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):3: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:56.010319 39729 storage/store.go:3149  [s3,r1/3:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:23:56.042444 38633 storage/store.go:2137  [replicaGC,s3,r1/3:/M{in-ax}] removing replica
I170504 07:23:56.044205 38633 storage/replica.go:673  [replicaGC,s3,r1/3:/M{in-ax}] removed 40 (32+8) keys in 0ms [clear=0ms commit=0ms]
I170504 07:23:56.047113 40308 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 95d85a97 at index 51
I170504 07:23:56.063070 40308 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 79, log entries: 41, rate-limit: 2.0 MiB/sec, 7ms
I170504 07:23:56.069975 40332 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 51 (id=95d85a97, encoded size=31727, 1 rocksdb batches, 41 log entries)
I170504 07:23:56.096093 40332 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 25ms [clear=0ms batch=0ms entries=6ms commit=1ms]
I170504 07:23:56.101390 40308 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):11): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=11]
I170504 07:23:56.129251 40314 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):11: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):11]
I170504 07:23:56.224585 40345 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):11): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):11, next=12]
I170504 07:23:56.243406 40354 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):11: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:56.286148 40374 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot c0c88c90 at index 56
I170504 07:23:56.291311 39729 storage/store.go:3149  [s3,r1/11:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:23:56.293779 38633 storage/store.go:2137  [replicaGC,s3,r1/11:/M{in-ax}] removing replica
I170504 07:23:56.295274 38633 storage/replica.go:673  [replicaGC,s3,r1/11:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=1ms]
I170504 07:23:56.301186 40374 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 86, log entries: 46, rate-limit: 2.0 MiB/sec, 12ms
I170504 07:23:56.304893 40377 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 56 (id=c0c88c90, encoded size=36295, 1 rocksdb batches, 46 log entries)
I170504 07:23:56.309510 40377 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 4ms [clear=0ms batch=0ms entries=3ms commit=1ms]
I170504 07:23:56.313223 40374 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):12): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=12]
I170504 07:23:56.335328 40386 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):12: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):12]
I170504 07:23:56.461124 40452 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):12): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):12, next=13]
I170504 07:23:56.595702 40487 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):12: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:56.715623 39729 storage/store.go:3149  [s3,r1/12:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:23:56.737213 38633 storage/store.go:2137  [replicaGC,s3,r1/12:/M{in-ax}] removing replica
I170504 07:23:56.754072 38633 storage/replica.go:673  [replicaGC,s3,r1/12:/M{in-ax}] removed 40 (32+8) keys in 16ms [clear=16ms commit=1ms]
I170504 07:23:56.756067 40497 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot e38fa63c at index 65
I170504 07:23:56.768292 40497 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 97, log entries: 55, rate-limit: 2.0 MiB/sec, 11ms
I170504 07:23:56.776343 40479 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 65 (id=e38fa63c, encoded size=42816, 1 rocksdb batches, 55 log entries)
E170504 07:23:56.803851 39729 storage/store.go:3147  [s3,r1/?:{-}] unable to add to replica GC queue: replica not initialized
I170504 07:23:56.820392 40479 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 42ms [clear=0ms batch=0ms entries=36ms commit=1ms]
I170504 07:23:56.829707 40497 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):13): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=13]
I170504 07:23:56.908239 40642 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):13: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):13]
I170504 07:23:57.082675 40631 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):13): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):13, next=14]
I170504 07:23:57.108085 40613 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):13: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:57.194736 40559 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 46255ec3 at index 75
I170504 07:23:57.196915 38633 storage/store.go:2137  [replicaGC,s3,r1/13:/M{in-ax}] removing replica
I170504 07:23:57.198191 38633 storage/replica.go:673  [replicaGC,s3,r1/13:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=1ms]
I170504 07:23:57.235775 40559 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 109, log entries: 65, rate-limit: 2.0 MiB/sec, 35ms
I170504 07:23:57.260620 40561 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 75 (id=46255ec3, encoded size=49824, 1 rocksdb batches, 65 log entries)
I170504 07:23:57.267504 40561 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 6ms [clear=0ms batch=0ms entries=4ms commit=1ms]
I170504 07:23:57.295479 40559 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):14): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=14]
I170504 07:23:57.496515 40681 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):14: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):14]
I170504 07:23:57.688143 40799 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):14): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):14, next=15]
I170504 07:23:57.761752 40716 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):14: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:57.821211 39729 storage/store.go:3149  [s3,r1/14:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:23:57.827296 38633 storage/store.go:2137  [replicaGC,s3,r1/14:/M{in-ax}] removing replica
I170504 07:23:57.828612 38633 storage/replica.go:673  [replicaGC,s3,r1/14:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=0ms]
I170504 07:23:57.839896 40784 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot c0d5d8c3 at index 88
I170504 07:23:57.947928 40784 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 124, log entries: 78, rate-limit: 2.0 MiB/sec, 95ms
I170504 07:23:57.956265 40898 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 88 (id=c0d5d8c3, encoded size=58297, 1 rocksdb batches, 78 log entries)
I170504 07:23:57.990180 40898 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 34ms [clear=24ms batch=0ms entries=8ms commit=1ms]
I170504 07:23:57.999949 40784 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):15): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=15]
I170504 07:23:58.514724 40927 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):15: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):15]
I170504 07:23:58.790374 41039 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):15): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):15, next=16]
I170504 07:23:59.210945 41109 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):15: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:23:59.342835 39729 storage/store.go:3149  [s3,r1/15:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:23:59.362733 41144 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot c6b276e3 at index 112
I170504 07:23:59.428431 38633 storage/store.go:2137  [replicaGC,s3,r1/15:/M{in-ax}] removing replica
I170504 07:23:59.508554 38633 storage/replica.go:673  [replicaGC,s3,r1/15:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=0ms]
I170504 07:23:59.521670 41266 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot dedd554e at index 112
I170504 07:23:59.626339 41266 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 150, log entries: 102, rate-limit: 2.0 MiB/sec, 104ms
I170504 07:23:59.643498 41268 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 112 (id=dedd554e, encoded size=72137, 1 rocksdb batches, 102 log entries)
I170504 07:23:59.738441 41268 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 95ms [clear=0ms batch=0ms entries=68ms commit=1ms]
I170504 07:23:59.745070 41266 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):16): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=16]
I170504 07:24:00.148095 41397 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):16: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):16]
I170504 07:24:00.722692 41473 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):16): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):16, next=17]
I170504 07:24:01.702537 41547 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):16: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:24:01.822246 38633 storage/store.go:2137  [replicaGC,s3,r1/16:/M{in-ax}] removing replica
I170504 07:24:01.891647 38633 storage/replica.go:673  [replicaGC,s3,r1/16:/M{in-ax}] removed 40 (32+8) keys in 69ms [clear=69ms commit=0ms]
I170504 07:24:01.893635 41808 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot b408cccb at index 147
I170504 07:24:01.902209 41808 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 186, log entries: 16, rate-limit: 2.0 MiB/sec, 5ms
I170504 07:24:01.905575 41726 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 147 (id=b408cccb, encoded size=24881, 1 rocksdb batches, 16 log entries)
I170504 07:24:01.908749 41726 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 3ms [clear=0ms batch=0ms entries=1ms commit=1ms]
I170504 07:24:01.920524 41808 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):17): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=17]
I170504 07:24:01.968073 39729 storage/store.go:3149  [s3,r1/?:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:24:01.970487 39729 storage/store.go:3149  [s3,r1/?:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:24:01.974273 38633 storage/store.go:2137  [replicaGC,s3,r1/?:/M{in-ax}] removing replica
I170504 07:24:01.975366 38633 storage/replica.go:673  [replicaGC,s3,r1/?:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=0ms]
I170504 07:24:03.155410 42020 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):17: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):17]
I170504 07:24:03.610108 38454 storage/replica_raftstorage.go:415  [raftsnapshot,s1,r1/1:/M{in-ax}] generated Raft snapshot 0a5dfe9f at index 175
I170504 07:24:03.619784 38454 storage/store.go:3342  [raftsnapshot,s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):17: kv pairs: 215, log entries: 44, rate-limit: 8.0 MiB/sec, 9ms
I170504 07:24:03.622125 42154 storage/replica_raftstorage.go:595  [s3,r1/17:{-}] applying Raft snapshot at index 175 (id=0a5dfe9f, encoded size=39697, 1 rocksdb batches, 44 log entries)
I170504 07:24:03.642543 42154 storage/replica_raftstorage.go:603  [s3,r1/17:/M{in-ax}] applied Raft snapshot in 20ms [clear=0ms batch=0ms entries=17ms commit=2ms]
W170504 07:24:03.824287 42404 storage/replica.go:1684  [hb,s10,r1/10:/M{in-ax}] context canceled before command queue: [txn: 348aa2cc], BeginTransaction [/System/NodeLiveness/10,/Min), ConditionalPut [/System/NodeLiveness/10,/Min), EndTransaction [/System/NodeLiveness/10,/Min)
I170504 07:24:04.050876 42308 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):17): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):17, next=18]
I170504 07:24:05.300339 42522 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):17: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:24:05.405813 39729 storage/store.go:3149  [s3,r1/17:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:24:05.412870 38633 storage/store.go:2137  [replicaGC,s3,r1/17:/M{in-ax}] removing replica
I170504 07:24:05.424606 38633 storage/replica.go:673  [replicaGC,s3,r1/17:/M{in-ax}] removed 40 (32+8) keys in 5ms [clear=5ms commit=0ms]
I170504 07:24:05.427260 42727 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 1462c4c4 at index 201
I170504 07:24:05.520431 42727 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 242, log entries: 70, rate-limit: 2.0 MiB/sec, 92ms
I170504 07:24:05.531316 42689 storage/replica_raftstorage.go:595  [s3,r1/?:{-}] applying preemptive snapshot at index 201 (id=1462c4c4, encoded size=53469, 1 rocksdb batches, 70 log entries)
I170504 07:24:05.538077 42689 storage/replica_raftstorage.go:603  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 7ms [clear=0ms batch=0ms entries=4ms commit=1ms]
I170504 07:24:05.625160 39729 storage/store.go:3149  [s3,r1/?:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:24:05.630885 38633 storage/store.go:2137  [replicaGC,s3,r1/?:/M{in-ax}] removing replica
I170504 07:24:05.653013 38633 storage/replica.go:673  [replicaGC,s3,r1/?:/M{in-ax}] removed 40 (32+8) keys in 1ms [clear=0ms commit=0ms]
I170504 07:24:05.678607 42727 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):18): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=18]
I170504 07:24:06.692249 42932 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):18: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9 (n3,s3):18]
I170504 07:24:06.888398 38454 storage/replica_raftstorage.go:415  [raftsnapshot,s1,r1/1:/M{in-ax}] generated Raft snapshot 903f8983 at index 223
I170504 07:24:06.896013 38454 storage/store.go:3342  [raftsnapshot,s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):18: kv pairs: 265, log entries: 92, rate-limit: 8.0 MiB/sec, 7ms
I170504 07:24:06.912316 43130 storage/replica_raftstorage.go:595  [s3,r1/18:{-}] applying Raft snapshot at index 223 (id=903f8983, encoded size=65345, 1 rocksdb batches, 92 log entries)
I170504 07:24:06.945031 43130 storage/replica_raftstorage.go:603  [s3,r1/18:/M{in-ax}] applied Raft snapshot in 32ms [clear=0ms batch=0ms entries=30ms commit=1ms]
I170504 07:24:07.012639 43037 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (REMOVE_REPLICA (n3,s3):18): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, (n3,s3):18, next=19]
I170504 07:24:07.653741 43287 storage/replica.go:2600  [s1,r1/1:/M{in-ax}] proposing REMOVE_REPLICA (n3,s3):18: [(n1,s1):1 (n2,s2):2 (n10,s10):10 (n4,s4):4 (n5,s5):5 (n6,s6):6 (n7,s7):7 (n8,s8):8 (n9,s9):9]
I170504 07:24:07.884314 43379 storage/replica_raftstorage.go:415  [s1,r1/1:/M{in-ax}] generated preemptive snapshot da3f9085 at index 238
I170504 07:24:07.900956 43379 storage/store.go:3342  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 280, log entries: 4, rate-limit: 2.0 MiB/sec, 14ms
I170504 07:24:07.932491 39729 storage/store.go:3149  [s3,r1/18:/M{in-ax}] added to replica GC queue (peer suggestion)
I170504 07:24:07.935910 38633 storage/store.go:2137  [replicaGC,s3,r1/18:/M{in-ax}] removing replica
I170504 07:24:07.957596 38633 storage/replica.go:673  [replicaGC,s3,r1/18:/M{in-ax}] removed 40 (32+8) keys in 7ms [clear=0ms commit=7ms]
I170504 07:24:08.208158 43379 storage/replica_command.go:3537  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):19): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n10,s10):10, (n4,s4):4, (n5,s5):5, (n6,s6):6, (n7,s7):7, (n8,s8):8, (n9,s9):9, next=19]
W170504 07:24:08.378010 37856 storage/node_liveness.go:291  [hb] slow heartbeat took 1.0s
W170504 07:24:08.544903 39114 storage/node_liveness.go:291  [hb] slow heartbeat took 1.2s
W170504 07:24:08.579623 38802 storage/node_liveness.go:291  [hb] slow heartbeat took 1.2s
W170504 07:24:08.709220 38579 storage/node_liveness.go:291  [hb] slow heartbeat took 1.5s
W170504 07:24:09.143717 39222 storage/node_liveness.go:291  [hb] slow heartbeat took 1.0s
W170504 07:24:09.435217 37856 storage/node_liveness.go:291  [hb] slow heartbeat took 1.1s
W170504 07:24:09.487704 43565 storage/replica.go:1730  [s1,r1/1:/M{in-ax}] context canceled while in command queue: [txn: 1943d8c7], Put [/Meta2/Max,/Min), Put [/Meta1/Max,/Min), EndTransaction [/Local/Range/""/RangeDescriptor,/Min)
W170504 07:24:09.506394 43379 internal/client/txn.go:460  [s1,r1/1:/M{in-ax}] failure aborting transaction: context canceled; abort caused by: context canceled
W170504 07:24:09.554966 39716 storage/node_liveness.go:291  [hb] slow heartbeat took 1.3s
I170504 07:24:09.564631 43795 util/stop/stopper.go:505  quiescing; tasks left:
2      kv/txn_coord_sender.go:752
13     storage/client_test.go:510
I170504 07:24:09.567714 43795 util/stop/stopper.go:505  quiescing; tasks left:
8      storage/client_test.go:510
1      kv/txn_coord_sender.go:752
I170504 07:24:09.567822 37856 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
I170504 07:24:09.569678 39716 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
I170504 07:24:09.571000 38457 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
W170504 07:24:09.571588 38457 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
I170504 07:24:09.572316 38579 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
W170504 07:24:09.572893 38579 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
W170504 07:24:09.572965 39716 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
I170504 07:24:09.573114 43795 util/stop/stopper.go:505  quiescing; tasks left:
7      storage/client_test.go:510
1      kv/txn_coord_sender.go:752
W170504 07:24:09.573255 37856 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
W170504 07:24:09.573611 43541 storage/replica.go:2287  [hb,s1,r1/1:/M{in-ax}] shutdown cancellation after 1.0s of attempting command [txn: 4863cfe9], BeginTransaction [/System/NodeLiveness/6,/Min), ConditionalPut [/System/NodeLiveness/6,/Min), EndTransaction [/System/NodeLiveness/6,/Min)
I170504 07:24:09.573902 43795 util/stop/stopper.go:505  quiescing; tasks left:
5      storage/client_test.go:510
1      kv/txn_coord_sender.go:752
I170504 07:24:09.581310 43795 util/stop/stopper.go:505  quiescing; tasks left:
2      storage/client_test.go:510
1      kv/txn_coord_sender.go:752
I170504 07:24:09.582212 39222 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
W170504 07:24:09.582905 39222 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
I170504 07:24:09.582987 39445 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
I170504 07:24:09.662506 39572 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (sending to all 9 replicas failed; last error: failed to send RPC: store is stopped, but RPC failure may have masked txn commit); retrying
W170504 07:24:09.662625 43518 storage/replica.go:2287  [hb,s1,r1/1:/M{in-ax}] shutdown cancellation after 0.9s of attempting command [txn: b10f94cc], BeginTransaction [/System/NodeLiveness/4,/Min), ConditionalPut [/System/NodeLiveness/4,/Min), EndTransaction [/System/NodeLiveness/4,/Min)
W170504 07:24:09.663048 39445 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
I170504 07:24:09.663194 43795 util/stop/stopper.go:505  quiescing; tasks left:
2      storage/client_test.go:510
W170504 07:24:09.663275 39572 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
W170504 07:24:09.663383 39114 storage/node_liveness.go:291  [hb] slow heartbeat took 1.0s
I170504 07:24:09.663559 39114 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (server shutdown); retrying
W170504 07:24:09.664318 39114 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
W170504 07:24:09.664432 38802 storage/node_liveness.go:291  [hb] slow heartbeat took 1.1s
I170504 07:24:09.664577 38802 storage/node_liveness.go:352  [hb] heartbeat result is ambiguous (server shutdown); retrying
W170504 07:24:09.665169 38802 storage/node_liveness.go:253  [hb] failed node liveness heartbeat: node unavailable; try another peer
W170504 07:24:09.721445 39855 storage/raft_transport.go:257  unable to accept Raft message from (n7,s7):7: no handler registered for (n1,s1):1
W170504 07:24:09.721851 39855 storage/raft_transport.go:257  unable to accept Raft message from (n7,s7):7: no handler registered for (n1,s1):1
W170504 07:24:09.722176 39729 storage/store.go:3153  [s7] raft error: node 1 claims to not contain store 1 for replica (n1,s1):1: store 1 was not found
W170504 07:24:09.722469 39727 storage/raft_transport.go:442  raft transport stream to node 1 failed: store 1 was not found
I170504 07:24:09.912445 38955 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:44189->127.0.0.1:60111: use of closed network connection
I170504 07:24:09.913552 38833 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.917164 39314 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:51208->127.0.0.1:41712: use of closed network connection
I170504 07:24:09.917878 39008 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:59291->127.0.0.1:45829: use of closed network connection
I170504 07:24:09.918310 38462 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50907->127.0.0.1:50339: use of closed network connection
I170504 07:24:09.924323 38137 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:49614->127.0.0.1:43978: use of closed network connection
I170504 07:24:09.924453 38696 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:50339->127.0.0.1:50907: read: connection reset by peer.
I170504 07:24:09.924568 39055 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.924693 39299 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.925086 38491 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:45960->127.0.0.1:37819: use of closed network connection
W170504 07:24:09.947277 39329 gossip/gossip.go:1189  [n10] no incoming or outgoing connections
I170504 07:24:09.947538 38486 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.949194 38489 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:49614: getsockopt: connection refused"; Reconnecting to {127.0.0.1:49614 <nil>}
I170504 07:24:09.964014 38588 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.968063 43828 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:60233->127.0.0.1:44189: read: connection reset by peer.
I170504 07:24:09.968845 38591 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:45960: operation was canceled"; Reconnecting to {127.0.0.1:45960 <nil>}
I170504 07:24:09.973543 38591 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I170504 07:24:09.975743 39138 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:59291: getsockopt: connection refused"; Reconnecting to {127.0.0.1:59291 <nil>}
I170504 07:24:09.975883 39390 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:53797->127.0.0.1:54368: use of closed network connection
W170504 07:24:09.976058 39965 storage/raft_transport.go:442  raft transport stream to node 7 failed: rpc error: code = Internal desc = transport is closing
I170504 07:24:09.976270 39466 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.978879 38996 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:44189: getsockopt: connection refused"; Reconnecting to {127.0.0.1:44189 <nil>}
I170504 07:24:09.979211 38733 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:34089->127.0.0.1:37053: use of closed network connection
I170504 07:24:09.979395 38323 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:44469->127.0.0.1:55617: use of closed network connection
I170504 07:24:09.979531 38369 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.980705 39302 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:51208: getsockopt: connection refused"; Reconnecting to {127.0.0.1:51208 <nil>}
I170504 07:24:09.980842 38838 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.980951 39383 vendor/google.golang.org/grpc/transport/http2_server.go:323  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:58810->127.0.0.1:49890: use of closed network connection
I170504 07:24:09.981084 39490 vendor/google.golang.org/grpc/transport/http2_client.go:1231  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170504 07:24:09.981690 38841 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:34089: operation was canceled"; Reconnecting to {127.0.0.1:34089 <nil>}
I170504 07:24:09.981801 38699 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:50907: getsockopt: connection refused"; Reconnecting to {127.0.0.1:50907 <nil>}
I170504 07:24:09.983029 38841 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W170504 07:24:09.983117 39956 storage/raft_transport.go:442  raft transport stream to node 6 failed: rpc error: code = Internal desc = transport is closing
I170504 07:24:09.983475 39302 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.983765 39493 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:58810: operation was canceled"; Reconnecting to {127.0.0.1:58810 <nil>}
I170504 07:24:09.983963 39493 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I170504 07:24:09.984053 38372 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:44469: getsockopt: connection refused"; Reconnecting to {127.0.0.1:44469 <nil>}
I170504 07:24:09.985688 38996 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.988278 38699 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.988437 38489 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.988531 38372 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.988661 39138 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: context canceled
I170504 07:24:09.994130 39469 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:53797: getsockopt: connection refused"; Reconnecting to {127.0.0.1:53797 <nil>}
I170504 07:24:09.994302 39469 vendor/google.golang.org/grpc/clientconn.go:906  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
    client_test.go:1027: [n1,s1,r1/1:/M{in-ax}]: pending snapshot already present: 238
a-robinson commented 7 years ago

This doesn't look particularly scary, but it does appear to be new. I'll be curious to see whether it happens again.

tamird commented 7 years ago

Bisected this to f67a508828a3faf667df1c2ebb72a77c23f2f525! Reproduces quite readily on my azworker with build/builder.sh make stress PKG=./pkg/storage TESTS=TestRaftRemoveRace STRESSFLAGS='-p 200'.

cockroach-teamcity commented 7 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/4e68800f6b47c22e142b48fdb390a0e5d249a8cf

Parameters:

TAGS=
GOFLAGS=-race

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=253814&tab=buildLog

cockroach-teamcity commented 7 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/11987dc7e2f20f6417925c638f45c2067435b17f

Parameters:

TAGS=
GOFLAGS=-race

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=256725&tab=buildLog

cockroach-teamcity commented 7 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/7fb6d983651d77cbd200f3553ce842a82fcf30ff

Parameters:

TAGS=
GOFLAGS=-race

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=257860&tab=buildLog

tamird commented 7 years ago

[03:26:27][Step 2/2] client_test.go:1027: [n1,s1,r1/1:/M{in-ax}]: can't set pending snapshot index to 1; pending snapshot already present: 141

tamird commented 7 years ago

Reassigning to @petermattis

petermattis commented 7 years ago

Reproduces readily enough:

~ make stressrace PKG=./storage/ TESTS=TestRaftRemoveRace
...
--- FAIL: TestRaftRemoveRace (8.56s)
    client_test.go:1027: [n1,s1,r1/1:/M{in-ax}]: can't set pending snapshot index to 1; pending snapshot already present: 119
FAIL

ERROR: exit status 1

9 runs completed, 1 failures, over 1m0s
FAIL
petermattis commented 7 years ago

I'm seeing what appears to be an incorrect retry. The test is executing a series of AdminChangeReplicas operations, adding and removing the same replica over and over. The range is replicated to 10 nodes. At some point, the AdminChangeReplicas operation takes too long and we see:

I170526 14:05:54.162334 9 kv/dist_sender.go:1155  r1: sending batch 1 AdmChangeReplicas to (n1,s1):1
I170526 14:05:54.679806 9 kv/dist_sender.go:1180  timeout, trying next peer: (n2,s2):2
I170526 14:05:54.796071 9 kv/dist_sender.go:1312  error, trying next peer: (n10,s10):10
I170526 14:05:54.797355 9 kv/dist_sender.go:1312  error, trying next peer: (n4,s4):4
I170526 14:05:54.798431 9 kv/dist_sender.go:1312  error, trying next peer: (n5,s5):5
I170526 14:05:54.799595 9 kv/dist_sender.go:1312  error, trying next peer: (n6,s6):6
I170526 14:05:54.829640 9 kv/dist_sender.go:1312  error, trying next peer: (n7,s7):7
I170526 14:05:54.888469 9 kv/dist_sender.go:1312  error, trying next peer: (n8,s8):8
I170526 14:05:54.921464 9 kv/dist_sender.go:1312  error, trying next peer: (n9,s9):9
I170526 14:05:54.922616 9 kv/dist_sender.go:1312  error, trying next peer: (n3,s3):13

Not shown is that the last error is RangeNotFound. This causes DistSender.sendToReplicas to return (even though there is still a pending RPC to n1). DistSender.sendPartialBatch then merrily retries the AdminChangeReplicas and we send the same operation to n1 again and hit the above error.

Two things seem suspicious here:

  1. The 500ms SendNextTimeout is way too short for something like AdminChangeReplicas.
  2. DistSender.sendToReplicas shouldn't return on the first RangeNotFoundError if there is still an outstanding RPC (see the sketch below).
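
In sketch form, that loop behaves something like this (hypothetical types and names, not the actual DistSender code):

```go
package sketch

import (
	"errors"
	"time"
)

// Hypothetical stand-ins for the real types; illustrative only.
type replica struct{ id int }

type result struct{ err error }

var errRangeNotFound = errors.New("range not found")

// sendToReplicas sketches the speculative-retry loop. The bug: the first
// RangeNotFound result is treated as terminal and returned immediately,
// abandoning RPCs still in flight (e.g. the slow one to n1), so the
// caller (sendPartialBatch in the real code) retries the whole
// AdminChangeReplicas from scratch.
func sendToReplicas(replicas []replica, send func(replica) <-chan result) error {
	const sendNextTimeout = 500 * time.Millisecond // far too short for AdminChangeReplicas

	responses := make(chan result, len(replicas))
	try := func(r replica) {
		ch := send(r)
		go func() { responses <- <-ch }()
	}

	try(replicas[0])
	next, pending := 1, 1
	for {
		select {
		case <-time.After(sendNextTimeout):
			if next < len(replicas) {
				try(replicas[next]) // "timeout, trying next peer"
				next++
				pending++
			}
		case res := <-responses:
			pending--
			if res.err == nil {
				return nil // some replica succeeded
			}
			if errors.Is(res.err, errRangeNotFound) {
				// BUG: returns while `pending` RPCs are still outstanding.
				return res.err
			}
			if pending == 0 && next == len(replicas) {
				return res.err // every replica genuinely failed
			}
		}
	}
}
```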

@spencerkimball This is in your area of expertise (or, at least, you touched it recently).

@tamird Weren't you looking at the DistSender timeouts recently?

petermattis commented 7 years ago

Also @bdarnell, since you've had thoughts about timeouts recently.

tamird commented 7 years ago

I did look at the timeouts, but ended up only increasing the abandonment timeout (#16088), which shouldn't affect this scenario given your description above.

There seems to be growing doubt about the usefulness of SendNextTimeout, and in general the idea of sending RPCs concurrently to multiple replicas. It's pretty subtle to get right (and we haven't), especially in the absence of compiler assistance (e.g. pattern matching exhaustiveness checking would force us to deal with every error type). Perhaps we should prototype removing these concurrent RPCs and measure the effect on performance (I'd expect tail latency to suffer somewhat).
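
To illustrate the compiler-assistance point, a hypothetical sketch (not our actual error handling): a Go type switch over error types compiles happily with a case missing, so a newly introduced error type silently falls into the default arm.

```go
package sketch

import "fmt"

// Hypothetical error taxonomy; the real DistSender distinguishes more.
type sendError struct{ msg string }
type rangeNotFoundError struct{ rangeID int }

func (e *sendError) Error() string          { return e.msg }
func (e *rangeNotFoundError) Error() string { return fmt.Sprintf("r%d was not found", e.rangeID) }

// classify decides what the send loop does with a per-replica error.
func classify(err error) string {
	switch err.(type) {
	case *sendError:
		return "retry the same replica"
	case *rangeNotFoundError:
		return "try the next replica"
	default:
		// Nothing forces this switch to be exhaustive: add a new error
		// type elsewhere and the compiler stays silent, routing it here.
		// Pattern-matching exhaustiveness checking would flag every
		// switch that needs updating.
		return "treat as terminal"
	}
}
```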

tamird commented 7 years ago

Ah, actually this might have been exacerbated by #16088, which also introduced not waiting on outstanding RPCs when there's no EndTransaction in the batch (https://github.com/cockroachdb/cockroach/pull/16088#discussion_r118290624).
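
Roughly, the behavior that change introduced looks like this (a sketch with invented names, not the actual diff):

```go
package sketch

// finish sketches the post-#16088 cleanup: only batches containing an
// EndTransaction drain their outstanding RPCs before returning.
type batchRequest struct{ hasEndTxn bool }

func finish(ba batchRequest, pending []<-chan error) error {
	if !ba.hasEndTxn {
		// Non-transactional batches (like this test's AdminChangeReplicas)
		// now return immediately, leaving RPCs in flight: the window
		// described above.
		return nil
	}
	var last error
	for _, ch := range pending {
		if err := <-ch; err != nil {
			last = err // transactional batches still wait for everything
		}
	}
	return last
}
```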

bdarnell commented 7 years ago

For context, removing SendNextTimeout has been discussed in https://github.com/cockroachdb/cockroach/issues/16119#issuecomment-304400427 and https://github.com/cockroachdb/cockroach/issues/6719#issuecomment-283468686. SendNextTimeout was set to its current value in #6728, after we (incorrectly) concluded that #6688 had resolved all the issues with this mechanism.

> It's pretty subtle to get right (and we haven't)

It's certainly fragile and easy to break (and we have), but with the waiting restored in #16181 I think things are back in working order. Are there any remaining known issues?

Either way, this feature is causing a lot of trouble for something that we have a hard time even articulating the benefit of. We should increase base.DefaultSendNextTimeout to a high value and see what that does to tail latencies. I expect that will show it's safe to remove, and/or give us ideas about less tricky alternatives.
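
Concretely, something like the following, assuming the constant lives where its name suggests:

```go
package base // location assumed from the constant's name

import "time"

// Raise the send-next fallback far beyond any plausible RPC latency so
// that, in practice, DistSender never speculatively contacts a second
// replica. Previously this fired after well under a second.
const DefaultSendNextTimeout = 10 * time.Minute
```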

spencerkimball commented 7 years ago

10m is a very high timeout. This might break the Jepsen tests, where we routinely see an RPC lost in a network partition and a new Raft leader elected; DistSender is then expected to retry, find the new leader, and complete the original request. 10s would be far more appropriate.

bdarnell commented 7 years ago

10m is intended to test the hypothesis that we can remove this timeout completely - if setting it to a high value causes problems, we want to know about it so we can see if there are better ways of handling them than this timeout. In the network partition scenario you describe, the GRPC connection health checking should abort the connection well before we'd hit a 10s timeout.
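
For reference, the kind of transport-level health checking I mean can be configured roughly like this (illustrative values, not our actual settings):

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dial enables client-side keepalives: the transport pings the server
// and closes the connection itself when a partition swallows the acks,
// independently of any per-RPC send-next timeout.
func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    3 * time.Second, // ping after 3s of inactivity
			Timeout: 2 * time.Second, // close if a ping goes unacked for 2s
		}),
	)
}
```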

petermattis commented 7 years ago

This hasn't been seen since #16181.