cockroachdb / cockroach


server: TestReportUsage failed under stress #33559

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/dadcaa493bca02996ebd87c0c85106e566893ce2

Parameters:

TAGS=
GOFLAGS=

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestReportUsage PKG=github.com/cockroachdb/cockroach/pkg/server TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1084076&tab=buildLog

I190108 06:16:14.571880 42468 server/server.go:1561  [n1] starting grpc/postgres server at 127.0.0.1:39049
I190108 06:16:14.571890 42468 server/server.go:1562  [n1] advertising CockroachDB node at 127.0.0.1:39049
W190108 06:16:14.571912 42468 jobs/registry.go:316  [n1] unable to get node liveness: node not in the liveness table
I190108 06:16:14.645072 42915 sql/event_log.go:135  [n1,intExec=optInToDiagnosticsStatReporting] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.enabled Value:true User:root}
I190108 06:16:14.656014 42921 sql/event_log.go:135  [n1,intExec=set-setting] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:2.1-3 User:root}
I190108 06:16:14.662932 42439 sql/event_log.go:135  [n1,intExec=disableNetTrace] Event: "set_cluster_setting", target: 0, info: {SettingName:trace.debug.enable Value:false User:root}
I190108 06:16:14.677147 42931 sql/event_log.go:135  [n1,intExec=initializeClusterSecret] Event: "set_cluster_setting", target: 0, info: {SettingName:cluster.secret Value:0922ef81-f214-4450-b846-fcd020d61a5a User:root}
I190108 06:16:14.686199 42938 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 50, info: {DatabaseName:defaultdb Statement:CREATE DATABASE IF NOT EXISTS defaultdb User:root}
I190108 06:16:14.690598 42943 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I190108 06:16:14.693169 42468 server/server.go:1620  [n1] done ensuring all necessary migrations have run
I190108 06:16:14.693227 42468 server/server.go:1623  [n1] serving sql connections
I190108 06:16:14.696910 42455 storage/replica_command.go:349  [n1,split,s1,r19/1:/{Table/23-Max}] initiating a split of this range at key /Table/24 [r20]
I190108 06:16:14.697915 42953 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:39049 Attrs: Locality:region=east,zone=somestring,state=ny,city=nyc ServerVersion:2.1-3 BuildTag:v2.2.0-alpha.20181217-468-gdadcaa4 StartedAt:1546928174567838181 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:24df0eb7-da1d-48e7-89e8-e4eb98e8aabf StartedAt:1546928174567838181 LastUp:1546928174567838181}
I190108 06:16:14.708643 42959 sql/event_log.go:135  [n1,intExec=set-version] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:2.1-3 User:root}
I190108 06:16:14.708746 42950 server/server_update.go:86  [n1] successfully upgraded cluster version
I190108 06:16:14.718221 42310 sql/event_log.go:135  [n1,client=127.0.0.1:33440,user=root] Event: "create_database", target: 52, info: {DatabaseName:somestring Statement:CREATE DATABASE somestring User:root}
I190108 06:16:14.722599 42310 sql/event_log.go:135  [n1,client=127.0.0.1:33440,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:server.time_until_store_dead Value:00:01:30 User:root}
I190108 06:16:14.726783 42310 sql/event_log.go:135  [n1,client=127.0.0.1:33440,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.send_crash_reports Value:false User:root}
I190108 06:16:14.728973 42468 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
W190108 06:16:14.729228 42884 server/node.go:845  [n1,summaries] error recording initial status summaries: failed to send RPC: sending to all 1 replicas failed; last error: (err: node unavailable; try another peer) <nil>
I190108 06:16:14.729260 42468 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
I190108 06:16:14.729323 42468 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
--- FAIL: TestReportUsage (0.20s)
    updates_test.go:265: error applying zone config "constraints: [+zone=somestring, +somestring]" to "TABLE system.rangelog": pq: constraint "+zone=somestring" matches no existing nodes within the cluster - did you enter it correctly?
tbg commented 5 years ago

@andreimatei I suppose we're not waiting where we used to wait before.

It also seems that this would hit users if they ran scripts with pretty tight timings.

andreimatei commented 5 years ago

Can't repro for the life of me; tried stressrace, stress, different -p, even stressing the whole package for a while.

But I've done some digging. The complaint is that no node had the zone=somestring label, even though that label is part of the Locality object with which the "server" is created. A node writes its locality as part of its node status record, and the way that record gets written is perhaps funky: it's written synchronously by Node.bootstrapStores(), but that method is not called for the node that bootstraps the cluster. On that other code path (everything around the "bootstrapping" code is so confusing), the node status is not written synchronously. I think it should be, or perhaps the two code paths should be unified more.

There's also a task that periodically writes that node status. The failure logs show that, after the test failed and the stopper started quiescing, that task was in the process of writing the record for the first time, and that write failed (because of the quiescing). This supports the theory that the record simply wasn't written in time.

I'll do something.

tbg commented 5 years ago

Repros immediately with this diff:

diff --git a/pkg/server/status/recorder.go b/pkg/server/status/recorder.go
index d5c680ac13..4dd78e35c1 100644
--- a/pkg/server/status/recorder.go
+++ b/pkg/server/status/recorder.go
@@ -500,6 +500,7 @@ func (mr *MetricsRecorder) GenerateNodeStatus(ctx context.Context) *statuspb.Nod
 func (mr *MetricsRecorder) WriteNodeStatus(
        ctx context.Context, db *client.DB, nodeStatus statuspb.NodeStatus,
 ) error {
+       time.Sleep(time.Second)
        mr.writeSummaryMu.Lock()
        defer mr.writeSummaryMu.Unlock()
        key := keys.NodeStatusKey(nodeStatus.Desc.NodeID)

I agree that we should make sure the node status is there before a node "finishes" its startup sequence, at least when used with the --background flag, but also so that tests "just work". (This problem has cropped up repeatedly; I think it is also the root cause of https://github.com/cockroachdb/cockroach/pull/33442, which could potentially hit prod customers.)

andreimatei commented 5 years ago

33442 "fixed" #33301, in which you wrote: "this seems to eventually rely on (*MetricsRecorder.WriteNodeStatus) (for the NodesRequest) and on there being a healthy RPC connection to each node, and on each node being marked as live (for CheckNodeHealthAndVersion)."

Do you have any thoughts on the connection-establishment and "marking the node as live" parts? That is, should a node joining the cluster wait for these things in Server.Start()? I think the writing of the liveness record is probably already synchronous, but I'm not sure.

tbg commented 5 years ago

It might be enough to hoist the first write out of the worker to address the node status issue, but I think that'll probably just give you other problems (like blocking the startup goroutine with a write that can't succeed until the startup goroutine has done $SOMETHING_ELSE; not sure).

https://github.com/cockroachdb/cockroach/blob/352bdf4630d5e16b6200237acaaaca22f7470511/pkg/server/node.go#L844-L846

I don't think we write liveness synchronously: https://github.com/cockroachdb/cockroach/blob/d591a49931a665f896b064363eddd975281ab197/pkg/storage/node_liveness.go#L404-L467

Perhaps both of these should register themselves with a WaitGroup so that we have something to block on before we announce the server "ready".

Waiting for RPC connections to all nodes is hopefully not necessary, since that could pick up a node that gets disconnected, and then we'd be stuck waiting on it for no good reason. I don't expect that we'll need to add this in the near future.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/40e403544d60b1a44b8b1ed961a817c77d67aa31

Parameters:

TAGS=
GOFLAGS=

To repro, try the same stress invocation as in the first report above.

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1131516&tab=buildLog

8      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
1      [async] intent_resolver_batcher
W190209 06:24:48.385834 44155 storage/intentresolver/intent_resolver.go:721  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190209 06:24:48.385880 43671 util/stop/stopper.go:536  quiescing; tasks left:
8      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.385913 43671 util/stop/stopper.go:536  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.385970 43671 util/stop/stopper.go:536  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.386016 43671 util/stop/stopper.go:536  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.386299 43671 util/stop/stopper.go:536  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.386359 43671 util/stop/stopper.go:536  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.386403 43671 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
I190209 06:24:48.386452 43671 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
1      [async] storage.IntentResolver: cleanup txn records
I190209 06:24:48.386500 43671 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      node.Node: batch
W190209 06:24:48.386548 43713 server/node.go:842  [n1,summaries] error recording initial status summaries: failed to send RPC: sending to all 1 replicas failed; last error: (err: [NotLeaseHolderError] r3: replica (n1,s1):1 not lease holder; lease holder unknown) <nil>
I190209 06:24:48.386575 43671 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: batch
I190209 06:24:48.392551 43671 util/stop/stopper.go:536  quiescing; tasks left:
1      storage.IntentResolver: processing intents
1      node.Node: batch
1      [async] closedts-rangefeed-subscriber
I190209 06:24:48.392580 43671 util/stop/stopper.go:536  quiescing; tasks left:
1      storage.IntentResolver: processing intents
1      node.Node: batch
--- FAIL: TestReportUsage (0.21s)
    updates_test.go:265: error applying zone config "constraints: [+zone=somestring, +somestring]" to "TABLE system.rangelog": pq: constraint "+zone=somestring" matches no existing nodes within the cluster - did you enter it correctly?
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/2e16b7357d5a15d87cd284d5f2c12e424ed29ca1

Parameters:

TAGS=
GOFLAGS=

To repro, try the same stress invocation as in the first report above.

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1146384&tab=buildLog

I190221 06:28:37.416172 45057 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:34441 Attrs: Locality:region=east,zone=somestring,state=ny,city=nyc ServerVersion:2.1-8 BuildTag:v2.2.0-alpha.20190211-271-g2e16b73 StartedAt:1550730517267623874 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:b3a45ada-bad9-46d9-ac0d-902767fe4d36 StartedAt:1550730517267623874 LastUp:1550730517267623874}
I190221 06:28:37.425686 45054 server/server_update.go:64  [n1] failed attempt to upgrade cluster version, error: no live nodes found
I190221 06:28:37.428948 44974 sql/event_log.go:135  [n1,client=127.0.0.1:48672,user=root] Event: "create_database", target: 52, info: {DatabaseName:somestring Statement:CREATE DATABASE somestring User:root}
I190221 06:28:37.432537 44974 sql/event_log.go:135  [n1,client=127.0.0.1:48672,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:server.time_until_store_dead Value:00:01:30 User:root}
I190221 06:28:37.437019 44974 sql/event_log.go:135  [n1,client=127.0.0.1:48672,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.send_crash_reports Value:false User:root}
I190221 06:28:37.438876 44469 util/stop/stopper.go:547  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
3      storage.IntentResolver: processing intents
2      [async] intent_resolver_ir_batcher
2      [async] intent_resolver_gc_batcher
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
1      [async] auto-upgrade
I190221 06:28:37.438915 44469 util/stop/stopper.go:547  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
3      storage.IntentResolver: processing intents
2      [async] intent_resolver_ir_batcher
2      [async] intent_resolver_gc_batcher
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
I190221 06:28:37.439099 44469 util/stop/stopper.go:547  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
3      storage.IntentResolver: processing intents
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
W190221 06:28:37.439213 45030 storage/intentresolver/intent_resolver.go:821  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190221 06:28:37.439243 44469 util/stop/stopper.go:547  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
3      storage.IntentResolver: processing intents
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
I190221 06:28:37.439285 44469 util/stop/stopper.go:547  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
3      storage.IntentResolver: processing intents
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
I190221 06:28:37.439312 44469 util/stop/stopper.go:547  quiescing; tasks left:
3      storage.IntentResolver: processing intents
2      [async] closedts-rangefeed-subscriber
1      node.Node: writing summary
W190221 06:28:37.439340 44477 server/node.go:839  [n1,summaries] error recording initial status summaries: failed to send RPC: sending to all 1 replicas failed; last error: (err: [NotLeaseHolderError] r3: replica (n1,s1):1 not lease holder; lease holder unknown) <nil>
I190221 06:28:37.439359 44469 util/stop/stopper.go:547  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
I190221 06:28:37.439420 44469 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
--- FAIL: TestReportUsage (0.21s)
    updates_test.go:265: error applying zone config "constraints: [+zone=somestring, +somestring]" to "TABLE system.rangelog": pq: constraint "+zone=somestring" matches no existing nodes within the cluster - did you enter it correctly?
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/9938cb1a2cca4c0350244f76845f0c61391d44a7

Parameters:

TAGS=
GOFLAGS=

To repro, try the same stress invocation as in the first report above.

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249609&tab=buildLog

W190419 05:59:22.419078 46652 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I190419 05:59:22.440041 46794 sql/event_log.go:135  [n1,intExec=optInToDiagnosticsStatReporting] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.enabled Value:true User:root}
I190419 05:59:22.454469 46394 sql/event_log.go:135  [n1,intExec=set-setting] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:19.1 User:root}
I190419 05:59:22.465160 46401 sql/event_log.go:135  [n1,intExec=disableNetTrace] Event: "set_cluster_setting", target: 0, info: {SettingName:trace.debug.enable Value:false User:root}
I190419 05:59:22.479362 46807 sql/event_log.go:135  [n1,intExec=initializeClusterSecret] Event: "set_cluster_setting", target: 0, info: {SettingName:cluster.secret Value:6689437e-6ada-42f1-923c-1213932e1191 User:root}
I190419 05:59:22.484648 46785 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 50, info: {DatabaseName:defaultdb Statement:CREATE DATABASE IF NOT EXISTS defaultdb User:root}
I190419 05:59:22.487564 46870 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I190419 05:59:22.493049 46022 server/server.go:1650  [n1] done ensuring all necessary migrations have run
I190419 05:59:22.493078 46022 server/server.go:1653  [n1] serving sql connections
I190419 05:59:22.495459 46879 server/server_update.go:64  [n1] failed attempt to upgrade cluster version, error: no live nodes found
I190419 05:59:22.495519 46882 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:41659 Attrs: Locality:region=east,zone=somestring,state=ny,city=nyc ServerVersion:19.1 BuildTag:v19.1.0-rc.3-17-g9938cb1 StartedAt:1555653562364316878 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:89a04112-d05f-4892-9195-7587ad32ad08 StartedAt:1555653562364316878 LastUp:1555653562364316878}
I190419 05:59:22.510599 46885 sql/event_log.go:135  [n1,client=127.0.0.1:52332,user=root] Event: "create_database", target: 52, info: {DatabaseName:somestring Statement:CREATE DATABASE somestring User:root}
I190419 05:59:22.514900 46885 sql/event_log.go:135  [n1,client=127.0.0.1:52332,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:server.time_until_store_dead Value:00:01:30 User:root}
I190419 05:59:22.520476 46885 sql/event_log.go:135  [n1,client=127.0.0.1:52332,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.send_crash_reports Value:false User:root}
I190419 05:59:22.523407 46022 util/stop/stopper.go:546  quiescing; tasks left:
8      [async] storage.IntentResolver: cleanup txn records
2      [async] intent_resolver_ir_batcher
2      [async] intent_resolver_gc_batcher
2      [async] closedts-rangefeed-subscriber
1      storage.IntentResolver: processing intents
1      node.Node: writing summary
1      [async] auto-upgrade
W190419 05:59:22.523443 46747 server/node.go:839  [n1,summaries] error recording initial status summaries: failed to send RPC: sending to all 1 replicas failed; last error: (err: [NotLeaseHolderError] r3: replica (n1,s1):1 not lease holder; lease holder unknown) <nil>
W190419 05:59:22.523456 46896 storage/intentresolver/intent_resolver.go:822  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190419 05:59:22.523741 46022 util/stop/stopper.go:546  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
I190419 05:59:22.523940 46022 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
--- FAIL: TestReportUsage (0.20s)
    updates_test.go:404: error applying zone config "constraints: [+zone=somestring, +somestring]" to "TABLE system.rangelog": pq: constraint "+zone=somestring" matches no existing nodes within the cluster - did you enter it correctly?
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/be35db9bddf0739a3971d4095ce19d896e251738

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try the same stress invocation as in the first report above.

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1262398&tab=buildLog

I190426 05:53:07.181128 81419 server/server.go:1587  [n1] starting grpc/postgres server at 127.0.0.1:45999
I190426 05:53:07.181144 81419 server/server.go:1588  [n1] advertising CockroachDB node at 127.0.0.1:45999
I190426 05:53:07.244248 81898 sql/event_log.go:135  [n1,intExec=optInToDiagnosticsStatReporting] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.enabled Value:true User:root}
I190426 05:53:07.256440 81882 sql/event_log.go:135  [n1,intExec=set-setting] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:19.1-1 User:root}
I190426 05:53:07.265556 82006 sql/event_log.go:135  [n1,intExec=disableNetTrace] Event: "set_cluster_setting", target: 0, info: {SettingName:trace.debug.enable Value:false User:root}
I190426 05:53:07.280824 82186 sql/event_log.go:135  [n1,intExec=initializeClusterSecret] Event: "set_cluster_setting", target: 0, info: {SettingName:cluster.secret Value:ed1ff83b-9c04-465a-bfc6-7a9deb31455e User:root}
I190426 05:53:07.288256 82275 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 50, info: {DatabaseName:defaultdb Statement:CREATE DATABASE IF NOT EXISTS defaultdb User:root}
I190426 05:53:07.292035 82265 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I190426 05:53:07.299506 81419 server/server.go:1653  [n1] done ensuring all necessary migrations have run
I190426 05:53:07.299545 81419 server/server.go:1656  [n1] serving sql connections
I190426 05:53:07.302606 82437 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:45999 Attrs: Locality:region=east,zone=somestring,state=ny,city=nyc ServerVersion:19.1-1 BuildTag:v19.2.0-alpha.00000000-111-gbe35db9 StartedAt:1556257987159053737 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:2f691a90-d837-4a7b-896f-6c8adfdfaccb StartedAt:1556257987159053737 LastUp:1556257987159053737}
I190426 05:53:07.302805 82434 server/server_update.go:64  [n1] failed attempt to upgrade cluster version, error: no live nodes found
I190426 05:53:07.314419 82193 sql/event_log.go:135  [n1,client=127.0.0.1:41286,user=root] Event: "create_database", target: 52, info: {DatabaseName:somestring Statement:CREATE DATABASE somestring User:root}
I190426 05:53:07.318240 82193 sql/event_log.go:135  [n1,client=127.0.0.1:41286,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:server.time_until_store_dead Value:00:01:30 User:root}
I190426 05:53:07.322950 82193 sql/event_log.go:135  [n1,client=127.0.0.1:41286,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.send_crash_reports Value:false User:root}
I190426 05:53:07.328343 81419 util/stop/stopper.go:546  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] intent_resolver_ir_batcher
2      [async] intent_resolver_gc_batcher
2      [async] closedts-rangefeed-subscriber
1      [async] auto-upgrade
I190426 05:53:07.328406 81419 util/stop/stopper.go:546  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] intent_resolver_ir_batcher
2      [async] intent_resolver_gc_batcher
2      [async] closedts-rangefeed-subscriber
W190426 05:53:07.328469 82423 storage/intentresolver/intent_resolver.go:823  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190426 05:53:07.328569 81419 util/stop/stopper.go:546  quiescing; tasks left:
2      [async] closedts-rangefeed-subscriber
I190426 05:53:07.328884 81419 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
--- FAIL: TestReportUsage (0.20s)
    updates_test.go:404: error applying zone config "constraints: [+zone=somestring, +somestring]" to "TABLE system.rangelog": pq: constraint "+zone=somestring" matches no existing nodes within the cluster - did you enter it correctly?
andreimatei commented 5 years ago

Also seen in #36990.