cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/upgradeinterlockccl_test: TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances failed #130394

Open cockroach-teamcity opened 2 months ago

cockroach-teamcity commented 2 months ago

pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/upgradeinterlockccl_test.TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances failed on release-24.1 @ 63e816eb6192508225186850d65116fe219c1cd3:

=== RUN   TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances1619131486
    test_log_scope.go:81: use -show-logs to present logs inline
    local_test_util_test.go:52: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) upgrade interlock test: running variant "lagging_binary_version", configuration: "pause_after_second_check_of_instances"
    local_test_util_test.go:176: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) creating an initial tenant server
    local_test_util_test.go:184: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) verifying the tenant version
    local_test_util_test.go:188: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) verifying basic SQL functionality
    local_test_util_test.go:193: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) verifying the version of the storage cluster
    local_test_util_test.go:200: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) upgrading the storage cluster
    local_test_util_test.go:203: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) checking the tenant after the storage cluster upgrade
    local_test_util_test.go:207: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) start upgrading the tenant
    local_test_util_test.go:262: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) upgrader is ready
    local_test_util_test.go:264: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) starting another tenant server
    local_test_util_test.go:306: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) shutting down the other tenant server
    local_test_util_test.go:315: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) waiting for the instance table to get in the right state
    local_test_util_test.go:326: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) resuming upgrade
    local_test_util_test.go:328: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) waiting for upgrade to complete
    local_test_util_test.go:330: (TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances) upgrade completed
    local_test_util_test.go:339: 
            Error Trace:    pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/upgradeinterlockccl_test/pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/local_test_util_test.go:339
                                        pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/upgradeinterlockccl_test/bazel-out/k8-fastbuild/bin/pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/generated_test.go:129
            Error:          Error "migration-job-find-already-completed: aborted in DistSender: result is ambiguous: node unavailable; try another peer" does not contain "preventing SQL server from starting because its binary version is too low for the tenant active version"
            Test:           TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances1619131486
--- FAIL: TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances (9.06s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-42030

renatolabs commented 2 months ago

This was assigned to test-eng because it's a test generated by a template (#107885).

I'm reassigning to @cockroachdb/server for analysis.

rimadeodhar commented 2 months ago

@stevendanna will investigate with @shubhamdhama to triage on whether this is a release blocker.

shubhamdhama commented 2 months ago

@stevendanna and I investigated this issue, and here’s what we've found so far:

The creation of otherServer was expected to fail here: https://github.com/cockroachdb/cockroach/blob/d28fe4b6095a1d3c54a8562b64f36abf108b62c2/pkg/server/server_sql.go#L1663-L1668. However, it actually failed here: https://github.com/cockroachdb/cockroach/blob/d28fe4b6095a1d3c54a8562b64f36abf108b62c2/pkg/server/server_sql.go#L1638-L1643 while executing a query to check if the migration had completed.

I240910 08:19:49.458533 25 util/startup/retry.go:150 ⋮ [T10,Vcluster-10,nsql2] 976  failed ‹check if migration completed› during node startup, retrying ‹migration-job-find-already-completed›: aborted in DistSender: result is ambiguous: node unavailable; try another peer
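
To make that ordering concrete, here is a toy sketch (not the actual server_sql.go code; every name in it is illustrative) of the sequence described above: the migration-completion check runs before the binary-version interlock check, so a KV-layer error in the first step surfaces instead of the interlock error the test asserts on.

```go
// Toy illustration only. It mimics the ordering described above with
// made-up function names; it is not CockroachDB code.
package main

import (
	"errors"
	"fmt"
)

var errAmbiguous = errors.New(
	"migration-job-find-already-completed: aborted in DistSender: result is ambiguous: node unavailable; try another peer")

// checkStartupMigrations stands in for the "check if migration completed"
// query that runs first during SQL server startup.
func checkStartupMigrations(kvAvailable bool) error {
	if !kvAvailable {
		return errAmbiguous
	}
	return nil
}

// checkBinaryVersion stands in for the later interlock check the test
// expects to fail.
func checkBinaryVersion(binaryTooOld bool) error {
	if binaryTooOld {
		return errors.New("preventing SQL server from starting because its binary version is too low for the tenant active version")
	}
	return nil
}

func startSQLServer(kvAvailable, binaryTooOld bool) error {
	// Step 1 runs before step 2, so a KV-layer failure here masks the
	// binary-version error the test is asserting on.
	if err := checkStartupMigrations(kvAvailable); err != nil {
		return err
	}
	return checkBinaryVersion(binaryTooOld)
}

func main() {
	// With the liveness session deleted mid-startup, step 1 fails first and
	// the test never sees the expected interlock error.
	fmt.Println(startSQLServer(false, true))
}
```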

To understand why the current server was unavailable, we need to examine the logs from when otherServer was being created.

I240910 08:19:49.175570 25 sql/sqlliveness/slinstance/slinstance.go:441 ⋮ [T10,Vcluster-10,nsql?] 950  starting SQL liveness instance
I240910 08:19:49.178569 7840 sql/sqlliveness/slstorage/slstorage.go:544 ⋮ [T10,Vcluster-10,nsql?] 951  inserted sqlliveness session 0101803af65fd7f1cb452eb2ab8c8fb42ecee6 with expiry 1725956389.425603341,0
I240910 08:19:49.178590 7840 sql/sqlliveness/slinstance/slinstance.go:257 ⋮ [T10,Vcluster-10,nsql?] 952  created new SQL liveness session 0101803af65fd7f1cb452eb2ab8c8fb42ecee6
I240910 08:19:49.178610 25 sql/sqlinstance/instancestorage/instancestorage.go:304 ⋮ [T10,Vcluster-10,nsql?] 953  assigning instance id to rpc addr ‹127.0.0.1:33907› and sql addr ‹127.0.0.1:34397›
I240910 08:19:49.179799 25 server/server_sql.go:1542 ⋮ [T10,Vcluster-10,nsql?] 954  bound sqlinstance: Instance{RegionPrefix: gA==, InstanceID: 2, SQLAddr: ‹127.0.0.1:34397›, RPCAddr: ‹127.0.0.1:33907›, SessionID: 0101803af65fd7f1cb452eb2ab8c8fb42ecee6, Locality: , BinaryVersion: 23.1}
I240910 08:19:49.179903 7861 sql/sqlstats/persistedsqlstats/provider.go:175 ⋮ [T10,Vcluster-10,nsql2] 955  starting sql-stats-worker with initial delay: 9m28.468437879s
I240910 08:19:49.180239 25 upgrade/upgrademanager/manager.go:171 ⋮ [T10,Vcluster-10,nsql2] 956  running permanent upgrades up to version: 22.1
I240910 08:19:49.182368 7890 kv/kvclient/rangefeed/rangefeedcache/watcher.go:345 ⋮ [T10,Vcluster-10,nsql2] 957  system-config-cache: established range feed cache
I240910 08:19:49.443884 7858 sql/temporary_schema.go:504 ⋮ [T10,Vcluster-10,nsql2] 958  running temporary object cleanup background job
I240910 08:19:49.446568 7956 sql/sqlliveness/slstorage/slstorage.go:395 ⋮ [nsql1] 959  deleted session 0101803af65fd7f1cb452eb2ab8c8fb42ecee6 which expired at 1725956389.425603341,0
W240910 08:19:49.452650 7840 sql/sqlliveness/slinstance/slinstance.go:341 ⋮ [T10,Vcluster-10,nsql2] 960  exiting heartbeat loop
W240910 08:19:49.452672 7840 sql/sqlliveness/slinstance/slinstance.go:328 ⋮ [T10,Vcluster-10,nsql2] 961  exiting heartbeat loop with error: session record deleted 
I240910 08:19:49.452695 7749 server/testserver.go:1740 ⋮ [T10,Vcluster-10,nsql2] 962  server requesting spontaneous shutdown: sql liveness session deleted
I240910 08:19:49.452926 7783 server/start_listen.go:110 ⋮ [T10,Vcluster-10,nsql2] 963  server shutting down: instructing cmux to stop accepting

At 08:19:49.178569, the SQL liveness session for server nsql2 was created with an expiration timestamp of 1725956389.425603341 (08:19:49.425603341). However, this session was deleted shortly afterwards by nsql1 at 08:19:49.446568. As a result, nsql2 began shutting down, and everything that followed, including the unexpected query failure, was a consequence of this shutdown. nsql1 treated the session as expired because of the test setup: https://github.com/cockroachdb/cockroach/blob/9d04b4704b144bb1095ca29a23c6ef4bb3576233/pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl/local_test_util_test.go#L102-L119. The test sets the session TTL to 250ms and the heartbeat interval to 25ms. Even so, it's unclear why nsql2's heartbeats did not keep the session alive before nsql1 deleted it.
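
As a back-of-the-envelope check of those timestamps against the test's 250ms TTL and 25ms heartbeat (my arithmetic, not taken directly from the logs):

- the session was inserted at 08:19:49.178569 with expiry 08:19:49.425603, i.e. about 247ms later, consistent with the 250ms TTL;
- nsql1 deleted it at 08:19:49.446568, about 21ms after that expiry had passed;
- the expiration recorded at deletion is the original one, which suggests that in the ~268ms between creation and deletion none of the ~25ms heartbeats managed to extend the session.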

Overall, this doesn't seem to be an upgrade-related issue. However, we still need to determine why nsql2's session was allowed to expire and get deleted by nsql1. I believe the SQL Foundations team has more expertise with the sqlliveness infrastructure, so I'm moving this to them; feel free to reassign it to the appropriate team if needed. I don't believe this is a release blocker, since the test overrides many settings with values far from a realistic production setup, which could itself be causing the failure.

rafiss commented 1 month ago

I see that even in a successful run, the behavior of nsql1 deleting nsql2's session occurs. Example logs from a successful run:

I241010 17:17:35.280547 7278 sql/sqlliveness/slinstance/slinstance.go:252 ⋮ [T10,Vcluster-10,nsql?] 987  created new SQL liveness session 010180ddaeeb011ef54243b780c2ad71e6b623
I241010 17:17:35.280557 66 sql/sqlinstance/instancestorage/instancestorage.go:299 ⋮ [T10,Vcluster-10,nsql?] 988  assigning instance id to rpc addr ‹127.0.0.1:57629› and sql addr ‹127.0.0.1:57627›
I241010 17:17:35.281130 66 server/server_sql.go:1538 ⋮ [T10,Vcluster-10,nsql?] 989  bound sqlinstance: Instance{RegionPrefix: gA==, InstanceID: 2, SQLAddr: ‹127.0.0.1:57627›, RPCAddr: ‹127.0.0.1:57629›, SessionID: 010180ddaeeb011ef54243b780c2ad71e6b623, Locality: , BinaryVersion: 23.1}
I241010 17:17:35.281203 7304 sql/sqlstats/persistedsqlstats/provider.go:170 ⋮ [T10,Vcluster-10,nsql2] 990  starting sql-stats-worker with initial delay: 9m47.846479196s
I241010 17:17:35.281218 66 upgrade/upgrademanager/manager.go:166 ⋮ [T10,Vcluster-10,nsql2] 991  running permanent upgrades up to version: 22.1
I241010 17:17:35.281518 7317 kv/kvclient/rangefeed/rangefeedcache/watcher.go:340 ⋮ [T10,Vcluster-10,nsql2] 992  system-config-cache: established range feed cache
I241010 17:17:35.282222 7301 sql/temporary_schema.go:499 ⋮ [T10,Vcluster-10,nsql2] 993  running temporary object cleanup background job
I241010 17:17:35.285367 66 upgrade/upgrademanager/manager.go:234 ⋮ [T10,Vcluster-10,nsql2] 994  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades
I241010 17:17:35.285378 66 upgrade/upgrademanager/manager.go:240 ⋮ [T10,Vcluster-10,nsql2] 995  running permanent upgrade for version 0.0-upgrading-step-004
I241010 17:17:35.286188 66 server/server_sql.go:1640 ⋮ [T10,Vcluster-10,nsql2] 996  done ensuring all necessary startup migrations have run
I241010 17:17:35.286275 7301 sql/temporary_schema.go:549 ⋮ [T10,Vcluster-10,nsql2] 997  found 0 temporary schemas
I241010 17:17:35.286287 7301 sql/temporary_schema.go:552 ⋮ [T10,Vcluster-10,nsql2] 998  early exiting temporary schema cleaner as no temporary schemas were found
I241010 17:17:35.286291 7301 sql/temporary_schema.go:553 ⋮ [T10,Vcluster-10,nsql2] 999  completed temporary object cleanup job
I241010 17:17:35.286294 7301 sql/temporary_schema.go:634 ⋮ [T10,Vcluster-10,nsql2] 1000  temporary object cleaner next scheduled to run at 2024-10-10 17:47:35.281231 +0000 UTC m=+1801.576187335
I241010 17:17:35.286559 7218 server/start_listen.go:105 ⋮ [T10,Vcluster-10,nsql2] 1001  server shutting down: instructing cmux to stop accepting
W241010 17:17:35.286605 7222 kv/kvclient/kvtenant/connector.go:417 ⋮ [T10,Vcluster-10,nsql2,tenant-connector] 1002  error consuming GossipSubscription RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
W241010 17:17:35.286641 7278 sql/sqlliveness/slinstance/slinstance.go:336 ⋮ [T10,Vcluster-10,nsql2] 1003  exiting heartbeat loop
W241010 17:17:35.286648 7278 sql/sqlliveness/slinstance/slinstance.go:323 ⋮ [T10,Vcluster-10,nsql2] 1004  exiting heartbeat loop with error: node unavailable; try another peer
W241010 17:17:35.286678 7223 kv/kvclient/kvtenant/setting_overrides.go:62 ⋮ [T10,Vcluster-10,nsql2,tenant-connector] 1005  error consuming TenantSettings RPC: recv msg error: grpc: ‹context canceled› [code 1/Canceled]
E241010 17:17:35.286831 7313 jobs/registry.go:947 ⋮ [T10,Vcluster-10,nsql2] 1006  failed to serve pause and cancel requests: could not query jobs table: ‹cancel/pause-requested›: failed to read query result: query execution canceled
I241010 17:17:35.287157 2070 kv/kvclient/kvtenant/setting_overrides.go:232 ⋮ [T10,Vcluster-10,nsql1,tenant-connector] 1007  received 4 setting overrides with precedence ‹TENANT_SPECIFIC_OVERRIDES› (incremental=false)
I241010 17:17:35.287170 2070 kv/kvclient/kvtenant/setting_overrides.go:134 ⋮ [T10,Vcluster-10,nsql1,tenant-connector] 1008  received initial tenant settings
I241010 17:17:35.287186 2079 server/settingswatcher/settings_watcher.go:509 ⋮ [T10,Vcluster-10,nsql1] 1009  updating storage cluster cached version from: 24.1 to: 24.1
E241010 17:17:35.287997 7316 spanconfig/spanconfigmanager/manager.go:133 ⋮ [T10,Vcluster-10,nsql2] 1010  error starting auto span config reconciliation job: ‹find-running-jobs-of-type›: aborted in DistSender: result is ambiguous: node unavailable; try another peer
I241010 17:17:35.295198 7196 4@util/log/event_log.go:27 ⋮ [T10,Vcluster-10,nsql1,client=127.0.0.1:57633,hostssl,user=root] 1011 ={"Timestamp":1728580655295197000,"EventType":"client_authentication_ok","InstanceID":1,"Network":"tcp","RemoteAddress":"‹127.0.0.1:57633›","SessionID":"17fd26f29479aa280000000000000001","Transport":"hostssl","User":"‹root›","SystemIdentity":"‹root›","Method":"cert-password"}
I241010 17:17:35.541545 186 1@gossip/gossip.go:1419 ⋮ [T1,Vsystem,n1] 1012  node has connected to cluster via gossip
I241010 17:17:35.541641 186 kv/kvserver/stores.go:278 ⋮ [T1,Vsystem,n1] 1013  wrote 0 node addresses to persistent storage
I241010 17:17:35.652038 7845 sql/sqlliveness/slstorage/slstorage.go:390 ⋮ [nsql1] 1014  deleted session 010180ddaeeb011ef54243b780c2ad71e6b623 which expired at 1728580655.529896000,0

So I don't think that is the problem. The test is intentionally setting the expiration of the SQL Liveness session to be very small, and the expired session cleanup is working as intended.

The issue here seems to be that the test expects a specific error to occur during startup, but because of the expired/deleted session, different kinds of errors are possible.

Since nothing seems off in sqlliveness, I'll move this back to DB Server to decide if they want to update the error assertions. However, I also could not repro this after 1000 runs with ./dev test pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl -f=TestTenantUpgradeInterlock_lagging_binary_version_pause_after_second_check_of_instances --stress (ran on the 24.1 branch), so this flake seems quite rare and is likely safe to ignore.
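
If DB Server does decide to relax the assertion, one option (a minimal sketch only, assuming the test keeps its current shape; the helper and variable names below are placeholders, not the real ones in local_test_util_test.go) would be to accept either the interlock error or the ambiguous-result error via the usual regexp-matching helper in pkg/testutils:

```go
// Sketch only: a relaxed assertion that tolerates the ambiguous-result error
// seen in this flake in addition to the expected interlock error. The helper
// name is hypothetical.
package upgradeinterlockccl_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

func requireStartupError(t *testing.T, err error) {
	t.Helper()
	re := "binary version is too low for the tenant active version" +
		"|result is ambiguous: node unavailable"
	if !testutils.IsError(err, re) {
		t.Fatalf("unexpected error starting the other tenant server: %v", err)
	}
}
```

Whether to widen the assertion like this or instead retry startup on ambiguous-result errors is a judgment call for the owning team; the sketch above only shows the narrower change.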