
cli: TestHalfOnlineLossOfQuorumRecovery failed #127857

Closed: cockroach-teamcity closed this issue 4 days ago

cockroach-teamcity commented 1 month ago

cli.TestHalfOnlineLossOfQuorumRecovery failed with artifacts on master @ d6381b7a0e2b18617c0a0b23db38e7103457a79e:

=== RUN   TestHalfOnlineLossOfQuorumRecovery
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/logTestHalfOnlineLossOfQuorumRecovery2282145052
    test_log_scope.go:81: use -show-logs to present logs inline
[debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:46547 --plan=/artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/TestHalfOnlineLossOfQuorumRecovery3047652729/recovery-plan.json]
[debug recover apply-plan --certs-dir=test_certs --host=127.0.0.1:46547 --confirm=y /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/TestHalfOnlineLossOfQuorumRecovery3047652729/recovery-plan.json]
[debug recover verify --certs-dir=test_certs --host=127.0.0.1:46547 /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/TestHalfOnlineLossOfQuorumRecovery3047652729/recovery-plan.json]
[debug recover verify --certs-dir=test_certs --host=127.0.0.1:46547 /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/TestHalfOnlineLossOfQuorumRecovery3047652729/recovery-plan.json]
    debug_recover_loss_of_quorum_test.go:641: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:641
            Error:          "debug recover verify --certs-dir=test_certs --host=127.0.0.1:46547 /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/TestHalfOnlineLossOfQuorumRecovery3047652729/recovery-plan.json\nChecking application of recovery plan eeee6454-ba35-4385-b4e2-7d710ed02393\nUnavailable ranges:\n r1 : loss of quorum, key span /{Min-System/NodeLiveness}\n r2 : loss of quorum, key span /System/NodeLiveness{-Max}\n r3 : loss of quorum, key span /System/{NodeLivenessMax-tsd}\n r4 : loss of quorum, key span /System{/tsd-tse}\n r5 : loss of quorum, key span /{Systemtse-Table/0}\n r6 : loss of quorum, key span /Table/{0-3}\n r7 : loss of quorum, key span /Table/{3-4}\n r8 : loss of quorum, key span /Table/{4-5}\n r9 : loss of quorum, key span /Table/{5-6}\n r10 : loss of quorum, key span /Table/{6-7}\n r11 : loss of quorum, key span /Table/{7-8}\n r12 : loss of quorum, key span /Table/{8-9}\n r13 : loss of quorum, key span /Table/{9-11}\n r14 : loss of quorum, key span /Table/1{1-2}\n r15 : loss of quorum, key span /Table/1{2-3}\n r16 : loss of quorum, key span /Table/1{3-4}\n r17 : loss of quorum, key span /Table/1{4-5}\n r18 : loss of quorum, key span /Table/1{5-6}\n r19 : loss of quorum, key span /Table/1{6-7}\n r20 : loss of quorum, key span /Table/1{7-8}\nFailed to complete range health check: found more failed ranges than limit 20\nRecovery plan application progress:\n plan applied successfully on node n1\nAll dead nodes are decommissioned.\nERROR: loss of quorum recovery did not fully succeed\n" does not contain "Loss of quorum recovery is complete."
            Test:           TestHalfOnlineLossOfQuorumRecovery
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/6c689cc523db9d8596d03c0a6ef213e4/logTestHalfOnlineLossOfQuorumRecovery2282145052
--- FAIL: TestHalfOnlineLossOfQuorumRecovery (10.84s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/kv @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-40642

arulajmani commented 1 month ago

I looked at this a bit, and it seems like NodeVitality doesn't consider n1 live after it's brought back up, for some reason. We can see n1 being restarted after the recovery:

I240729 19:31:30.418892 13 1@util/log/event_log.go:44 ⋮ [T1,Vsystem,n1] 820 ={"Timestamp":1722281490418888782,"EventType":"node_restart","NodeID":1,"StartedAt":1722281486767533153,"LastUp":1722281480078506522}

At this point, n2 and n3 have been decommissioned. Shortly after, we see:

E240729 19:31:30.454024 12148 server/auto_upgrade.go:66 ⋮ [T1,Vsystem,n1] 850  failed attempt to upgrade cluster version, error: no live nodes found

That "no live nodes" bit is suspect. It means that this must have returned false for n1 as well, which is unexpected:

https://github.com/cockroachdb/cockroach/blob/786c3c6d3a8ea9e5cb6338f4d4b399f89a67e713/pkg/server/auto_upgrade.go#L176
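A minimal sketch of the shape that scan takes (illustrative types and names only, not the actual auto_upgrade.go code): decommissioned nodes are skipped, and if NodeVitality also reports the restarted n1 as not live, the scan ends up with zero live nodes and bails out with the error seen in the log above.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus is a simplified stand-in for per-node vitality state;
// the fields are illustrative, not CockroachDB's actual types.
type NodeStatus struct {
	NodeID         int
	Decommissioned bool
	IsLive         bool // what NodeVitality would report for the node
}

// clusterVersionUpgradeCheck mimics the shape of the auto-upgrade scan:
// decommissioned nodes are ignored, and if no remaining node is live,
// the upgrade attempt fails with "no live nodes found".
func clusterVersionUpgradeCheck(nodes []NodeStatus) error {
	liveNodes := 0
	for _, n := range nodes {
		if n.Decommissioned || !n.IsLive {
			continue // decommissioned or non-live nodes don't count as live
		}
		liveNodes++
	}
	if liveNodes == 0 {
		return errors.New("no live nodes found")
	}
	return nil
}

func main() {
	// The situation described above: n2 and n3 are decommissioned, and
	// NodeVitality (for whatever reason) still reports the restarted n1
	// as not live, so the scan sees zero live nodes.
	nodes := []NodeStatus{
		{NodeID: 1, IsLive: false},
		{NodeID: 2, Decommissioned: true},
		{NodeID: 3, Decommissioned: true},
	}
	fmt.Println(clusterVersionUpgradeCheck(nodes)) // no live nodes found
}
```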

This then means that the verify phase of LoQ recovery considers every range (each of which has only one replica, located on n1) to have failed the health check:

https://github.com/cockroachdb/cockroach/blob/89306bbda0210de93e77230aeb0d7b0bdb3d09b6/pkg/kv/kvserver/loqrecovery/server.go#L580-L593

That's because isNodeLive is determined using the same logic that logged the "no live nodes" error above:

https://github.com/cockroachdb/cockroach/blob/89306bbda0210de93e77230aeb0d7b0bdb3d09b6/pkg/kv/kvserver/loqrecovery/server.go#L523-L530
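To make the failure mode concrete, here is a hedged sketch (illustrative types only, not the actual loqrecovery code) of a health check that requires a quorum of replicas on live nodes. If isNodeLive is backed by the same vitality state that produced "no live nodes found", it returns false even for n1, and every single-replica range fails, which matches the test output above.

```go
package main

import "fmt"

// rangeDesc approximates what the verify step needs to know about a
// range: its ID and which nodes hold a replica. Illustrative only.
type rangeDesc struct {
	RangeID  int
	Replicas []int // node IDs holding a replica
}

// rangeIsHealthy is a sketch of the health check: a range counts as
// healthy only if a quorum of its replicas sits on nodes the liveness
// source reports as live. After the recovery plan is applied, each
// range has a single replica on n1, so a quorum of 1 is needed.
func rangeIsHealthy(r rangeDesc, isNodeLive func(nodeID int) bool) bool {
	live := 0
	for _, n := range r.Replicas {
		if isNodeLive(n) {
			live++
		}
	}
	quorum := len(r.Replicas)/2 + 1
	return live >= quorum
}

func main() {
	// Hypothesized state: the liveness source reports no node as live,
	// including the restarted n1.
	isNodeLive := func(nodeID int) bool { return false }

	ranges := []rangeDesc{
		{RangeID: 1, Replicas: []int{1}},
		{RangeID: 2, Replicas: []int{1}},
	}
	for _, r := range ranges {
		fmt.Printf("r%d healthy: %t\n", r.RangeID, rangeIsHealthy(r, isNodeLive))
	}
}
```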

@andrewbaptist, given you're most familiar with this NodeVitality stuff, does something jump out to you?

andrewbaptist commented 1 month ago

I will take more of a look at this. Currently, node liveness is transferred over gossip. I'm wondering if there was some problem with n1 joining gossip after restart, or whether, possibly due to some recent gossip throttling, it doesn't have the liveness records or store descriptors for itself.
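A toy sketch of that hypothesis (the gossip keys and encoding here are made up, not how CockroachDB's gossip actually works): if a node's liveness record never makes it into the gossip state a reader consults, that node looks not-live even though its process is up and serving.

```go
package main

import "fmt"

// gossipStore is a toy gossip state: keys are gossip keys, values are
// opaque records. Purely illustrative of the hypothesized failure mode.
type gossipStore map[string][]byte

// livenessFromGossip derives a node's liveness from its gossiped record.
// If the record never made it into gossip (e.g. the node failed to rejoin
// gossip after restart, or the update was dropped by throttling), the node
// is treated as not live.
func livenessFromGossip(g gossipStore, nodeID int) bool {
	_, ok := g[fmt.Sprintf("liveness:%d", nodeID)]
	return ok
}

func main() {
	g := gossipStore{
		// Note: no "liveness:1" entry; n1's record is missing from gossip.
		"store-desc:2": []byte("..."),
	}
	fmt.Println("n1 live according to gossip:", livenessFromGossip(g, 1)) // false
}
```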

github-actions[bot] commented 1 week ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.