I looked at this a bit, and it seems like NodeVitality doesn't consider n1 live after it's brought back up, for some reason. We see that n1 is restarted after doing the recovery:
I240729 19:31:30.418892 13 1@util/log/event_log.go:44 ⋮ [T1,Vsystem,n1] 820 ={"Timestamp":1722281490418888782,"EventType":"node_restart","NodeID":1,"StartedAt":1722281486767533153,"LastUp":1722281480078506522}
At this point, n2 and n3 have been decommissioned. Shortly after, we see:
E240729 19:31:30.454024 12148 server/auto_upgrade.go:66 ⋮ [T1,Vsystem,n1] 850 failed attempt to upgrade cluster version, error: no live nodes found
That "no live nodes" bit is suspect. It means that this must have returned false for n1 as well, which is unexpected:
This then means that the verify phase of LoQ recovery would consider all ranges (which each have only one replica, on n1) to fail the health check:
Because isNodeLive is determined using the same logic that logged the "no live nodes" error above:
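Roughly, the failure mode I'm describing looks like this (again a hypothetical sketch, not the real verification code; rangeHealthy and isNodeLive here are stand-ins): if the only replica of each recovered range is on n1 and the liveness view says n1 is dead, every range fails the check:

```go
// Illustrative sketch only: a range whose sole surviving replica sits on a
// node that isNodeLive reports as dead cannot pass a majority health check,
// so every post-recovery range fails verification.
package main

import "fmt"

type rangeDesc struct {
	rangeID  int
	replicas []int // node IDs holding replicas; after LoQ recovery, just n1
}

// isNodeLive stands in for the same vitality-based liveness check that
// produced "no live nodes" in the auto-upgrade loop above.
func isNodeLive(nodeID int, liveSet map[int]bool) bool {
	return liveSet[nodeID]
}

// rangeHealthy requires a majority of replicas to be on live nodes.
func rangeHealthy(r rangeDesc, liveSet map[int]bool) bool {
	live := 0
	for _, n := range r.replicas {
		if isNodeLive(n, liveSet) {
			live++
		}
	}
	return live > len(r.replicas)/2
}

func main() {
	// n1 is up, but the liveness view still treats it as dead.
	liveSet := map[int]bool{1: false}
	ranges := []rangeDesc{{rangeID: 7, replicas: []int{1}}, {rangeID: 12, replicas: []int{1}}}
	for _, r := range ranges {
		fmt.Printf("r%d healthy=%t\n", r.rangeID, rangeHealthy(r, liveSet))
	}
}
```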
@andrewbaptist, given you're most familiar with this NodeVitality stuff, does something jump out to you?
I will take a closer look at this. Currently, node liveness is transferred over gossip. I'm wondering if there was some problem with n1 joining gossip after the restart, or if, possibly due to some recent gossip throttling, it doesn't have the liveness records or store descriptors for itself.
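To sketch why a missing gossip record would look exactly like a dead node (hypothetical types, not the real gossip or liveness API): a vitality map built only from the records a node has actually received over gossip simply has no entry for n1, so every consumer treats n1 as not live:

```go
// Sketch of the gossip dependency being described, with made-up types: if
// n1's own liveness record never makes it back into its gossip view after
// restart, any vitality map derived from that view has no entry for n1.
package main

import "fmt"

type livenessRecord struct {
	nodeID int
	epoch  int
}

// gossipView models the liveness records a node has actually received.
type gossipView struct {
	livenessRecords map[int]livenessRecord
}

// buildVitality derives liveness purely from gossiped records, so a missing
// record is indistinguishable from a dead node.
func buildVitality(g gossipView) map[int]bool {
	vitality := make(map[int]bool)
	for id := range g.livenessRecords {
		vitality[id] = true
	}
	return vitality
}

func main() {
	// Post-restart n1: suppose throttling or a join problem kept its own
	// liveness record out of its gossip view.
	g := gossipView{livenessRecords: map[int]livenessRecord{}}
	v := buildVitality(g)
	fmt.Println("n1 live according to vitality:", v[1]) // false: no record at all
}
```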
We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.
cli.TestHalfOnlineLossOfQuorumRecovery failed with artifacts on master @ d6381b7a0e2b18617c0a0b23db38e7103457a79e:
Help
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)
/cc @cockroachdb/kv @cockroachdb/server
This test on roachdash | Improve this report!
Jira issue: CRDB-40642