Closed — gregory112 closed this issue 1 year ago
Hello, I am Blathers. I am here to help you get the issue triaged.
Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
cc @cockroachdb/replication
@erikgrinaker @aliher1911 is this the type of problem that the retry logic we just implemented is aiming to fix?
No, this is unrelated. This is an assertion failure because a closed timestamp regression has made its way into the Raft log. This could be due to clock/hardware issues, or it could be due to bugs around closed timestamps and reproposals that @tbg has been hunting down.
In any case, to get past the error, you can set COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false
as described in the error message. Because the assertion failure is in the liveness range, this should be harmless (the only effects would be on rangefeeds or follower reads, which are not relevant here).
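As a rough sketch of applying the workaround, you can set the override in the environment of each cockroach process before starting it (the start flags below are illustrative placeholders, not the exact invocation for this cluster):

```shell
# Disable the closed-timestamp assertion for this cockroach process.
export COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false

# Then start the node as usual, e.g. (flags are illustrative):
#   cockroach start --join=node1,node2,node3 --store=/mnt/data1

# Verify the variable is set in the process environment.
echo "$COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED"
```

If the nodes run under systemd, the equivalent would be an Environment= line in the unit file rather than a shell export.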
Thanks, adding that environment variable lets all nodes run again. This is related to this issue that I also reported: https://github.com/cockroachdb/cockroach/issues/102316. Is it going to be harmless if I let all nodes run with this environment variable?
Once you've gotten the cluster back up and running, you should be able to restart the nodes without the envvar again. There were a few pending commands in the Raft log that violated the assertion; once those commands have been applied, the envvar is no longer needed.
Alright, thanks, but in my experience killing the cluster and starting from scratch with a full cluster restore usually leads to the same problem again after a while, so I think I'll leave it set.
Should this issue be closed or be left open? Anything to be done for this issue?
This, and the error in #102316, indicate problems with clock accuracy in this cluster. CockroachDB requires reasonably accurate clocks, with a worst-case clock skew well below 500 ms. Keeping this envvar set will only paper over the underlying clock issue, and may affect the correctness of rangefeeds if you use them (e.g. for changefeeds, follower reads, or internal system operations like propagation of settings or zone configurations).
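As a hedged sketch of verifying this on each node, you can compare the clock offset reported by your time daemon against CockroachDB's default 500 ms maximum offset. The offset value below is a placeholder; obtain the real measurement from your time-sync tooling (e.g. "chronyc tracking" with chrony, or "timedatectl show-timesync" with systemd-timesyncd):

```shell
# Placeholder measurement in milliseconds; replace with the offset
# reported by your time daemon on each node.
offset_ms=120

# CockroachDB's default --max-offset is 500 ms; the worst-case skew
# between any two nodes should stay well below that.
if [ "$offset_ms" -lt 500 ]; then
  echo "clock offset OK"
else
  echo "clock offset too large"
fi
```

Running a check like this on all three nodes would confirm whether clock skew is the root cause before relying on the assertion override.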
I'll close this out since this seems to be an infrastructure issue rather than a CockroachDB issue, but feel free to reopen if you find indications otherwise.
Describe the problem
Hi, I have a three-node CockroachDB cluster. For some reason, after another issue which I am still troubleshooting, all nodes go down. The nodes can no longer be recovered: every time a node comes up, it tries to contact the other nodes, fails, gets its circuit breaker tripped, and goes down again. This happens to all nodes.
To Reproduce
Restart a whole cluster.
Expected behavior
The node should wait for a while until the other nodes are up before giving up.
Additional data / screenshots
Environment:
Additional context
The cluster goes down completely.
Jira issue: CRDB-27423