Open gregory112 opened 1 year ago
cc @cockroachdb/disaster-recovery
Just to clarify, when you say you restored from backup, do you mean you ran the RESTORE statement to restore a backup produced using BACKUP? Or were you using something else (volume snapshots or similar)?
Yes, I was using RESTORE to restore a full cluster backup, not an SQL statement dump or anything like that.
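For reference, the flow was roughly the following full-cluster backup and restore; the storage URL and connection flags here are placeholders, not the exact commands from my deployment:

    # On the old cluster: take a full-cluster backup (placeholder collection URL).
    cockroach sql --insecure -e "BACKUP INTO 's3://my-bucket/crdb-backups?AUTH=implicit';"
    # On the new, empty cluster: restore the latest full-cluster backup.
    cockroach sql --insecure -e "RESTORE FROM LATEST IN 's3://my-bucket/crdb-backups?AUTH=implicit';"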
Now all of my queries fail with a remote wall time is too far ahead error.
Perfect, thanks for clarifying. BACKUP and RESTORE only manipulate the row data within SQL tables, and don't capture or restore system metadata about the nodes or their clocks, so I don't think we're looking at a backup containing corrupted data or anything like that. Clock-sync checks are handled by our KV storage layer, so I'll forward this over to an expert on that area of the system.
Using 22.2 I still get this clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset), after which the node exits with an error.
Were you able to make any progress on this issue? Can you describe a little more about your environment?
It feels likely that this is something related to OS time issues, but I'm not sure. Can you run a few commands to help track this down?
Are these physical nodes that you are running Docker containers on, or are they VMs? Can you run docker exec -it <container-id> date
on all the containers, just to verify they all report the same time? Do you have NTP running on the host?
Thanks, and hopefully we can track it down.
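Something along these lines would work; the container names are placeholders:

    # Print the current time inside each CockroachDB container (names are placeholders).
    for c in crdb-node-1 crdb-node-2 crdb-node-3; do
      echo "== $c =="
      docker exec "$c" date -u
    done
    # On each host, confirm NTP synchronization is active.
    timedatectl status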
Yes, they all yield exactly the same date. NTP is working on all nodes via systemd-timesyncd. However, after letting it run for a while, all of the nodes start to go down. Once all nodes are down, none of them can be started again; they all exit with a breaker tripped error.
They are all Docker containers deployed with Nomad, across 3 nodes running Ubuntu 20.04. All nodes have systemd-timesyncd active and the dates are correct.
The only way to recover the cluster is to delete all data directories from all nodes and do a full cluster restore instead. After a while, however, the error comes back and all nodes go down again.
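For the record, this is roughly how I checked time sync on each Ubuntu host; the host names are placeholders:

    # Shows the NTP server, poll interval, and current offset reported by systemd-timesyncd.
    for h in host-1 host-2 host-3; do
      echo "== $h =="
      ssh "$h" timedatectl timesync-status
    done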
I tried adding COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false
like in https://github.com/cockroachdb/cockroach/issues/102401 and so far I have not seen any errors. Is this harmless? I have only tried this for about half an hour now, with all applications running heavy SQL queries. I have not seen any problems other than overload issues like EOF or transaction retry errors. I will update this tomorrow if I see anything.
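For anyone else trying this: I pass the environment variable through the Nomad job spec, so the standalone docker run below is only a sketch of the idea (image tag and flags are placeholders):

    # Sketch: pass the env var to a manually started single-node container.
    docker run -d --name crdb \
      -e COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false \
      cockroachdb/cockroach:v22.2.10 start-single-node --insecure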
Hi, I still get this error: RangeFeed failed to nudge: remote wall time is too far ahead (6h14m13.930493241s) to be trustworthy. This does not seem like a time issue. I have set the poll interval for systemd-timesyncd as low as 64 seconds. The logs show roughly a 6-hour difference and I don't know why it happens. This happens simultaneously on all three nodes.
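For reference, this is roughly how I tightened the timesyncd polling; the NTP server is a placeholder:

    # Sketch: tighten the systemd-timesyncd poll interval (NTP server is a placeholder).
    printf '%s\n' '[Time]' 'NTP=pool.ntp.org' 'PollIntervalMinSec=32' 'PollIntervalMaxSec=64' \
      | sudo tee /etc/systemd/timesyncd.conf
    sudo systemctl restart systemd-timesyncd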
Can you check the time inside the containers? One way to do that would be to tail the cockroach logs when the nodes are started simultaneously; the timestamps should be approximately the same. It might be that your container config is off.
They are the same, yes; I have tailed both logs.
If my hardware clock suddenly jumps 6 hours ahead or backward and then gets synced again by NTP, would the error go away, or would it persist until the next restart? I ask because I have never seen the time jump that far.
Here are the ways we know that jumps can be observed:
Note that a regular background ntpd would not correct a sudden jump of 6 hours (it would be too large); instead, this would need to be corrected by a one-off invocation of ntpdate.
Finally, to your question: in some cases (most commonly a computer going to sleep), CockroachDB will need some time to recover even after observing the right time; if I recall correctly, at least 10 minutes.
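As a concrete example of the one-off correction mentioned above (the NTP server is a placeholder, and ntpdate may need to be installed separately):

    # Stop the background sync daemon, step the clock once, then re-enable syncing.
    sudo systemctl stop systemd-timesyncd
    sudo ntpdate -b pool.ntp.org    # -b forces the clock to be stepped rather than slewed
    sudo systemctl start systemd-timesyncd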
@knz Thank you for your response. I have three nodes (three virtual machines). Ultimately, I moved all CockroachDB nodes into a single virtual machine, one that I suspect has no problem with its clock. So far I have not observed any problems with the time, since all nodes are in the same VM. I have not seen any time jump in any of the VMs either, even though CockroachDB reported a time difference, which is confusing. I will try to find other ways to troubleshoot this and will update this issue as soon as I can.
I do still get this error, and after setting up a Grafana dashboard to watch the metrics, I see the following. Around 15:10 to 15:20 there seems to be a spike in clock offset, but it indicates only a 332 ms offset, which should be tolerable (under 500 ms). Yet I then receive errors like remote wall time is too far ahead (3h39m ...) in all SQL queries.
15:10 to 15:20 there seems to be a spike in clock offset, but it indicates only 332 ms offset
It's possible the spike was larger but wasn't caught in the metrics (we have 10-second resolution).
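If you want finer-grained visibility than the 10-second samples, one option is to poll the node's Prometheus endpoint directly; the port and metric names below assume a default configuration, so treat them as assumptions:

    # Poll the clock-offset metrics (in nanoseconds) from the node's HTTP endpoint every second.
    while true; do
      curl -s http://localhost:8080/_status/vars | grep -E '^clock_offset_(mean|stddev)nanos'
      sleep 1
    done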
Sorry for the late reply. I have been investigating this. I have redeployed CockroachDB on Kubernetes now instead of Nomad, using the Helm chart. The problem still persists, even in v23.2. I have tried running a script to detect whether there are time jumps, but no, so far I have not seen any. The VMs where CockroachDB is running have timesyncd enabled too, and even if there was a jump, it was not very far, nothing like what is reported in the logs: error writing time series data: remote wall time is too far ahead (1h38m45.093886778s) to be trustworthy. Sometimes the number goes well beyond 1 hour, even as much as 9 hours.
I also note that I still get this error some time after restoring the database, even in a single-node cluster.
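This is roughly the kind of script I used to watch for wall-clock jumps; it compares elapsed wall time against the monotonic clock and flags any large divergence. It's a sketch, not the exact script from my setup:

    #!/usr/bin/env bash
    # Sketch: flag wall-clock jumps by comparing elapsed wall time to elapsed monotonic time.
    prev_wall=$(date +%s)
    prev_mono=$(awk '{print int($1)}' /proc/uptime)
    while sleep 5; do
      wall=$(date +%s)
      mono=$(awk '{print int($1)}' /proc/uptime)
      drift=$(( (wall - prev_wall) - (mono - prev_mono) ))
      if [ "${drift#-}" -gt 2 ]; then
        echo "$(date -u) wall clock jumped by about ${drift}s"
      fi
      prev_wall=$wall
      prev_mono=$mono
    done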
Describe the problem
Previously I had a cluster of three nodes. Due to some errors whose cause I don't know, we decided to kill this cluster and do a full restore instead. It was working smoothly until, after a while, all nodes started throwing remote wall time is too far ahead errors. The time difference is 6 hours and 24 minutes, which is weird. I have verified by running date commands on all nodes that the times are in sync.
To Reproduce
I don't know how to reproduce this. Is it possible that the backup contains corrupted data?
Expected behavior
It works without this error.
Additional data / screenshots
Logs from one of the nodes:
Environment:
Additional context
One of the nodes seems to be randomly restarting with Docker container exited with non-zero exit code: 7.
Jira issue: CRDB-27391