cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

error: remote wall time is too far ahead #42570

Closed arctica closed 4 years ago

arctica commented 4 years ago

Running CockroachDB 19.2 on OSX High Sierra.

Automatic clock synchronization is enabled in system settings.

Several times today my Go apps have hit an error like the one below:

panic: pq: remote wall time is too far ahead (1m37.893003s) to be trustworthy

When this happened, I verified that my OS clock is correct and, via the cockroach CLI (SELECT now();), that CockroachDB's time matches the system time. There is certainly no drift of over 1m.
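
For reference, this is roughly the check I ran (a minimal sketch, assuming an insecure local node on the default port 26257; adjust the flags to your setup):

# compare the OS clock with CockroachDB's notion of time
date -u
cockroach sql --insecure --host=localhost:26257 -e "SELECT now();"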

The Cluster UI didn't indicate anything abnormal; the "Clock Offset" graph shows values below 150µs.

I think this happens relatively soon after I restart the CockroachDB process, but I can't reliably reproduce it.

What logs or other things would be helpful to look further into the issue?

ricardocrdb commented 4 years ago

Hey @arctica

The cockroach.log from the node should show log messages indicating clock skew. A debug zip can be generated that collects the logs from all the nodes, which can then be searched. Let me know if you want to upload the debug zip, and we can take a look.
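
For example, something along these lines should produce a zip whose logs you can search for the clock messages (a sketch, assuming an insecure node reachable on localhost:26257; adjust the flags to your setup):

cockroach debug zip ./cockroach-debug.zip --insecure --host=localhost:26257
unzip cockroach-debug.zip -d cockroach-debug
grep -R -E "wall time|clock offset" cockroach-debug/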

arctica commented 4 years ago

I found this in the .log file:

W191120 16:03:19.083619 76263 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
W191120 17:03:19.687889 129196 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
I191121 04:03:19.273541 95 server/status/runtime.go:498  [n1] runtime stats: 221 MiB RSS, 146 goroutines, 128 MiB/117 MiB/191 MiB GO alloc/idle/total, 109 MiB/134 MiB CGO alloc/total, 32.1 CGO/sec, 1.3/0.8 %(u/s)time, 0.0 %gc (0x), 120 KiB/132 KiB (r/w)net
I191121 09:03:19.738617 95 server/status/runtime.go:498  [n1] runtime stats: 234 MiB RSS, 146 goroutines, 158 MiB/89 MiB/195 MiB GO alloc/idle/total, 117 MiB/145 MiB CGO alloc/total, 26.8 CGO/sec, 1.0/0.7 %(u/s)time, 0.0 %gc (0x), 13 KiB/13 KiB (r/w)net
W191121 17:03:19.436769 543023 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
W191122 03:19:31.251676 93 storage/closedts/provider/provider.go:150  [ct-closer] unable to move closed timestamp forward: not live
W191122 03:19:32.761497 592730 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
I191122 03:19:33.534908 91 gossip/gossip.go:1531  [n1] node has connected to cluster via gossip
W191122 03:19:33.764540 592730 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191122 03:19:34.315066 592735 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
W191122 03:19:35.320362 592735 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191122 03:19:36.329507 592686 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup MacBookAir: no such host". Reconnecting...
W191122 03:19:37.331948 592686 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {MacBookAir:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I191122 03:19:37.590033 95 server/status/runtime.go:498  [n1] runtime stats: 98 MiB RSS, 138 goroutines, 161 MiB/86 MiB/208 MiB GO alloc/idle/total, 86 MiB/109 MiB CGO alloc/total, 0.1 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (0x), 28 KiB/20 KiB (r/w)net
W191122 03:19:01.033173 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (38.375096s) to be trustworthy; abort caused by: remote wall time is too far ahead (38.37541s) to be trustworthy
W191122 03:19:01.033241 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (38.37541s) to be trustworthy
E191122 03:19:01.691146 593572 server/status.go:977  [n1,status] remote wall time is too far ahead (37.717039s) to be trustworthy
E191122 03:19:03.703927 593519 server/status.go:977  [n1,status] remote wall time is too far ahead (35.704287s) to be trustworthy
W191122 03:19:04.562114 96 ts/db.go:194  [n1,ts-poll] error writing time series data: remote wall time is too far ahead (34.8468s) to be trustworthy
W191122 03:19:04.563891 207 server/node.go:797  [n1,summaries] error recording status summaries: remote wall time is too far ahead (34.844289s) to be trustworthy
W191122 03:19:05.531352 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (33.876895s) to be trustworthy; abort caused by: remote wall time is too far ahead (33.877139s) to be trustworthy
W191122 03:19:05.531392 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (33.877139s) to be trustworthy
E191122 03:19:05.712839 593606 server/status.go:977  [n1,status] remote wall time is too far ahead (33.695369s) to be trustworthy
I191122 03:19:06.927772 95 server/status/runtime.go:498  [n1] runtime stats: 165 MiB RSS, 150 goroutines, 126 MiB/116 MiB/211 MiB GO alloc/idle/total, 86 MiB/109 MiB CGO alloc/total, 12.6 CGO/sec, 0.7/0.6 %(u/s)time, 0.0 %gc (0x), 345 KiB/38 KiB (r/w)net
E191122 03:19:07.722356 593646 server/status.go:977  [n1,status] remote wall time is too far ahead (31.685845s) to be trustworthy
E191122 03:19:07.802748 210 jobs/registry.go:326  error while adopting jobs: adopt-job: remote wall time is too far ahead (31.605631s) to be trustworthy
W191122 03:19:07.999389 592949 storage/replica_write.go:200  [n1,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
W191122 03:19:07.999411 592948 storage/replica_write.go:200  [n1,gc,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command GC [/System/NodeLiveness,/System/NodeLivenessMax).
E191122 03:19:09.733654 593662 server/status.go:977  [n1,status] remote wall time is too far ahead (29.674544s) to be trustworthy
W191122 03:19:10.031025 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (29.377214s) to be trustworthy; abort caused by: remote wall time is too far ahead (29.37744s) to be trustworthy
W191122 03:19:10.031095 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (29.37744s) to be trustworthy
E191122 03:19:11.743326 593607 server/status.go:977  [n1,status] remote wall time is too far ahead (27.664862s) to be trustworthy
E191122 03:19:13.761106 593720 server/status.go:977  [n1,status] remote wall time is too far ahead (25.647092s) to be trustworthy
W191122 03:19:14.531190 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (24.877029s) to be trustworthy; abort caused by: remote wall time is too far ahead (24.877279s) to be trustworthy
W191122 03:19:14.531241 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (24.877279s) to be trustworthy
W191122 03:19:14.560137 207 server/node.go:797  [n1,summaries] error recording status summaries: remote wall time is too far ahead (24.848071s) to be trustworthy
W191122 03:19:14.565785 96 ts/db.go:194  [n1,ts-poll] error writing time series data: remote wall time is too far ahead (24.843816s) to be trustworthy
E191122 03:19:15.769354 593711 server/status.go:977  [n1,status] remote wall time is too far ahead (23.638845s) to be trustworthy
I191122 03:19:16.938554 95 server/status/runtime.go:498  [n1] runtime stats: 166 MiB RSS, 150 goroutines, 132 MiB/111 MiB/211 MiB GO alloc/idle/total, 86 MiB/109 MiB CGO alloc/total, 12.6 CGO/sec, 0.8/0.7 %(u/s)time, 0.0 %gc (0x), 89 KiB/28 KiB (r/w)net
E191122 03:19:17.783343 593696 server/status.go:977  [n1,status] remote wall time is too far ahead (21.624879s) to be trustworthy
W191122 03:19:19.035431 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (20.372857s) to be trustworthy; abort caused by: remote wall time is too far ahead (20.373234s) to be trustworthy
W191122 03:19:19.035496 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (20.373234s) to be trustworthy
E191122 03:19:19.801475 593798 server/status.go:977  [n1,status] remote wall time is too far ahead (19.606715s) to be trustworthy
E191122 03:19:21.824184 593791 server/status.go:977  [n1,status] remote wall time is too far ahead (17.584129s) to be trustworthy
W191122 03:19:23.531320 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (15.876926s) to be trustworthy; abort caused by: remote wall time is too far ahead (15.877091s) to be trustworthy
W191122 03:19:23.531353 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (15.877091s) to be trustworthy
E191122 03:19:23.824620 593793 server/status.go:977  [n1,status] remote wall time is too far ahead (15.583571s) to be trustworthy
W191122 03:19:24.559547 96 ts/db.go:194  [n1,ts-poll] error writing time series data: remote wall time is too far ahead (14.849608s) to be trustworthy
W191122 03:19:24.561796 207 server/node.go:797  [n1,summaries] error recording status summaries: remote wall time is too far ahead (14.846386s) to be trustworthy
E191122 03:19:25.835882 593875 server/status.go:977  [n1,status] remote wall time is too far ahead (13.572337s) to be trustworthy
I191122 03:19:26.940708 95 server/status/runtime.go:498  [n1] runtime stats: 168 MiB RSS, 150 goroutines, 138 MiB/106 MiB/211 MiB GO alloc/idle/total, 86 MiB/109 MiB CGO alloc/total, 12.6 CGO/sec, 0.7/0.8 %(u/s)time, 0.0 %gc (0x), 98 KiB/28 KiB (r/w)net
E191122 03:19:27.846023 593903 server/status.go:977  [n1,status] remote wall time is too far ahead (11.562188s) to be trustworthy
W191122 03:19:28.035707 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (11.372546s) to be trustworthy; abort caused by: remote wall time is too far ahead (11.372826s) to be trustworthy
W191122 03:19:28.035769 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (11.372826s) to be trustworthy
E191122 03:19:29.182660 593908 storage/queue.go:1032  [n1,merge,s1,r171/1:/System/tsd/cr.node.s{ql…-ys…}] remote wall time is too far ahead (10.225604s) to be trustworthy
E191122 03:19:29.859070 593905 server/status.go:977  [n1,status] remote wall time is too far ahead (9.549125s) to be trustworthy
E191122 03:19:31.877702 593933 server/status.go:977  [n1,status] remote wall time is too far ahead (7.530494s) to be trustworthy
W191122 03:19:32.531405 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (6.876813s) to be trustworthy; abort caused by: remote wall time is too far ahead (6.877033s) to be trustworthy
W191122 03:19:32.531449 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (6.877033s) to be trustworthy
E191122 03:19:33.884171 593954 server/status.go:977  [n1,status] remote wall time is too far ahead (5.524017s) to be trustworthy
W191122 03:19:34.562057 207 server/node.go:797  [n1,summaries] error recording status summaries: remote wall time is too far ahead (4.846164s) to be trustworthy
W191122 03:19:34.568271 96 ts/db.go:194  [n1,ts-poll] error writing time series data: remote wall time is too far ahead (4.842313s) to be trustworthy
E191122 03:19:35.891039 593978 server/status.go:977  [n1,status] remote wall time is too far ahead (3.517147s) to be trustworthy
I191122 03:19:36.951219 95 server/status/runtime.go:498  [n1] runtime stats: 169 MiB RSS, 149 goroutines, 143 MiB/101 MiB/211 MiB GO alloc/idle/total, 86 MiB/109 MiB CGO alloc/total, 12.6 CGO/sec, 0.8/0.8 %(u/s)time, 0.0 %gc (0x), 31 KiB/28 KiB (r/w)net
W191122 03:19:37.036414 181 internal/client/txn.go:520  [n1,liveness-hb] failure aborting transaction: remote wall time is too far ahead (2.371869s) to be trustworthy; abort caused by: remote wall time is too far ahead (2.372334s) to be trustworthy
W191122 03:19:37.036477 181 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: remote wall time is too far ahead (2.372334s) to be trustworthy
W191122 03:19:37.583687 16 util/hlc/hlc.go:313  remote wall time is too far ahead (1.824497s) to be trustworthy - updating anyway
W191122 03:19:37.584645 16 storage/store_raft.go:514  [n1,s1,r2/1:/System/NodeLiveness{-Max}] handle raft ready: 89.6s [applied=1, batches=1, state_assertions=1]
I191122 03:19:37.584859 592949 storage/replica_write.go:219  [n1,s1,r2/1:/System/NodeLiveness{-Max}] slow command RequestLease [/System/NodeLiveness,/Min) finished after 89.59s with error <nil>
W191122 03:19:37.586086 16 util/hlc/hlc.go:313  remote wall time is too far ahead (1.822081s) to be trustworthy - updating anyway
I191122 03:19:37.586342 592948 storage/replica_write.go:219  [n1,gc,s1,r2/1:/System/NodeLiveness{-Max}] slow command GC [/System/NodeLiveness,/System/NodeLivenessMax) finished after 89.59s with error result is ambiguous (context deadline exceeded)
E191122 03:19:37.587983 592948 storage/queue.go:1032  [n1,gc,s1,r2/1:/System/NodeLiveness{-Max}] operation "gc queue process replica 2" timed out after 1m0s: failed to set GC thresholds: result is ambiguous (context deadline exceeded)
W191122 03:19:37.590685 110 util/hlc/hlc.go:313  remote wall time is too far ahead (1.817492s) to be trustworthy - updating anyway
W191122 03:19:37.592618 125 util/hlc/hlc.go:313  remote wall time is too far ahead (1.815557s) to be trustworthy - updating anyway
E191122 03:19:37.806211 210 jobs/registry.go:326  error while adopting jobs: adopt-job: remote wall time is too far ahead (1.602104s) to be trustworthy
E191122 03:19:37.907181 593915 server/status.go:977  [n1,status] remote wall time is too far ahead (1.501046s) to be trustworthy
I191122 03:19:44.530914 62 gossip/gossip.go:566  [n1] gossip status (ok, 1 node)

Two interesting points stick out to me:

  1. First, the connection errors where it can't resolve the local hostname for some reason. This might happen after the laptop comes back from sleep? I'm not sure and can't experiment right now.
  2. The clock skew slowly decreases until everything is OK again. I guess the clock prevents bigger jumps.

Please note this is a single local development node of CockroachDB; there are no other nodes in the cluster.

ricardocrdb commented 4 years ago

Hey @arctica

Thanks for the logging and the additional background. I tried to replicate the sleep connection issue on OSX Catalina and was unable to after putting my MacBook Pro to sleep. It looks like my machine just picks up where it left off.

What hostname are you using for --listen-addr in your cockroach start command? Using localhost when running a single node could prevent this from occurring.

Let me know what you find, or if you have any other information.

arctica commented 4 years ago

@ricardocrdb Hi, I also tried to replicate it by putting my MacBook to sleep but couldn't trigger the issue yet, so maybe it's not related to hibernation. I'll see what the reason might be whenever it happens again.

As to your question: I started it without --listen-addr, so it took the hostname of the machine. I also saw there is a start-single-node command, which might also prevent this issue. I'll keep an eye on this and report back whenever I have more info.
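
For the record, this is roughly what I have in mind for next time (a sketch of a single-node dev setup; the store path and ports are just examples):

cockroach start-single-node \
  --insecure \
  --listen-addr=localhost:26257 \
  --http-addr=localhost:8080 \
  --store=cockroach-data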

arctica commented 4 years ago

I think I've got a lead on the issue. I turned off automatic time sync in OSX and checked via ntpdate -q time.apple.com that the clock drifts quite far over time (usually after also putting the laptop to sleep). The offset climbs to a bit over 100s relatively quickly and then stays around that value. I've verified that it takes a couple of hours to get to that level and that it stays there for several days. That alone is a bit strange, but OK. Note the timestamps below:

24 Nov 10:50:53 ntpdate[5339]: adjust time server 17.253.82.125 offset 0.006451 sec
24 Nov 13:00:52 ntpdate[7188]: step time server 17.253.14.253 offset -102.568760 sec
25 Nov 08:56:47 ntpdate[14600]: step time server 17.253.82.125 offset -102.317288 sec
27 Nov 20:02:39 ntpdate[49206]: step time server 17.253.82.125 offset -104.335864 sec

All this time, queries were running fine. Then I re-enabled time synchronization in OSX and verified that it did indeed adjust the clock:

27 Nov 20:07:57 ntpdate[49350]: adjust time server 17.253.82.125 offset 0.013666 sec

At this point queries started to fail. I kept running a simple query to verify things would get back to normal:

root@:26257/x> select * from t;
pq: remote wall time is too far ahead (1m9.507777s) to be trustworthy
root@:26257/x> select * from t;
pq: remote wall time is too far ahead (41.577621s) to be trustworthy
root@:26257/x> select * from t;
pq: remote wall time is too far ahead (17.224311s) to be trustworthy
root@:26257/x> select * from t;
pq: remote wall time is too far ahead (6.929403s) to be trustworthy
root@:26257/x> select * from t;
pq: remote wall time is too far ahead (1.760538s) to be trustworthy
root@:26257/x> select * from t;
...
(1 row)

So the issue clearly arises when the OS clock makes a jump. Under normal conditions that shouldn't happen, since time should be kept synchronized, but laptops go to sleep and can drift too far during the window where no synchronization can happen. Theoretically this could also be an issue for devices with intermittent network connections. I'm not sure there's even a sensible way to work around this.
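
As a crude illustration of that recovery behaviour, one can simply poll the node until the skew has been absorbed (a hypothetical sketch, assuming an insecure local node on the default port; not an official workaround):

# keep retrying a trivial query until the node accepts it again
until cockroach sql --insecure --host=localhost:26257 -e "SELECT 1;" >/dev/null 2>&1; do
  echo "clock still catching up..."
  sleep 5
done
echo "queries work again"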

andreimatei commented 4 years ago

Seems like there's nothing to do here, is there?

arctica commented 4 years ago

@andreimatei I'm not sure whether something can be done. It was definitely confusing to me and took some time to figure out what was going on; it wasn't clear what "remote wall time" refers to. The issue was not in a cluster but in a local single-node Cockroach installation. I'm not sure whether Cockroach should handle jumps in the local wall clock more gracefully, or at least give a more helpful error message. I assume the clock in Cockroach does not follow the same jump as the OS clock but instead smooths it out.