cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

remote wall time is too far ahead #102316

Open gregory112 opened 1 year ago

gregory112 commented 1 year ago

Describe the problem

Previously I had a cluster of three nodes. Due to some errors whose cause I don't know, we decided to kill this cluster and do a full restore instead. It was working smoothly until, after a while, all nodes started throwing remote wall time is too far ahead errors.

The time difference is 6 hours and 24 minutes, which is strange. I have verified by running date commands on all nodes that their times are synced.

To Reproduce

I don't know how to reproduce this. Is it possible that the backup contains corrupted data?

Expected behavior

It works without this error.

Additional data / screenshots

Logs from one of the nodes:

E230426 09:12:26.788251 30657 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10269  pushing old intents failed: remote wall time is too far ahead (6h24m27.600111381s) to be trustworthy
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270  failed to gossip cluster ID: info not fresh
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +(1) attached stack trace
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  -- stack trace:
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | github.com/cockroachdb/cockroach/pkg/gossip.init
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  github.com/cockroachdb/cockroach/pkg/gossip/pkg/gossip/infostore.go:82
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6498
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.doInit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:6475
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.main
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/proc.go:238
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  | runtime.goexit
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +  |  GOROOT/src/runtime/asm_amd64.s:1581
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +Wraps: (2) info not fresh
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +Error types: (1) *withstack.withStack (2) *errutil.leafError
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +Wraps: (2) info not fresh
E230426 09:12:26.824771 234 kv/kvserver/pkg/kv/kvserver/replica_gossip.go:308 ⋮ [n1,s1,r1/1:‹/{Min-System/NodeL…}›] 10270 +Error types: (1) *withstack.withStack (2) *errutil.leafError
W230426 09:12:26.893138 250 kv/txn.go:728 ⋮ [n1,liveness-hb] 10271  failure aborting transaction: remote wall time is too far ahead (6h24m27.494949616s) to be trustworthy; abort caused by: remote wall time is too far ahead (6h24m27.49600019s) to be trustworthy
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272  failed node liveness heartbeat: remote wall time is too far ahead (6h24m27.49600019s) to be trustworthy
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +(1) forced error mark
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | ‹"remote wall time is too far ahead to be trustworthy"›
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +Wraps: (2)
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | (opaque error wrapper)
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | reportable 0:
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/util/hlc.(*Clock).UpdateAndCheckMaxOffset
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/util/hlc/hlc.go:469
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:85
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:191
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/server/node.go:1006
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/server/node.go:989
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/server/node.go:1058
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/roachpb/bazel-out/k8-opt/bin/pkg/roachpb/roachpb_go_proto_/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9493
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.ServerInterceptor.func1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:136
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1117
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func3
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:238
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1120
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.unaryInterceptor
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:72
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1120
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1.1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:207
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:205
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1120
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.chainUnaryInterceptors.func1
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1122
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  github.com/cockroachdb/cockroach/pkg/roachpb/bazel-out/k8-opt/bin/pkg/roachpb/roachpb_go_proto_/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9495
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.(*Server).processUnaryRPC
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1283
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.(*Server).handleStream
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:1620
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | google.golang.org/grpc.(*Server).serveStreams.func1.2
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  google.golang.org/grpc/external/org_golang_google_grpc/server.go:922
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  | runtime.goexit
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +  |  GOROOT/src/runtime/asm_amd64.s:1581
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +Wraps: (3) remote wall time is too far ahead (6h24m27.49600019s) to be trustworthy
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +Error types: (1) *markers.withMark (2) *errbase.opaqueWrapper (3) *errutil.leafError
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +An inability to maintain liveness will prevent a node from participating in a
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +cluster. If this problem persists, it may be a sign of resource starvation or
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +of network connectivity problems. For help troubleshooting, visit:
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +
W230426 09:12:26.893224 250 kv/kvserver/liveness/liveness.go:785 ⋮ [n1,liveness-hb] 10272 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
I230426 09:12:26.938039 30678 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10273  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.452529442s) to be trustworthy
I230426 09:12:26.938097 30675 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10274  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.45241217s) to be trustworthy
I230426 09:12:26.938321 30677 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10275  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.449918821s) to be trustworthy
I230426 09:12:26.938403 30676 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10276  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.449882564s) to be trustworthy
I230426 09:12:26.938444 30679 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10277  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.452541381s) to be trustworthy
W230426 09:12:26.978812 491 sql/stmtdiagnostics/statement_diagnostics.go:168 ⋮ [n1] 10278  error polling for statement diagnostics requests: ‹stmt-diag-poll›: remote wall time is too far ahead (6h24m27.409598631s) to be trustworthy
E230426 09:12:27.037730 30680 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10279  pushing old intents failed: remote wall time is too far ahead (6h24m27.350579056s) to be trustworthy
E230426 09:12:27.101753 30681 kv/kvserver/pkg/kv/kvserver/queue.go:1106 ⋮ [n1,replicaGC,s1,r311/1:‹/Table/2{18-20}›] 10280  remote wall time is too far ahead (6h24m27.286456377s) to be trustworthy
W230426 09:12:27.127073 30700 kv/txn.go:728 ⋮ [n1,s1,r311/1:‹/Table/2{18-20}›] 10281  failure aborting transaction: remote wall time is too far ahead (6h24m27.261218336s) to be trustworthy; abort caused by: remote wall time is too far ahead (6h24m27.262357172s) to be trustworthy
E230426 09:12:27.127234 30700 kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:427 ⋮ [n1,s1,r311/1:‹/Table/2{18-20}›] 10282  failed to heartbeat own liveness record: remote wall time is too far ahead (6h24m27.262357172s) to be trustworthy
I230426 09:12:27.138253 30621 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10283  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.252260964s) to be trustworthy
I230426 09:12:27.138519 30667 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10284  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.251919644s) to be trustworthy
I230426 09:12:27.138560 30665 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10285  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.249571988s) to be trustworthy
I230426 09:12:27.138628 30664 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10286  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.249613111s) to be trustworthy
I230426 09:12:27.138933 30666 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10287  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.251763586s) to be trustworthy
E230426 09:12:27.288268 30668 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10288  pushing old intents failed: remote wall time is too far ahead (6h24m27.100104785s) to be trustworthy
I230426 09:12:27.339982 30684 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10289  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.048640683s) to be trustworthy
I230426 09:12:27.339999 30685 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10290  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.048640453s) to be trustworthy
I230426 09:12:27.340030 30683 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10291  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.050934374s) to be trustworthy
I230426 09:12:27.340094 30686 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10292  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.050440963s) to be trustworthy
I230426 09:12:27.340359 30687 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10293  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m27.050096062s) to be trustworthy
E230426 09:12:27.538113 30670 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10294  pushing old intents failed: remote wall time is too far ahead (6h24m26.850093484s) to be trustworthy
I230426 09:12:27.539222 30671 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10295  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.848897641s) to be trustworthy
I230426 09:12:27.539423 30707 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10296  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.848784592s) to be trustworthy
I230426 09:12:27.539573 30673 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10297  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.850960526s) to be trustworthy
I230426 09:12:27.539678 30706 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10298  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.850982828s) to be trustworthy
I230426 09:12:27.539718 30672 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10299  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.850943038s) to be trustworthy
I230426 09:12:27.741711 30688 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10300  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.649131862s) to be trustworthy
I230426 09:12:27.741803 30724 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10301  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.649134837s) to be trustworthy
I230426 09:12:27.741834 30689 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10302  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.649098285s) to be trustworthy
I230426 09:12:27.742039 30723 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10303  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.64659159s) to be trustworthy
I230426 09:12:27.742077 30722 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10304  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.646591633s) to be trustworthy
E230426 09:12:27.788263 30702 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10305  pushing old intents failed: remote wall time is too far ahead (6h24m26.600239902s) to be trustworthy
I230426 09:12:27.941733 30624 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10306  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.446571302s) to be trustworthy
I230426 09:12:27.941786 30622 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10307  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.448908136s) to be trustworthy
I230426 09:12:27.941731 30625 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10309  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.446571378s) to be trustworthy
I230426 09:12:27.941800 30623 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10308  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.448911995s) to be trustworthy
I230426 09:12:27.941908 30738 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10310  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.448890128s) to be trustworthy
E230426 09:12:28.039669 30739 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10311  pushing old intents failed: remote wall time is too far ahead (6h24m26.349115438s) to be trustworthy
E230426 09:12:28.102692 30742 kv/kvserver/pkg/kv/kvserver/queue.go:1106 ⋮ [n1,replicaGC,s1,r26/1:‹/NamespaceTable/{30-Max}›] 10312  remote wall time is too far ahead (6h24m26.285489991s) to be trustworthy
I230426 09:12:28.140517 239 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:700 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10313  RangeFeed closed timestamp 1682500063.221306438,0 is behind by 4m44.919203435s
I230426 09:12:28.142218 30745 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10315  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.248482465s) to be trustworthy
I230426 09:12:28.142222 30744 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10314  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.248441143s) to be trustworthy
I230426 09:12:28.142307 30746 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10316  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.248430608s) to be trustworthy
I230426 09:12:28.142437 30743 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10317  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.245832244s) to be trustworthy
I230426 09:12:28.142466 30747 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10318  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.245843343s) to be trustworthy
E230426 09:12:28.288431 30749 kv/kvserver/rangefeed/task.go:263 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›,rangefeed] 10319  pushing old intents failed: remote wall time is too far ahead (6h24m26.10019051s) to be trustworthy
I230426 09:12:28.343309 30734 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r16/1:‹/Table/2{0-1}›] 10320  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.044807883s) to be trustworthy
I230426 09:12:28.343310 30712 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r6/1:‹/Table/{SystemCon…-11}›] 10321  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.044817861s) to be trustworthy
I230426 09:12:28.343362 30736 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r43/1:‹/Table/{47-50}›] 10322  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.047343653s) to be trustworthy
I230426 09:12:28.343857 30735 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r106/1:‹/Table/30{3-7}›] 10323  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.046799084s) to be trustworthy
I230426 09:12:28.343873 30711 kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:727 ⋮ [n1,s1,r42/1:‹/Table/4{6-7}›] 10324  RangeFeed failed to nudge: remote wall time is too far ahead (6h24m26.047027512s) to be trustworthy

Environment:

Additional context

One of the nodes seems to be randomly restarting with Docker container exited with non-zero exit code: 7.

Jira issue: CRDB-27391

blathers-crl[bot] commented 1 year ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery

dt commented 1 year ago

Just to clarify, when you say you restored from backup, do you mean you ran the RESTORE statement to restore a backup produced using BACKUP? Or were you using something else (volume snapshots or similar)?

gregory112 commented 1 year ago

Yes, I was using RESTORE to restore a full cluster backup, not an SQL statement dump or anything like that.

gregory112 commented 1 year ago

Now all of my queries fail with the remote wall time is too far ahead error.

dt commented 1 year ago

Perfect, thanks for clarifying. BACKUP and RESTORE only manipulate the row data within SQL tables, and don't capture or restore the system metadata about the nodes or their clocks, so I don't think we're looking at corrupted backup data or anything like that. Clock-sync checks are handled by our KV storage layer, so I'll forward this over to an expert on that area of the system.

gregory112 commented 1 year ago

Using 22.2 I still get this clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset), after which the node exits with an error.

andrewbaptist commented 1 year ago

Were you able to make any progress on this issue? Can you describe a little more about your environment?

It feels likely that this is something related to OS time issues, but I'm not sure. Can you run a few commands to help track this down?

Are these physical nodes that you are running Docker containers on, or are they VMs? Can you run docker exec -it <container-id> date on all the containers just to verify the times are all the same? Do you have NTP running on the host?

Thanks, and hopefully we can track it down.
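The per-container date check above amounts to a pairwise offset comparison against CockroachDB's default 500ms --max-offset. A minimal sketch of that comparison, with hypothetical timestamp samples (the kind you might collect with docker exec <container> date +%s.%N on each node):

```python
# Sketch: flag nodes whose reported wall time diverges beyond CockroachDB's
# default --max-offset of 500ms. The sample values below are hypothetical.

MAX_OFFSET_SECS = 0.5  # CockroachDB default --max-offset

def divergent_pairs(samples, max_offset=MAX_OFFSET_SECS):
    """Return (node_a, node_b, delta) for every pair further apart than max_offset."""
    names = sorted(samples)
    out = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            delta = abs(samples[a] - samples[b])
            if delta > max_offset:
                out.append((a, b, delta))
    return out

if __name__ == "__main__":
    # Hypothetical readings: n3 is roughly 6h24m ahead, as in the logs above.
    samples = {"n1": 1682500346.78, "n2": 1682500346.81, "n3": 1682523414.38}
    for a, b, delta in divergent_pairs(samples):
        print(f"{a} vs {b}: {delta:.1f}s apart")
```

Running date by hand only gives second-level granularity, so a check like this is a coarse sanity test rather than a substitute for NTP monitoring.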

gregory112 commented 1 year ago

Yes, they all yield exactly the same date. NTP is working on all nodes via systemd-timesyncd. However, after letting the cluster run for a while, all of the nodes start to go down. After that, none of them can be started again; they all fail with a breaker tripped error.

gregory112 commented 1 year ago

They are all Docker containers deployed with Nomad, across 3 nodes running Ubuntu 20.04. All nodes have systemd-timesyncd active and the dates are correct.

gregory112 commented 1 year ago

The only way to recover the cluster is to delete all data directories from all nodes and do a full cluster restore again. After a while, however, the error comes back and all nodes go down again.

gregory112 commented 1 year ago

I tried adding COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false as in https://github.com/cockroachdb/cockroach/issues/102401, and so far I have not seen any errors. Is this harmless? I have only tried this for about half an hour, with all applications running heavy SQL queries. I have not seen any problems other than overload issues like EOF or transaction retry errors. I will update this tomorrow if I see anything.

gregory112 commented 1 year ago

Hi, I still get this error: RangeFeed failed to nudge: remote wall time is too far ahead (6h14m13.930493241s) to be trustworthy. This does not seem like a time-sync issue. I have lowered the systemd-timesyncd poll interval to as little as 64 seconds. The logs show a difference of around 6 hours and I don't know why. This happens simultaneously on all three nodes.

knz commented 1 year ago

Can you check the time inside the containers? One way to do that is to tail the cockroach logs when the nodes are started simultaneously; the timestamps should be approximately the same. It might be that your container config is off.

gregory112 commented 1 year ago

They are the same, yes. I have tailed the logs on both.

gregory112 commented 1 year ago

If my hardware clock suddenly jumped 6 hours ahead or backward, and then got synced again by NTP, would the error go away or persist until the next restart? I have never seen the time jump that far.

knz commented 1 year ago

Here are the ways we know that jumps can be observed:

Note that a regular background ntpd would not correct a sudden jump of 6 hours (it would be too large); instead, this would need to be corrected by a one-off invocation of ntpdate.

Finally, to your question: in some cases (most commonly a computer going to sleep), CockroachDB will need some time to recover even after observing the right time; if I recall correctly, at least 10 minutes.
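The error in the logs comes from the hybrid logical clock refusing to adopt a remote timestamp beyond a tolerated offset of the local physical clock. A simplified sketch of that check (names and structure are illustrative, not CockroachDB's actual hlc package API):

```python
# Simplified sketch of the check behind "remote wall time is too far ahead":
# a node only ratchets its hybrid logical clock forward to a remote timestamp
# if that timestamp does not exceed the local physical clock by more than a
# tolerated offset. All names here are hypothetical.

class UntrustworthyRemoteWallTimeError(Exception):
    pass

def update_and_check(local_physical_secs, hlc_wall_secs, remote_wall_secs,
                     tolerated_offset_secs=0.5):
    """Return the new HLC wall time, or raise if the remote clock is too far ahead."""
    ahead = remote_wall_secs - local_physical_secs
    if ahead > tolerated_offset_secs:
        raise UntrustworthyRemoteWallTimeError(
            f"remote wall time is too far ahead ({ahead:.0f}s) to be trustworthy")
    return max(hlc_wall_secs, remote_wall_secs, local_physical_secs)
```

Under this model, a single node with a clock hours ahead poisons every request it touches: each peer that receives one of its timestamps rejects it, which is consistent with the errors appearing on all nodes at once.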

gregory112 commented 1 year ago

@knz Thank you for your response. I have three nodes (three virtual machines). Ultimately, I moved all CockroachDB nodes onto a single virtual machine, one that I suspect has no clock problems. So far I have not observed any time problems, as all nodes are on the same VM. I never saw a time jump on any of the VMs either, even though CockroachDB reported a time difference, which is confusing. I will try to find other ways to troubleshoot this and will update this issue as soon as I can.

gregory112 commented 1 year ago

I do still get this error. After setting up a Grafana dashboard to watch the metrics, I see the following: around 15:10 to 15:20 there is a spike in clock offset, but it indicates only a 332ms offset, which should be tolerable (under 500ms). Yet I then receive errors like remote wall time is too far ahead (3h39m ... in all SQL queries.

Screenshot from 2023-07-01 16-45-20

knz commented 1 year ago

15:10 to 15:20 there seems to be a spike in clock offset, but it indicates only 332 ms offset

It's possible the spike was larger but wasn't caught in the metrics (we have 10-second resolution).
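The effect of that 10-second resolution can be shown with a toy example: a spike that lasts less than one scrape interval never appears in the stored series, so the dashboard's 332ms maximum is a lower bound on the true offset.

```python
# Toy illustration: a one-second clock-offset spike that falls between two
# 10-second metric scrapes is invisible in the downsampled series.

def downsample(samples, every=10):
    """Keep every Nth sample, mimicking a coarse metrics scrape."""
    return samples[::every]

raw = [0.03] * 30   # 30 seconds of ~30ms offset readings
raw[14] = 6.2       # a brief 6.2s spike at t=14s, between scrapes

print(max(raw))              # the spike is real in the raw data
print(max(downsample(raw)))  # but absent at 10s resolution
```

This is why a sidecar script sampling the offset every second (or watching the cockroach clock-offset logs directly) can catch jumps the dashboard misses.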

gregory112 commented 3 months ago

Sorry for the late reply. I have been investigating this. I have redeployed CockroachDB in Kubernetes now instead of Nomad, using the Helm chart. The problem still persists, even in v23.2. I have tried running a script to detect time jumps, but so far I have not seen any. The VMs where CockroachDB runs have timesyncd enabled too, and even if there were a jump, it would not be as far as reported in the logs: error writing time series data: remote wall time is too far ahead (1h38m45.093886778s) to be trustworthy. Sometimes the number is even larger, as much as 9 hours.
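A jump detector of the kind described can compare wall-clock elapsed time against the monotonic clock, which clock steps cannot affect; a divergence between the two means the wall clock was stepped. A minimal sketch (interval and threshold are arbitrary choices, not CockroachDB settings):

```python
# Sketch: detect wall-clock steps by checking that wall time and the
# monotonic clock advance at the same rate.

import time

def drift_secs(wall0, mono0, wall1, mono1):
    """Wall-clock elapsed minus monotonic elapsed; ~0 unless the clock was stepped."""
    return (wall1 - wall0) - (mono1 - mono0)

def watch_for_jumps(interval_secs=1.0, threshold_secs=0.25):
    """Loop forever, logging whenever the wall clock is stepped."""
    wall0, mono0 = time.time(), time.monotonic()
    while True:
        time.sleep(interval_secs)
        d = drift_secs(wall0, mono0, time.time(), time.monotonic())
        if abs(d) > threshold_secs:
            print(f"wall clock stepped by ~{d:.3f}s relative to monotonic clock")
            wall0, mono0 = time.time(), time.monotonic()  # re-baseline
```

One caveat: run this inside the same container or VM as the cockroach process, since the guest's view of the clock (not the hypervisor host's) is what matters here.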

gregory112 commented 3 months ago

I also note that I get this error some time after restoring the database, even in a single-node cluster.