Open bdarnell opened 6 years ago
There's also the approach championed in https://www.youtube.com/watch?v=YqNGbvFHoKM: If you decide that your own clock is too far ahead (via our clock offset computation or the like), you stop using your wall clock. Instead, all you do is ratchet up the logical component. This sounds dangerous because now you're running the risk of overflowing it, but that risk might be largely theoretical: we use 32 bits for the logical component, and the other nodes will periodically give us new timestamps from their wall clocks.
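The ratchet idea might be sketched like this (a hypothetical illustration, not CockroachDB's actual clock code; `Clock`, `wallSuspect`, and `Update` are invented names):

```go
package main

import "fmt"

// Timestamp is a simplified HLC timestamp: 64-bit wall nanos plus a
// 32-bit logical counter, matching the layout described above.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Clock sketches the ratchet: once the node decides its own wall clock
// is too far ahead, it stops consulting it and only increments the
// logical component, waiting for timestamps from other nodes' wall
// clocks to move WallTime forward again.
type Clock struct {
	ts            Timestamp
	wallSuspect   bool // set when clock-offset computation flags our clock
	physicalClock func() int64
}

func (c *Clock) Now() Timestamp {
	if c.wallSuspect {
		c.ts.Logical++ // ratchet: logical only, wall frozen
		return c.ts
	}
	if phys := c.physicalClock(); phys > c.ts.WallTime {
		c.ts.WallTime = phys
		c.ts.Logical = 0
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update forwards the clock from a remote timestamp, which is how a
// node with a suspect wall clock eventually gets fresh wall times.
func (c *Clock) Update(remote Timestamp) {
	if remote.WallTime > c.ts.WallTime {
		c.ts = remote
	}
}

func main() {
	c := &Clock{physicalClock: func() int64 { return 100 }}
	fmt.Println(c.Now()) // {100 0}: wall clock trusted
	c.wallSuspect = true
	fmt.Println(c.Now()) // {100 1}: logical ratchets, wall frozen
	c.Update(Timestamp{WallTime: 200})
	fmt.Println(c.Now()) // {200 1}: remote wall time moved us forward
}
```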
I think this would obviate the need for persist_upper_bound.
@tschottdorf persist_upper_bound is useful for single node clusters where the clock jumps back while cockroach is down. After a restart the HLC loses its state. Roachtest validating this: https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/clock_monotonic.go#L113
@tschottdorf I don't understand how that obviates the need for persist_upper_bound. As @vijaykarthik-rubrik says, the point of persist_upper_bound is to protect against backwards clock jumps while the process is not running. I don't really see an alternative to persisting this value if you can't trust your system clock to be roughly monotonic.
You're both right, I was confused because what persist_upper_bound does is more of a persist_lower_bound for me: it's the lower bound for the clock on a restart. (There's a similar justification for the current name, so I don't think it should be renamed.)
In unrelated news, we should also be using the monotonic time signal in general. I.e., instead of
```go
func UnixNano() int64 {
	return timeutil.Now().UnixNano()
}
```
we should use
```go
// Note the trailing () -- UnixNano is the inner closure, captured once
// at package initialization.
var UnixNano = func() func() int64 {
	initNow := time.Now() // not timeutil.Now() -- that calls UTC() and removes the monotonic reading :-(
	initNanos := initNow.UnixNano()
	return func() int64 {
		// time.Now().Sub(initNow) uses the monotonic clock reading, so
		// later wall-clock jumps don't affect the delta.
		return initNanos + int64(time.Now().Sub(initNow))
	}
}()
```
That way, if the clock is right at startup (and if it isn't, we hopefully find out soon thereafter) any ntp shenanigans later won't impact it as much, with the only problem being the real clock drift.
It's a lot easier to control the clock only when the node starts. For example, deployments may run a basic sanity check against some external server to make sure the wallclock isn't completely off before starting a server. This is a lot easier than setting NTP up correctly.
That way, if the clock is right at startup (and if it isn't, we hopefully find out soon thereafter) any ntp shenanigans later won't impact it as much, with the only problem being the real clock drift.
The real clock drift can be significant. I don't think we could use this as is without a periodic resync ala the timestamp oracle proposal.
I wasn't suggesting that, but isn't it a lot better than not using the monotonic clock?
The clock can drift in both directions. The built-in monotonic time signal that we're discarding only helps with backwards movement, and it's only valid for comparisons within a single process, which makes it redundant with the monotonic ratchet we're already doing.
Ah, the monotonic clock reading isn't at all what I thought. Got it.
Ditto what Ben said. The monotonic clock doesn't protect against a clock that is too fast and you could get into a situation where one node in the cluster slowly advances faster than the others until it exceeds the max offset and is killed.
If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work? Is that basically what you were thinking for the "timestamp oracle"? It just doesn't seem that we should be using the wall time at all (except for bootstrap) since it has both clock drift and random jumps, and the monotonic clock only has clock drift which hopefully is bounded somewhat (i.e. won't drift more than a few ms per sec). Is there an ingredient here that I'm missing?
If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work?
My understanding is that Marzullo's algorithm provides an estimate of the actual time of a remote clock, adjusting for network latencies.
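For reference, a minimal sketch of the interval-intersection core of Marzullo's algorithm (illustrative only): each source reports an interval that should contain true time, and we find the smallest interval consistent with the largest number of sources.

```go
package main

import (
	"fmt"
	"sort"
)

// Interval is one source's estimate of true time: [Lo, Hi].
type Interval struct{ Lo, Hi float64 }

// marzullo returns the smallest interval consistent with the largest
// number of source intervals, along with how many sources agree. This
// sweep over interval endpoints is the heart of the algorithm.
func marzullo(sources []Interval) (best Interval, count int) {
	type edge struct {
		at   float64
		kind int // -1 = interval opens, +1 = interval closes
	}
	var edges []edge
	for _, s := range sources {
		edges = append(edges, edge{s.Lo, -1}, edge{s.Hi, +1})
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].at != edges[j].at {
			return edges[i].at < edges[j].at
		}
		return edges[i].kind < edges[j].kind // open before close at ties
	})
	inside := 0
	for i, e := range edges {
		inside -= e.kind // entering an interval increments, leaving decrements
		if inside > count {
			count = inside
			// The best overlap runs from this edge to the next one.
			best = Interval{Lo: e.at, Hi: edges[i+1].at}
		}
	}
	return best, count
}

func main() {
	// Three clock sources; all three overlap in [11, 12].
	est, n := marzullo([]Interval{{8, 12}, {11, 13}, {10, 12}})
	fmt.Println(est, n) // {11 12} 3
}
```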
Is that basically what you were thinking for the "timestamp oracle"?
I was thinking we'd compute local time as a delta from the last signal from the timestamp oracle where the delta is computed using the monotonic clock. The signal from the timestamp oracle could be adjusted using Marzullo's algorithm.
It just doesn't seem that we should be using the wall time at all (except for bootstrap) since it has both clock drift and random jumps, and the monotonic clock only has clock drift which hopefully is bounded somewhat (i.e. won't drift more than a few ms per sec). Is there an ingredient here that I'm missing?
Without adjustment, a drift of 1 ms per sec would add up to 500ms in a mere 8m20s. I'm not sure if we can use the monotonic clock without adding in some periodic adjustment. The random jumps shouldn't happen in properly configured NTP, though we're seeing how difficult that is to achieve in some environments.
If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work?
In other words, an unprivileged/userspace implementation of NTP? I think it could work, but I would have less faith in such a system than actually using NTP.
The signal from the timestamp oracle could be adjusted using Marzullo's algorithm.
I wasn't thinking of anything that clever. The point of the timestamp oracle IMHO is to take all the smarts out of figuring out the current time, substituting extra communication for any algorithmic smarts or local timekeeping.
In other words, an unprivileged/userspace implementation of NTP? I think it could work, but I would have less faith in such a system than actually using NTP.
Why? It's not like NTP is magical or poorly understood. The underlying algorithms are straightforward.
I wasn't thinking of anything that clever. The point of the timestamp oracle IMHO is to take all the smarts out of figuring out the current time, substituting extra communication for any algorithmic smarts or local timekeeping.
Adjusting the timestamp oracle signal using Marzullo's algorithm is definitely a second-order effect. It wouldn't be needed for single DC clusters, but would be important for multi-DC clusters where network latencies are more significant.
This is all assuming that we're talking about the same timestamp oracle idea: an oracle that sends a timestamp signal every few seconds. A timestamp oracle that is consulted on every request would be a different beast.
Why? It's not like NTP is magical or poorly understood. The underlying algorithms are straightforward.
NTP has had years of evolution. It's certainly not impossible to replicate, but it's one of those projects that is inevitably much larger than it initially appears.
This is all assuming that we're talking about the same timestamp oracle idea: an oracle that sends a timestamp signal every few seconds. A timestamp oracle that is consulted on every request would be a different beast.
Whenever I've said "timestamp oracle", I've meant the latter.
[Unrelated to monotonic clocks discussion]
Let's say we have a single node cluster with both server.clock.persist_upper_bound_interval and server.clock.forward_jump_check_enabled enabled. A common pattern of clock jumps that I have seen (especially on VMs) is that the clock jumps forward by several hours. Then after a few minutes, it gets corrected back to the right time by NTP. When the forward jump happens, the process will crash itself. When it comes back up, it has no good way to know that the clock has jumped forward and will start using the jumped timestamps, including persisting a forward-jumped value to max-upper-bound. Let's say the clock is corrected by NTP and the process is restarted again for some unrelated reason. When the cluster comes back up, it would have to remain unavailable until the physical time is past the last max-upper-bound, which can be a long time. This will also be the case if the process restarts after any large backward jump. To prevent the unavailability, can we just initialize the WallTime in hlc to the last max-upper-bound found in the persistent store? This would make the process act as if it never crashed after the corrective backward jump, and it could become available immediately.
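The suggestion could be sketched roughly like this (hypothetical code, not the actual hlc package): start WallTime at the persisted bound and let the logical component tick until the physical clock catches up.

```go
package main

import "fmt"

// hlcClock is a simplified HLC with an injected physical clock.
type hlcClock struct {
	wallTime int64
	logical  int32
	physical func() int64
}

// newFromPersistedBound initializes WallTime from the persisted
// max-upper-bound instead of waiting for the physical clock to pass it,
// so the node can serve immediately after a corrective backward jump.
func newFromPersistedBound(persisted int64, physical func() int64) *hlcClock {
	c := &hlcClock{physical: physical}
	if p := physical(); p > persisted {
		c.wallTime = p // clock moved forward normally while down
	} else {
		c.wallTime = persisted // act as if we never crashed
	}
	return c
}

func (c *hlcClock) now() (int64, int32) {
	if p := c.physical(); p > c.wallTime {
		c.wallTime, c.logical = p, 0
	} else {
		c.logical++ // physical clock still behind: tick the logical part
	}
	return c.wallTime, c.logical
}

func main() {
	clock := int64(800) // system clock after the corrective backward jump
	c := newFromPersistedBound(1500, func() int64 { return clock })
	fmt.Println(c.now()) // 1500 1: available immediately, logical ticks

	clock = 2000 // physical time catches up
	fmt.Println(c.now()) // 2000 0: back to normal operation
}
```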
The same reasoning applies to a multi-node cluster where all nodes have experienced similar jumps at similar times - although I don't know how common that scenario really is, unless you have multiple VMs on the same host.
For a multi-node cluster, where only one node experiences a forward jump, after it crashes, it would not be able to use a jumped timestamp because the clock offset check on the first heartbeat will cause it to crash. This is actually good, because we completely prevent usage of potentially bad timestamps, and we can rely on NTP to correct the glitch within several minutes.
That would introduce the opposite problem: If you assume that a system clock that is higher than the persisted time is a clock jump that should be rejected, it would be impossible for a node to recover from downtime longer than the max offset (without manual intervention). In general, forward clock movement is far more likely to reflect the passage of real time than an inappropriate jump.
I didn't follow that. Why would a system clock higher than the persisted time be rejected? WallTime will be initialized to the last persisted max-upper-bound, and if the physical clock has moved forward, WallTime will be updated to the new physical time on the first call to hlc.Now().
You said:
For a multi-node cluster, where only one node experiences a forward jump, after it crashes, it would not be able to use a jumped timestamp because the clock offset check on the first heartbeat will cause it to crash.
How would it distinguish this "jumped" timestamp from the normal passage of time while the process wasn't running?
In a multi-node setup, if one node experiences a jump, when it comes back up, it will check its clock offset relative to other nodes' clocks. I assume that the server waits for an initial heartbeat before it becomes available, and that the first heartbeat itself will cause it to crash due to skew.
The server waits for an initial heartbeat with each peer, but the cluster-wide offset estimation happens separately. There's no point at which that process is "done".
OK. I previously assumed that the initial heartbeat would do the clock offset estimation and check as well. But even if it does not, the consequence is that a high timestamp can get used, and also persisted as max-upper-bound. Even with that, initializing the hlc WallTime with the last max-upper-bound is just equivalent to the process never crashing after using the forward-jumped clock. Is the concern that after some sequence of jumps and restarts, this may lead to a situation where physical clocks are actually in sync but WallTime on some nodes is artificially higher than the rest?
Also, if node startup is not blocked by a clock offset check, how much value does server.clock.forward_jump_check_enabled provide? Even if it causes a process to crash after observing a forward jump, the next incarnation of the process can happily use the forward-jumped clock.
Even with that, initializing hlc WallTime with last max-upper-bound is just equivalent to the process never crashing after using the forward jumped clock.
We update the HLC clock with the current system time every time we use it, so changing the way we initialize it doesn't change anything.
Also, if node startup is not blocked by a clock offset check, how much value does server.clock.forward_jump_check_enabled provide?
@vijaykarthik-rubrik can answer that.
We update the HLC clock with the current system time every time we use it, so changing the way we initialize it doesn't change anything.
I'm looking at this code from util/hlc/hlc.go It seems that WallTime is not updated when the physical time is behind.
```go
if physicalClock := c.getPhysicalClockLocked(); c.mu.timestamp.WallTime >= physicalClock {
	// The wall time is ahead, so the logical clock ticks.
	c.mu.timestamp.Logical++
} else {
	// Use the physical clock, and reset the logical one.
	c.mu.timestamp.WallTime = physicalClock
	c.mu.timestamp.Logical = 0
}
```
I'm looking at this code from util/hlc/hlc.go It seems that WallTime is not updated when the physical time is behind.
True, but the case we're talking about here is when the system clock is ahead.
My intention with initializing the WallTime in hlc to the last max-upper-bound found in the persistent store was to prevent unavailability when the clock first jumps forward, causing a high max-upper-bound to be persisted, and then jumps backward and cockroach is restarted. The current behavior of the flag is to just wait until physical time goes past the last max-upper-bound.
In that case I'm not understanding what you're suggesting here. Let's spell out the sequence:
But it sounds like I've misunderstood something, so could you explain the suggestion again?
Let me spell out what I meant.
Scenario 1
Current behavior
Behavior with suggested change
Scenario 2
Current behavior
Behavior with suggested change
Similar to above, PID 2 will start WallTime at t10, opens for business, and increments the logical part until physical time catches up.
This is definitely helpful in single node clusters. In multi-node clusters, it is helpful when all nodes jump at the same time. If only one node jumps in a multi-node cluster, I was hoping that clock offset detection would keep the node out after a restart; I need to think a bit more about whether that's the case.
OK, I understand now. Concretely, then, you're suggesting that ensureClockMonotonicity be changed to force-update the HLC instead of sleeping. This puts us in roughly the same state we'd be in if the clock had jumped backwards while our process was running.
The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)
Remember that the features we're discussing here are designed to prevent corruption in the face of clock jumps, not to preserve availability with misbehaving clocks. The latter is not something we're prepared to support at this time.
With a 32 bit logical clock, you can perform clock reads once per 1 ms for about a month (why is the logical clock 32 bit and not 64 bit?)
HLC timestamps are written very frequently, so size matters. 32 bits for logical ticks are plenty as long as clocks are performing within our requirements.
The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)
Fair point. I think it would still be OK to do this on clusters that are going to be single node forever, but that seems to be enough of an edge case not to warrant this change.
The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)
On second thought, I no longer think this is a problem. We do not need to persist an advanced max-upper-bound if the physical clock is behind WallTime. We just need to persist an upper bound on the logical part of the clock (or some separate epoch) to make sure it advances across restarts. Then we will purely be limited by how far the logical clock can take us before it overflows.
If you prevent the HLC's wall time from advancing while in this state, then old leases won't expire, so you're back to waiting until real time catches up with the persisted time before you can have any reliable availability.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
I just abandoned another attempt at adjusting the defaults (#75157) because:
The discussion on the PR has the details.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
@vijaykarthik-rubrik added two new settings to perform additional validation of the clock behavior in #23717 and #23744. These happened so late in the 2.0 cycle that we decided to leave them off by default for that release. We should decide whether we want to change that for 2.1 and enable one or both of them by default.
- server.clock.forward_jump_check_enabled may experience false positives leading to crashes if run on a laptop or a VM that gets suspended and resumed.
- server.clock.persist_upper_bound_interval has a performance cost due to extra disk writes.

These settings are probably good ideas for production deployments but might cause problems or confusion in development environments. Whichever way we decide to set the defaults, we'll need documentation about changing them.
Jira issue: CRDB-5730