cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

util/hlc: Consider enabling extra clock protections by default #25109

Open bdarnell opened 6 years ago

bdarnell commented 6 years ago

@vijaykarthik-rubrik added two new settings to perform additional validation of the clock behavior in #23717 and #23744. These happened so late in the 2.0 cycle that we decided to leave them off by default for that release. We should decide whether we want to change that for 2.1 and enable one or both of them by default.

These settings are probably good ideas for production deployments but might cause problems or confusion in development environments. Whichever way we decide to set the defaults, we'll need documentation about changing them.

Jira issue: CRDB-5730

tbg commented 6 years ago

There's also the approach championed in https://www.youtube.com/watch?v=YqNGbvFHoKM: If you decide that your own clock is too far ahead (via our clock offset computation or the like) you stop using your wallclock. Instead, all you do is ratchet up the logical. This sounds dangerous because now you're running the risk of overflowing that, but that might be largely theoretical. We use 32 bits for the logical component and the other nodes will periodically give us new timestamps from their wall clocks. I think this would obviate the need for persist_upper_bound.
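
As a rough illustration of that idea (a minimal sketch, not the actual util/hlc implementation; the type and field names here are made up): once the node decides its wall clock is suspect, it stops adopting new physical readings and only ticks the logical component, relying on timestamps received from other nodes to move the wall component forward again.

    // Sketch only: when our wall clock is judged to have run too far ahead,
    // stop adopting new physical readings and ratchet the logical component
    // instead. Timestamps received from other nodes (not shown) would advance
    // WallTime again and reset Logical.
    package hlcsketch

    import (
        "sync"
        "time"
    )

    type Timestamp struct {
        WallTime int64 // nanoseconds
        Logical  int32
    }

    type Clock struct {
        mu           sync.Mutex
        ts           Timestamp
        suspectClock bool // set when offset measurement says our clock ran ahead
    }

    func (c *Clock) Now() Timestamp {
        c.mu.Lock()
        defer c.mu.Unlock()
        phys := time.Now().UnixNano()
        if c.suspectClock || phys <= c.ts.WallTime {
            c.ts.Logical++ // risk is overflowing 32 bits, but that's largely theoretical
        } else {
            c.ts.WallTime = phys
            c.ts.Logical = 0
        }
        return c.ts
    }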

mvijaykarthik commented 6 years ago

@tschottdorf persist_upper_bound is useful for single-node clusters where the clock jumps back while cockroach is down. After a restart, the HLC loses its state. Roachtest validating this: https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/clock_monotonic.go#L113

bdarnell commented 6 years ago

@tschottdorf I don't understand how that obviates the need for persist_upper_bound. As @vijaykarthik-rubrik says, the point of persist_upper_bound is to protect against backwards clock jumps while the process is not running. I don't really see an alternative to persisting this value if you can't trust your system clock to be roughly monotonic.
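
For readers following along, here is what that persistence amounts to in miniature (file-based storage and all names are illustrative, not the real cluster setting's implementation): periodically persist a time a bit ahead of now, and on startup refuse to hand out timestamps until the system clock has passed the last persisted value.

    // Sketch only: persist "now + interval" periodically; on startup, sleep until
    // the system clock has passed the last persisted bound, so a backwards jump
    // while the process was down cannot reissue timestamps the previous
    // incarnation may already have handed out.
    package boundsketch

    import (
        "encoding/binary"
        "os"
        "time"
    )

    const upperBoundFile = "hlc-upper-bound" // hypothetical location

    func persistUpperBound(interval time.Duration) error {
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], uint64(time.Now().Add(interval).UnixNano()))
        return os.WriteFile(upperBoundFile, buf[:], 0o644)
    }

    func waitForPersistedUpperBound() error {
        data, err := os.ReadFile(upperBoundFile)
        if os.IsNotExist(err) {
            return nil // first start: nothing was persisted
        } else if err != nil {
            return err
        }
        if len(data) != 8 {
            return nil // unreadable bound; nothing sensible to wait for
        }
        bound := int64(binary.BigEndian.Uint64(data))
        if delta := bound - time.Now().UnixNano(); delta > 0 {
            time.Sleep(time.Duration(delta)) // clock went backwards while we were down
        }
        return nil
    }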

tbg commented 6 years ago

You're both right, I was confused because what persist_upper_bound does is more of a persist_lower_bound for me: it's the lower bound for the clock on a restart. (There's a similar justification for the current name, so I don't think it should be renamed).

In unrelated news, we should also be using the monotonic time signal in general. I.e., instead of

func UnixNano() int64 {
    return timeutil.Now().UnixNano()
}

we should use


var UnixNano = func() func() int64 {
    initNow := time.Now() // not timeutil.Now() -- that calls UTC() and removes the monotonic reading :-(
    initNanos := initNow.UnixNano()
    return func() int64 {
        // Sub uses the monotonic reading carried by initNow, so later wall-clock
        // jumps don't affect the result.
        return initNanos + int64(time.Now().Sub(initNow))
    }
}() // the trailing () runs the outer closure once at init to capture the baseline

That way, if the clock is right at startup (and if it isn't, we hopefully find out soon thereafter) any ntp shenanigans later won't impact it as much, with the only problem being the real clock drift.

It's a lot easier to control the clock only when the node starts. For example, deployments may run a basic sanity check against some external server to make sure the wallclock isn't completely off before starting a server. This is a lot easier than setting NTP up correctly.
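
As an example of the kind of pre-start sanity check meant here (my assumption of what such a check could look like, not an existing CockroachDB feature): compare the local clock against the Date header of a trusted HTTPS endpoint and refuse to start if they disagree by more than some threshold. It's second-granularity and ignores network latency, so it only catches clocks that are wildly off; the URL and threshold are arbitrary examples.

    // Crude startup sanity check: compare the local clock against a trusted
    // server's Date header and refuse to start if they differ by too much.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func checkWallClock(url string, maxSkew time.Duration) error {
        resp, err := http.Head(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        remote, err := http.ParseTime(resp.Header.Get("Date"))
        if err != nil {
            return err
        }
        if skew := time.Since(remote); skew > maxSkew || skew < -maxSkew {
            return fmt.Errorf("local clock differs from %s by %s", url, skew)
        }
        return nil
    }

    func main() {
        if err := checkWallClock("https://www.google.com", 5*time.Minute); err != nil {
            fmt.Println("refusing to start:", err)
            return
        }
        // ... start the server ...
    }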

petermattis commented 6 years ago

That way, if the clock is right at startup (and if it isn't, we hopefully find out soon thereafter) any ntp shenanigans later won't impact it as much, with the only problem being the real clock drift.

The real clock drift can be significant. I don't think we could use this as is without a periodic resync ala the timestamp oracle proposal.

tbg commented 6 years ago

I wasn't suggesting that, but isn't it a lot better than not using the monotonic clock?

bdarnell commented 6 years ago

The clock can drift in both directions. The built-in monotonic time signal that we're discarding only helps with backwards movement, and it's only valid for comparisons within a single process, which makes it redundant with the monotonic ratchet we're already doing.
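
To make the limitation concrete, here is how Go's monotonic reading behaves (per the time package documentation; the snippet is just an illustration): Sub between two Times from the same process uses the monotonic reading, while t.UTC(), which timeutil.Now() applies, strips it.

    // Illustration of Go's monotonic clock reading semantics.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        t1 := time.Now() // carries a monotonic reading
        t2 := time.Now() // so t2.Sub(t1) is immune to wall-clock jumps
        fmt.Println(t2.Sub(t1))

        w1 := time.Now().UTC() // UTC() strips the monotonic reading
        w2 := time.Now().UTC()
        fmt.Println(w2.Sub(w1)) // pure wall-clock arithmetic; a clock jump would show up here
    }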

tbg commented 6 years ago

Ah, the monotonic clock reading isn't at all what I thought. Got it.

petermattis commented 6 years ago

Ditto what Ben said. The monotonic clock doesn't protect against a clock that is too fast and you could get into a situation where one node in the cluster slowly advances faster than the others until it exceeds the max offset and is killed.

tbg commented 6 years ago

If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work? Is that basically what you were thinking for the "timestamp oracle"? It just doesn't seem that we should be using the wall time at all (except for bootstrap) since it has both clock drift and random jumps, and the monotonic clock only has clock drift which hopefully is bounded somewhat (i.e. won't drift more than a few ms per sec). Is there an ingredient here that I'm missing?

petermattis commented 6 years ago

If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work?

My understanding is that Marzullo's algorithm provides an estimate of the actual time of a remote clock, adjusting for network latencies.

Is that basically what you were thinking for the "timestamp oracle"?

I was thinking we'd compute local time as a delta from the last signal from the timestamp oracle where the delta is computed using the monotonic clock. The signal from the timestamp oracle could be adjusted using Marzullo's algorithm.
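
A minimal sketch of that scheme (the names are mine, and the oracle transport is left out): keep the last oracle timestamp together with the local Time at which it arrived, and compute the current time as that signal plus the monotonic delta since receipt.

    // Sketch only: local time = last oracle signal + monotonic delta. Between
    // signals, time.Since uses the monotonic reading carried by receivedAt, so
    // local wall-clock jumps don't leak in.
    package oraclesketch

    import (
        "sync"
        "time"
    )

    type OracleClock struct {
        mu         sync.Mutex
        oracleTime int64     // nanoseconds, as reported by the oracle
        receivedAt time.Time // local Time (with monotonic reading) at receipt
    }

    // OnOracleSignal records a new signal from the timestamp oracle.
    func (c *OracleClock) OnOracleSignal(oracleNanos int64) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.oracleTime = oracleNanos
        c.receivedAt = time.Now()
    }

    // Now returns the oracle time advanced by the locally measured delta.
    func (c *OracleClock) Now() int64 {
        c.mu.Lock()
        defer c.mu.Unlock()
        return c.oracleTime + int64(time.Since(c.receivedAt))
    }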

It just doesn't seem that we should be using the wall time at all (except for bootstrap) since it has both clock drift and random jumps, and the monotonic clock only has clock drift which hopefully is bounded somewhat (i.e. won't drift more than a few ms per sec). Is there an ingredient here that I'm missing?

Without adjustment, a drift of 1 ms per sec would add up to 500ms in a mere 8m20s. I'm not sure if we can use the monotonic clock without adding in some periodic adjustment. The random jumps shouldn't happen in properly configured NTP, though we're seeing how difficult that is to achieve in some environments.

bdarnell commented 6 years ago

If we used what I describe above but also regularly inject a delta from Marzullo's algorithm, wouldn't that work?

In other words, an unprivileged/userspace implementation of NTP? I think it could work, but I would have less faith in such a system than actually using NTP.

The signal from the timestamp oracle could be adjusted using Marzullo's algorithm.

I wasn't thinking of anything that clever. The point of the timestamp oracle IMHO is to take all the smarts out of figuring out the current time, substituting extra communication for any algorithmic smarts or local timekeeping.

petermattis commented 6 years ago

In other words, an unprivileged/userspace implementation of NTP? I think it could work, but I would have less faith in such a system than actually using NTP.

Why? It's not like NTP is magical or poorly understood. The underlying algorithms are straightforward.

I wasn't thinking of anything that clever. The point of the timestamp oracle IMHO is to take all the smarts out of figuring out the current time, substituting extra communication for any algorithmic smarts or local timekeeping.

Adjusting the timestamp oracle signal using Marzullo's algorithm is definitely a second-order effect. It wouldn't be needed for single DC clusters, but would be important for multi-DC clusters where network latencies are more significant.

This is all assuming that we're talking about the same timestamp oracle idea: an oracle that sends a timestamp signal every few seconds. A timestamp oracle that is consulted on every request would be a different beast.

bdarnell commented 6 years ago

Why? It's not like NTP is magical or poorly understood. The underlying algorithms are straightforward.

NTP has had years of evolution. It's certainly not impossible to replicate, but it's one of those projects that is inevitably much larger than it initially appears.

This is all assuming that we're talking about the same timestamp oracle idea: an oracle that sends a timestamp signal every few seconds. A timestamp oracle that is consulted on every request would be a different beast.

Whenever I've said "timestamp oracle", I've meant the latter.

garvitjuniwal commented 6 years ago

[Unrelated to monotonic clocks discussion]

Let's say we have a single-node cluster with both server.clock.persist_upper_bound_interval and server.clock.forward_jump_check_enabled enabled. A common pattern of clock jumps that I have seen (especially on VMs) is that the clock jumps forward by several hours. Then, after a few minutes, it gets corrected back to the right time by NTP. When the forward jump happens, the process will crash itself. When it comes back up, it has no good way to know that the clock has jumped forward and will start using the jumped timestamps, including persisting a forward-jumped value to max-upper-bound. Let's say the clock is corrected by NTP and the process is restarted again for some unrelated reason. When the cluster comes back up, it will be unavailable until the physical time is past the last max-upper-bound, which can be a long time. This will also be the case if the process restarts after any large backward jump. To prevent the unavailability, can we just initialize the WallTime in hlc to the last max-upper-bound found from the persistent store? This would make the process act as if it never crashed after the corrective backward jump, and it could become available immediately.

The same reasoning applies to a multi-node cluster where all nodes have experienced similar jumps at similar times - although I don't know how common that scenario really is, unless you have multiple VMs on the same host.

For a multi-node cluster, where only one node experiences a forward jump, after it crashes, it would not be able to use a jumped timestamp because the clock offset check on the first heartbeat will cause it to crash. This is actually good, because we completely prevent usage of potentially bad timestamps, and we can rely on NTP to correct the glitch within several minutes.

bdarnell commented 6 years ago

That would introduce the opposite problem: If you assume that a system clock that is higher than the persisted time is a clock jump that should be rejected, it would be impossible for a node to recover from downtime longer than the max offset (without manual intervention). In general, forward clock movement is far more likely to reflect the passage of real time than an inappropriate jump.

garvitjuniwal commented 6 years ago

I didn't follow that. Why would a system clock higher than the persisted time be rejected? WallTime will be initialized to the last persisted max-upper-bound, and if the physical clock has moved forward, WallTime will be updated to the new physical time on the first call to hlc.Now().

bdarnell commented 6 years ago

You said:

For a multi-node cluster, where only one node experiences a forward jump, after it crashes, it would not be able to use a jumped timestamp because the clock offset check on the first heartbeat will cause it to crash.

How would it distinguish this "jumped" timestamp from the normal passage of time while the process wasn't running?

garvitjuniwal commented 6 years ago

In a multi-node setup, if one node experiences a jump, when it comes back up, it will check its clock offset relative to other nodes' clocks. I assume that the server waits for an initial heartbeat before it becomes available, and the first heartbeat itself will cause it to crash due to skew.

bdarnell commented 6 years ago

The server waits for an initial heartbeat with each peer, but the cluster-wide offset estimation happens separately. There's no point at which that process is "done".

garvitjuniwal commented 6 years ago

OK. I previously assumed that the initial heartbeat would do the clock offset estimation and check as well. But even if it does not, the consequence is that a high timestamp can get used, and also persisted as max-upper-bound. Even with that, initializing hlc WallTime with the last max-upper-bound is just equivalent to the process never crashing after using the forward-jumped clock. Is the concern that after some sequence of jumps and restarts, this may lead to a situation where physical clocks are actually in sync but WallTime on some nodes is artificially higher than on the rest?

garvitjuniwal commented 6 years ago

Also, if node startup is not blocked by a clock offset check, how much value does server.clock.forward_jump_check_enabled provide? Even if it causes a process to crash after observing a forward jump, the next incarnation of the process can happily use the forward jumped clock.
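
For context, forward-jump detection (as described later in this thread) compares consecutive readings of the physical clock, not WallTime. A rough sketch of that mechanism follows; the interval, threshold, and fatal hook are illustrative, not the real setting's implementation.

    // Sketch only: sample the physical clock on a ticker and fatal if two
    // consecutive samples are further apart than the tick interval plus a
    // tolerated jump.
    package jumpsketch

    import (
        "log"
        "time"
    )

    func monitorForwardJumps(interval, maxJump time.Duration, stop <-chan struct{}) {
        prev := time.Now().UnixNano()
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                now := time.Now().UnixNano()
                if time.Duration(now-prev) > interval+maxJump {
                    log.Fatalf("forward clock jump detected: %s between consecutive readings",
                        time.Duration(now-prev))
                }
                prev = now
            case <-stop:
                return
            }
        }
    }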

bdarnell commented 6 years ago

Even with that, initializing hlc WallTime with last max-upper-bound is just equivalent to the process never crashing after using the forward jumped clock.

We update the HLC clock with the current system time every time we use it, so changing the way we initialize it doesn't change anything.

Also, if node startup is not blocked by a clock offset check, how much value does server.clock.forward_jump_check_enabled provide?

@vijaykarthik-rubrik can answer that.

garvitjuniwal commented 6 years ago

We update the HLC clock with the current system time every time we use it, so changing the way we initialize it doesn't change anything.

I'm looking at this code from util/hlc/hlc.go. It seems that WallTime is not updated when the physical time is behind.

    if physicalClock := c.getPhysicalClockLocked(); c.mu.timestamp.WallTime >= physicalClock {
        // The wall time is ahead, so the logical clock ticks.
        c.mu.timestamp.Logical++
    } else {
        // Use the physical clock, and reset the logical one.
        c.mu.timestamp.WallTime = physicalClock
        c.mu.timestamp.Logical = 0
    }
bdarnell commented 6 years ago

I'm looking at this code from util/hlc/hlc.go It seems that WallTime is not updated when the physical time is behind.

True, but the case we're talking about here is when the system clock is ahead.

garvitjuniwal commented 6 years ago

My intention with initializing the WallTime in hlc to the last max-upper-bound found from the persistent store was to prevent unavailability when the clock first jumps forward, causing a high max-upper-bound to be persisted, and then jumps backward and cockroach is restarted. The current behavior of the flag is to just wait until physical time goes past the last max-upper-bound.

bdarnell commented 6 years ago

In that case I'm not understanding what you're suggesting here. Let's spell out the sequence:

  1. The server is running as PID 1, at time t1
  2. The system clock jumps forward to time t10
  3. The forward-jump detection fires, so PID 1 crashes without persisting a new timestamp. Our persisted upper bound remains t1.
  4. The server restarts as PID 2. The clock still shows time t10.
  5. Currently, we initialize the HLC to t10 and persist it as a new upper bound. My understanding of your suggestion is to initialize the HLC to the persisted time t1, but without a larger refactoring we will immediately update the hlc to t10 from the system clock. If your intention is that this scenario would trigger a "forward clock jump" panic, then you'd prevent the problem we're discussing here, but we'd have the worse problem that any server that had been down for longer than the "forward jump" threshold would never be able to start again.

But it sounds like I've misunderstood something, so could you explain the suggestion again?

garvitjuniwal commented 6 years ago

Let me spell out what I meant.

Scenario 1

Current behavior

  1. The server is running as PID 1, at time t1. max-upper-bound is persisted as t1
  2. The system clock jumps backward to time t(-10)
  3. The server restarts as PID 2 due to other reasons (say a reboot which caused the jump as well)
  4. The clock shows time t(-10), so the server waits until the clock reaches the previous max-upper-bound t1, causing unavailability. The unavailability will last for however long it takes NTP to correct the backward jump (at least 15 minutes, possibly an hour or so).

Behavior with suggested change

  1. The server is running as PID 1, at time t1. max-upper-bound is persisted as t1
  2. The system clock jumps backward to time t(-10)
  3. The server restarts as PID 2 due to other reasons (say a reboot which caused the jump as well)
  4. The server sets its WallTime to t1 based on the max-upper-bound, and opens for business. The logical part of the clock is incremented until the physical clock reaches t1. With a 32-bit logical clock, you can perform clock reads once per millisecond for roughly 50 days (why is the logical clock 32-bit and not 64-bit?)

Scenario 2

Current behavior

  1. The server is running as PID 1, at time t1.
  2. Clock jumps forward to t10. Server restarts due to forward jump detection.
  3. After restart, server comes up as PID 2, and starts using time t10 happily. Note that the forward jump detection is based on consecutive readings of the physical clock and is not based on the value of WallTime. max-upper-bound is persisted as t10.
  4. A reboot happens, time is corrected back to, say, t2, and the server restarts once more as PID 3.
  5. Analogous to the scenario above, the server waits until physical time reaches t10, the last max-upper-bound, and is unavailable until then. The unavailability will last for however long the forward jump was, which can be on the order of days.

Behavior with suggested change

Similar to the above, PID 3 starts WallTime at t10, opens for business, and increments the logical part until physical time catches up.

This is definitely helpful in single-node clusters. In multi-node clusters, it is helpful when all nodes jump at the same time. If only one node jumps in a multi-node cluster, I was hoping that clock offset detection would keep the node out after a restart; I need to think a bit more about whether that's the case.

bdarnell commented 6 years ago

OK, I understand now. Concretely, then, you're suggesting that ensureClockMonotonicity be changed to force-update the HLC instead of sleeping. This puts us in roughly the same state we'd be in if the clock jumped backwards while our process was running.

The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)

Remember that the features we're discussing here are designed to prevent corruption in the face of clock jumps, not to preserve availability with misbehaving clocks. The latter is not something we're prepared to support at this time.

With a 32 bit logical clock, you can perform clock reads once per 1 ms for about a month (why is the logical clock 32 bit and not 64 bit?)

HLC timestamps are written very frequently, so size matters. 32 bits for logical ticks are plenty as long as clocks are performing within our requirements.

garvitjuniwal commented 6 years ago

The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)

Fair point. I think it would still be OK to do this on clusters that are going to be single-node forever, but that seems to be enough of an edge case not to warrant this change.

garvitjuniwal commented 6 years ago

The problem is that if we update our HLC from the persisted upper bound, in order to make the node available, we need to construct a new upper bound that is greater than the previous one and persist that. If we get into a crash loop (which is likely while this kind of clock trouble is going on), the upper bound could creep forward arbitrarily far ahead of real time, so even when NTP recovers you may not get back to normal. (This clock creep would happen independently on each node, so they may report out-of-sync times even when the system clocks are in sync.)

On second thought, I no longer think this is a problem. We do not need to persist an advanced max-upper-bound if the physical clock is behind WallTime. We just need to persist an upper bound on the logical part of the clock (or some separate epoch) to make sure it advances across restarts. Then we will purely be limited by how far the logical clock can take us before it overflows.
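
A sketch of what persisting a bound on the logical part could look like (purely illustrative; the file-based persistence and all names here are assumptions): periodically persist the current logical value plus some slack, and after a restart start the logical component at the persisted value, so the new incarnation's timestamps still compare above the old one's even if the wall clock went backwards.

    // Sketch only: persist logical+slack periodically and resume from it after a
    // restart. Overflow of the 32-bit logical component remains the limiting factor.
    package logicalsketch

    import (
        "encoding/binary"
        "os"
    )

    const logicalBoundFile = "hlc-logical-bound" // hypothetical location

    func persistLogicalUpperBound(currentLogical, slack int32) error {
        var buf [4]byte
        binary.BigEndian.PutUint32(buf[:], uint32(currentLogical+slack))
        return os.WriteFile(logicalBoundFile, buf[:], 0o644)
    }

    // loadLogicalStart returns the logical value a restarted node should begin at.
    func loadLogicalStart() (int32, error) {
        data, err := os.ReadFile(logicalBoundFile)
        if os.IsNotExist(err) {
            return 0, nil
        } else if err != nil {
            return 0, err
        }
        if len(data) != 4 {
            return 0, nil // corrupt or missing bound; fall back to zero
        }
        return int32(binary.BigEndian.Uint32(data)), nil
    }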

bdarnell commented 6 years ago

If you prevent the HLC's wall time from advancing while in this state, then old leases won't expire, so you're back to waiting until real time catches up with the persisted time before you can have any reliable availability.

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

tbg commented 2 years ago

I just abandoned another attempt at adjusting the defaults (#75157); the discussion on the PR has the details on why.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!