cockroachdb / cockroach


hlc: support dynamic uncertainty bounds #75564

Open erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

The availability of accurate clocks is improving via products such as the OpenCompute Time Appliance and AWS Time Sync, which can synchronize time within a data center via protocols such as PTP and NTP. Like Google's TrueTime, these often expose an error bound for every clock reading, typically well below 1 ms. Even public NTP services across the Internet, such as Google's time service, can achieve error bounds of around 1 ms when used with a client such as Chrony.

CRDB currently uses a statically configured max-offset value as an upper bound on the accepted clock skew between nodes, defaulting to 500 ms. This means that, for example, uncertainty restarts must always assume the worst case of 500 ms, which can lead to significant performance degradation under heavy read/write contention.

We should extend CRDB to support dynamic uncertainty bounds for every HLC timestamp reading (e.g. by adding one or two new fields to the hlc.Timestamp struct), and use these instead of max-offset to determine uncertainty. This could significantly improve performance with read/write contention and non-blocking transactions, and also make a commit-wait transaction model like Google's Spanner viable. However, these uncertainty bounds should possibly not be persisted in MVCC timestamps, to decouple the HLC and MVCC time domains.
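For concreteness, a minimal sketch of the difference between the two derivations of the uncertainty limit (illustrative only; these names are not the actual hlc package API, and timestamps are simplified to nanosecond integers):

    package sketch

    import "time"

    // Today: every clock reading inherits the cluster-wide worst case.
    const maxOffset = 500 * time.Millisecond

    func staticUncertaintyLimit(readingNanos int64) int64 {
        return readingNanos + maxOffset.Nanoseconds()
    }

    // Proposed: each reading carries the error bound reported by its own time
    // source, typically well below a millisecond with PTP or clock-bound.
    func dynamicUncertaintyLimit(readingNanos, errorBoundNanos int64) int64 {
        return readingNanos + errorBoundNanos
    }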

Of course, this assumes that the reported error bounds are in fact accurate. This is further complicated by issues such as VM migration pauses (e.g. Google Cloud live migrations and VMware vMotion) or OS scheduling pauses, which can interfere with clock readings and error bounds. This will need further research, but even with significant safety margins these dynamic bounds are expected to be well below 500 ms (if they weren't, we would already have problems with the 500 ms max-offset default). This mechanism should also be offered as an alternative to the existing max-offset setting, so it would be up to operators to determine the accuracy of their clock infrastructure and choose the appropriate mechanism.

Jira issue: CRDB-12728

erikgrinaker commented 2 years ago

However, these uncertainty bounds should most likely not be persisted in MVCC timestamps, to decouple the HLC and MVCC time domains.

Had an internal Slack chat about this, and I'm not convinced this can be avoided. Consider the following sequence of operations by writer A and reader B, where T is true time and HLC timestamps and MVCC reads have uncertainty intervals but MVCC writes/versions don't:

T=5   A: HLC=8±3 -> Put foo@8=bar
T=6   B: HLC=6±1 -> Get foo@6±1 -> nil

This is a linearizability violation: A's write is performed at true time 5, before B's read at true time 6, but B's uncertainty interval only extends to 7, so B does not treat the value at 8 as uncertain and misses it. We avoid this problem currently because all clock readings have identical uncertainty intervals. I think Spanner's commit-wait gets away with this because it holds onto locks while it's waiting out the uncertainty.

erikgrinaker commented 2 years ago

@nvanbenschoten reminds me that follower reads must also be adjusted for this, similarly to how non-blocking txns must for future-time writes:

the clock uncertainty bound [...] dictates how stale a read has to be to be served by a follower replica on a non-GLOBAL table. A follower read needs both its read timestamp and its global_uncertainty_limit below the closed timestamp of a follower during a follower read. [...] Primarily, it means that AS OF SYSTEM TIME follower_read_timestamp() is further in the past. Additionally, it means that long-running txns have to wait longer before they naturally begin performing follower reads.

bdarnell commented 2 years ago

(that slack link doesn't work for me)

I think Spanner's commit-wait gets away with this because it holds onto locks while it's waiting out the uncertainty.

That may be what makes it work, but also remember that Spanner sees time as a range. It generally operates at one of the endpoints of the range, not the midpoint. And it doesn't have to pick the same endpoint for writes as it does for reads. If writes go to the lower bound of the range and reads use the upper bound, that example works (this one works even if reads go to the lower bound). But I'm not sure that generalizes to other cases. (I think it at least needs to be more complicated than using one endpoint for reads and one for writes -- for example, putting a different value in the timestamp cache than the one that is actually used for the reads. But it might be enough to let you store point timestamps instead of ranges in all persisted locations.)

erikgrinaker commented 2 years ago

Good point, that seems worth exploring. I think something along those lines is the only viable path forward here (short of commit-wait) -- leaking uncertainty intervals down into MVCC isn't particularly appealing.

(that slack link doesn't work for me)

Ah, private channel -- ping me on Slack if you'd like an invite, but this issue covers the gist of it.

nvanbenschoten commented 2 years ago

It generally operates at one of the endpoints of the range, not the midpoint. And it doesn't have to pick the same endpoint for writes as it does for reads. If writes go to the lower bound of the range and reads use the upper bound

This seems sound to me. Concretely, we would assign a transaction's initial ReadTimestamp/WriteTimestamp to the lower bound of the uncertainty window and its GlobalUncertaintyLimit to the upper bound of the uncertainty window.
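A sketch of that assignment, assuming a hypothetical clock reading that exposes its uncertainty interval as a pair of nanosecond bounds (the struct and helper names are made up; ReadTimestamp, WriteTimestamp, and GlobalUncertaintyLimit are the existing transaction fields being assigned):

    package sketch

    // reading is a hypothetical dynamic clock reading: the interval
    // [Earliest, Latest] is guaranteed to contain true time.
    type reading struct {
        Earliest int64 // clock estimate minus its reported error bound
        Latest   int64 // clock estimate plus its reported error bound
    }

    type txnTimestamps struct {
        ReadTimestamp          int64
        WriteTimestamp         int64
        GlobalUncertaintyLimit int64
    }

    // assignTxnTimestamps pins the transaction to the lower bound of the
    // uncertainty window and uses the upper bound as its uncertainty limit.
    func assignTxnTimestamps(r reading) txnTimestamps {
        return txnTimestamps{
            ReadTimestamp:          r.Earliest,
            WriteTimestamp:         r.Earliest,
            GlobalUncertaintyLimit: r.Latest,
        }
    }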

It's instructive to consider the other places where we consult the HLC and use the maximum clock offset as well. To do so, let's define the clock's uncertainty interval as [T_earliest, T_latest], where T_earliest = TrueTime - ClockErrorBound and T_latest = TrueTime + ClockErrorBound.

Observed timestamps: these would continue to consult HLC.Now(), which would return T_earliest. The one caveat here is that HLC clock readings need to continue to provide strict monotonicity, even if the underlying clock service does not provide such a guarantee on a single node. The HLC clock also needs to continue to support causality tracking through an Update method.

In other words, the HLC needs to stay, but with a few changes, we can place a time-service with dynamic uncertainty (like https://github.com/aws/clock-bound) below it.

Non-cooperative lease transfers: a leaseholder should enter the stasis period as soon as T_latest exceeds the lease expiration.

Commit wait: by default, transactions need to commit wait in TxnCoordSender.maybeCommitWait until T_earliest exceeds Txn.WriteTimestamp (a rough sketch of this rule follows below). As the comment in maybeCommitWait discusses, this is trivially the case for transactions that did not interact with future-time operations. For those that did, some waiting will be involved.

If the "linearizable" mode is enabled, I think transactions need to commit wait until Txn.WriteTimestamp exceeds T_latest. However, I haven't been able to convince myself that this alone eliminates causal reverse. It feels like writing transactions may need to start with a write timestamp of T_latest instead of T_earliest (mandating a refresh at some point). Otherwise, I don't see how we guarantee monotonicity of the commit timestamp of non-overlapping txns with a happens-before relationship by just looking at T_earliest on different nodes. Clearly, this part needs more thought.

erikgrinaker commented 2 years ago

Thanks for writing this up @nvanbenschoten, this all sounds viable.

The one caveat here is that HLC clock readings need to continue to provide strict monotonicity, even if the underlying clock service does not provide such a guarantee on a single node.

This bit is interesting. I suppose it would be possible for the HLC to lead T_latest as returned by the clock (if the uncertainty is small), in which case the uncertainty would become 0. That doesn't seem safe.

If the "linearizable" mode is enabled, I think transactions need to commit wait until Txn.WriteTimestamp exceeds T_latest. However, I haven't been able to convince myself that this alone eliminates causal reverse. It feels like writing transactions may need to start with a write timestamp of T_latest instead of T_earliest (mandating a refresh at some point).

With Spanner's model at least, an essential difference is that they don't pick the commit timestamp until after they've acquired locks, so there would be no refresh. They acquire locks, pick T_latest as WriteTimestamp, and then wait for Now().T_earliest to pass WriteTimestamp. I'm not sure if we'd want to mix the current model, where readers (and by extension read/writers) take responsibility for linearizability, with commit wait.

erikgrinaker commented 2 years ago

The one caveat here is that HLC clock readings need to continue to provide strict monotonicity, even if the underlying clock service does not provide such a guarantee on a single node.

This bit is interesting. I suppose it would be possible for the HLC to lead T_latest as returned by the clock (if the uncertainty is small), in which case the uncertainty would become 0. That doesn't seem safe.

On second thought, this would be safe as long as every timestamp fed into Update() can be traced back to a T_earliest HLC reading somewhere. All such readings are guaranteed to be before the real time, and by extension they are guaranteed to be before any T_latest anywhere in the cluster.

nvanbenschoten commented 2 years ago

On second thought, this would be safe as long as every timestamp fed into Update() can be traced back to a T_earliest HLC reading somewhere. All such readings are guaranteed to be before the real time, and by extension they are guaranteed to be before any T_latest anywhere in the cluster.

This reasoning sounds correct to me.

For what it's worth, this is a rough version of the API I was imagining the HLC shifting to with dynamic uncertainty bounds. Note the typing:

    package hlc

    func (*Clock) Now() Timestamp
    func (*Clock) NowAsClockTimestamp() ClockTimestamp
    func (*Clock) NowWithUncertainty() (ClockTimestamp, Timestamp)
    func (*Clock) Update(ClockTimestamp)

Some of this already exists, but the NowWithUncertainty method would be new and would replace the existing MaxOffset method. We could actually make this API change today, even with a static max clock offset.

    // NowWithUncertainty returns the current clock reading together with the
    // upper bound of its uncertainty window; with a static max offset, the
    // bound is simply the reading plus maxOffset.
    func (c *Clock) NowWithUncertainty() (earliest ClockTimestamp, latest Timestamp) {
        now := c.NowAsClockTimestamp()
        return now, now.ToTimestamp().Add(c.maxOffset.Nanoseconds(), 0)
    }

That would lay the groundwork to later slot in a dynamic uncertainty bound into the HLC.
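Sketching one way the dynamic version could look, with a ClockSource interface loosely modeled on a service like clock-bound (the interface, its shape, and the nanosecond-integer timestamps are assumptions, not an existing CRDB or clock-bound API):

    package hlcsketch

    import "sync"

    // ClockSource is a hypothetical stand-in for a time service such as
    // clock-bound: each reading carries its own error bound in nanoseconds.
    type ClockSource interface {
        Now() (wallNanos, errorBoundNanos int64)
    }

    // Clock layers HLC-style strict monotonicity and causality tracking on
    // top of a ClockSource whose raw readings may not be monotonic.
    type Clock struct {
        source ClockSource

        mu   sync.Mutex
        last int64 // highest wall time handed out or observed so far
    }

    // NowWithUncertainty returns a strictly monotonic "earliest" reading and
    // the upper bound of its uncertainty window. Ratcheting past the raw
    // reading is safe as long as every value fed into Update traces back to
    // an "earliest" reading somewhere, per the reasoning earlier in the thread.
    func (c *Clock) NowWithUncertainty() (earliest, latest int64) {
        wall, bound := c.source.Now()
        c.mu.Lock()
        defer c.mu.Unlock()
        if wall <= c.last {
            wall = c.last + 1 // source stalled or stepped back; ratchet forward
        }
        c.last = wall
        return wall, wall + bound
    }

    // Update forwards the clock past a timestamp observed on another node,
    // preserving causality tracking across nodes.
    func (c *Clock) Update(observedWallNanos int64) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if observedWallNanos > c.last {
            c.last = observedWallNanos
        }
    }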

erikgrinaker commented 2 years ago

For what it's worth, this is a rough version of the API I was imagining the HLC shifting to with dynamic uncertainty bounds. Note the typing:

Yeah, I was thinking something similar -- I like how the typing enforces the HLC flow. It shouldn't be too hard to implement this behind a cluster setting or environment variable (which would then also have to disable the MaxOffset checks); seems worthwhile for 22.2.

bdarnell commented 2 years ago

All such readings are guaranteed to be before the real time, and by extension they are guaranteed to be before any T_latest anywhere in the cluster.

Putting this "guarantee" another way, if we ratchet the HLC past our T_latest value based on input from another process, this is proof that we have either determined the uncertainty window incorrectly or an out-of-bounds clock jump has occurred. What can we do when (not if) this happens? How safe would it be to increase the uncertainty bounds at this point (and maybe persist this new knowledge to ensure that the uncertainty algorithm becomes more conservative/pessimistic in the future)?

erikgrinaker commented 2 years ago

Putting this "guarantee" another way, if we ratchet the HLC past our T_latest value based on input from another process, this is proof that we have either determined the uncertainty window incorrectly or an out-of-bounds clock jump has occurred. What can we do when (not if) this happens? How safe would it be to increase the uncertainty bounds at this point (and maybe persist this new knowledge to ensure that the uncertainty algorithm becomes more conservative/pessimistic in the future)?

Great point. It seems like we may want to reject messages that would ratchet the HLC past the local T_latest, since either we or they must have an incorrect clock reading. Of course, this means that linearizability breaks, since a value can be written at a timestamp above every other node's T_latest, but that's what happens with inaccurate clocks (even with max-offset or commit-wait). Or perhaps we'd be protected against at least major single-node inaccuracies, because the Raft messages would violate the T_latest check and get rejected.

But this is interesting in that inaccurate T_earliest readings will still propagate throughout the cluster as long as they are below the local T_latest. So if another node has a lenient T_latest, it could ratchet its HLC past other nodes' T_latest readings, which would spread the linearizability violation to writes performed by that node as well. So we should also guard against the local HLC getting ahead of the local T_latest, e.g. by self-terminating. But it's going to be hard to guard against this completely, and it seems like it will need some careful thinking.
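A sketch of the kind of guard being discussed, continuing the hypothetical Clock type from the sketch a few comments up (nanosecond timestamps again; the choice of remedy here, whether rejecting the message, widening the bounds, or terminating, is exactly the open question in this thread, not something implemented today):

    package hlcsketch

    import (
        "fmt"
        "log"
    )

    // checkRemoteTimestamp continues the hypothetical Clock sketch above. A
    // remote timestamp beyond the local T_latest is proof that an error bound
    // was wrong somewhere, so it is rejected rather than allowed to ratchet
    // the HLC.
    func (c *Clock) checkRemoteTimestamp(remoteWallNanos int64) error {
        wall, bound := c.source.Now()
        localLatest := wall + bound

        // Guard 1: refuse updates that would push the HLC past our own T_latest.
        if remoteWallNanos > localLatest {
            return fmt.Errorf("remote timestamp %d ahead of local uncertainty limit %d",
                remoteWallNanos, localLatest)
        }

        // Guard 2: if previously accepted updates have already pushed the HLC
        // past our current T_latest, our own view of time is suspect too; this
        // sketch self-terminates, per the suggestion above.
        c.mu.Lock()
        hlcAhead := c.last > localLatest
        c.mu.Unlock()
        if hlcAhead {
            log.Fatal("HLC is ahead of the local clock's uncertainty limit")
        }

        c.Update(remoteWallNanos)
        return nil
    }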

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!