rmloveland opened this issue 4 years ago
@taroface does this fall under your area?
@rmloveland I believe so. I will put this on my plate!
@knz Can you help answer the questions above?
That's a good question! And a question for @bdarnell and @andreimatei actually.
If I were to set server.clock.persist_upper_bound_interval, I guess I'd set it to 1s. This means that on server startup we'll wait up to 3*1s to make sure that the clock used by the new incarnation is ahead of the highest clock value that might have been used by the old incarnation, but this wait period is amortized by how long it takes us to restart the process.
But I probably wouldn't set this at all, because what it does is confusing and the case it protects against seems very unlikely to me: the clock jumping backwards in between process restarts, and jumping backwards by more than the amount of time it took the process to restart.
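(For reference, cluster settings like this one are changed through the SQL shell; a minimal sketch, using the 1s value floated above purely as the hypothetical it is:)

```shell
# Hedged sketch: change the setting from any node via the SQL shell.
# '1s' is just the value discussed above; '0s' (the default) disables
# persisting the clock's upper bound entirely.
cockroach sql -e "SET CLUSTER SETTING server.clock.persist_upper_bound_interval = '1s';"
```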
The Recommended Production Settings page talks about this setting in somewhat the wrong context: it discusses it in terms of clocks jumping forwards, but this setting is about the clock jumping backwards (I think). The setting is not particularly related to server.clock.forward_jump_check_enabled.
Now about server.clock.forward_jump_check_enabled, I also don't have the warmest feelings. Maybe @bdarnell will defend it. The docs page says that it's about forward clock jumps, such as the ones caused by vMotions. But in the case of a vMotion, we like the forward clock jump. In fact, we have the opposite problem: the clock doesn't jump forward quickly enough after the vMotion is done, and we read stale values. The protection for that is the newer --clock-device=<dev> flag, in conjunction with an ESX driver that gives the virtual machine access to the host's clock. I don't know if the version of ESX supporting this is out yet.
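(For context, the flag is passed at node start; a hedged sketch, where the device path /dev/ptp0 and the other flags are illustrative only, and availability depends on the hypervisor actually exposing a PTP device to the guest:)

```shell
# Hypothetical example: read the clock from a PTP device exposed by the
# hypervisor instead of the guest's system clock. The device path and the
# other flags here are placeholders for your environment.
cockroach start --clock-device=/dev/ptp0 --store=/mnt/cockroach --join=node1:26257
```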
So I think server.clock.forward_jump_check_enabled is about protection against misbehaving NTP servers that themselves jump forward and then cause our nodes to jump, but in a staggered manner. At least that's what I gathered from this thread.
I don't think I would use the setting; I think it's too unpredictable and hard to reason about. Instead I'd just recommend that people monitor our clock offset timeseries.
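(As an aside, one way to watch the clock-offset timeseries is to scrape each node's Prometheus endpoint; a sketch under assumptions — the exact metric names here are my guess and worth verifying against the endpoint's actual output:)

```shell
# Each node serves Prometheus-format metrics on its HTTP port (8080 by
# default). The clock-offset metrics are assumed to include names like
# clock_offset_meannanos; verify against your cluster's real output.
curl -s http://localhost:8080/_status/vars | grep clock_offset
```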
If I were to set server.clock.persist_upper_bound_interval I guess I'd set it to 1s.
I agree.
the case for which I think it protects against seems very unlikely to me
Unlikely, yes. But this is a small price to pay for the guarantee that this unlikely event won't cause stale reads or other misbehavior. Note that both of these settings were introduced by a customer who said they had seen both kinds of problems in the wild.
such as the ones caused by vMotion
Not only vMotion: GCE's live migration has the same potential issues (with nothing we can hook into via the --clock-device flag), and IIRC the customer that introduced this setting wasn't in a virtualized environment at all.
I don't think I would use the setting; I think it's too unpredictable and hard to reason about. Instead I'd just recommend for people to monitor our clock offset timeseries.
I think both of these settings are important because they're so much easier to understand than our clock offset monitoring. With these settings we directly monitor clock misbehavior on a single node and detect it quickly (and synchronously) so it doesn't have a chance to introduce data anomalies. Our other clock sync monitoring is based on periodic polling, feeding that data into relatively complex aggregation, and detecting issues seconds after the fact. We still need to do that to detect slowly drifting offsets between nodes, but for jumps we can do much better.
In fact, I've long been interested in enabling these settings by default, so maybe we should do that, and then the docs would be about how to opt out. I think the persist-upper-bound setting is pretty safe to enable by default. The forward jump detection would be more annoying, since it would also crash single-node dev clusters on laptops or suspended VMs. Maybe now that we have start-single-node, we could tie the defaults to that, so multi-node clusters get strict clock monitoring but single nodes don't.
such as the ones caused by vMotion
Not only vMotion: GCE's live migration has the same potential issues (with nothing we can hook into via the --clock-device flag), and IIRC the customer that introduced this setting wasn't in a virtualized environment at all.
Just to make sure you're not missing my point: when there's vmotion/live migration happening, you want/expect the clock jump. In fact you want it as quickly as possible after the migration. So in these environments, you probably don't want to enable the protection and crash. Unless you're not trusting the clock to jump quickly enough after the migration and you're relying on the crashes to alert you that you haven't disabled the migrations :). So, if anything, it's the environments that don't have these migrations that in principle benefit from this setting.
Unless you're not trusting the clock to jump quickly enough after the migration and you're relying on the crashes to alert you that you haven't disabled the migrations
Yes, exactly. If the clock jump is not synchronous with the VM migration, there is a risk of a new transaction getting assigned a stale timestamp and violating single-key linearizability. It is important to either disable such migrations or take steps (such as the clock device configuration) to ensure that you're reaching out to the host clock directly and making the jump synchronous with the migration. The default assumption for a multi-node cluster should be that forward clock jumps indicate a problem.
Heh, but a crash is a pretty bad way to "indicate a problem"... Particularly since the harm, if any, has already been done by the time we crash. And, as opposed to other situations where we detect clock skew and crash, here the assumption is that the jump stopped the badness. Seems silly to crash right when the going was starting to get good. We don't have great alerting mechanisms, but we can easily log at the ERROR level, and/or add a timeseries if we don't have them already.
We had another customer hit this today as well: https://github.com/cockroachdb/cockroach/issues/74909 They had some of their VMs temporarily use a wall clock 8 hours in the future. It turns out that even a single such node can in effect tank the cluster, so the downside is essentially unlimited.
Richard Loveland (rmloveland) commented: According to Tobi from the KV on-call meeting just now (please correct me if needed, Tobi!), a reasonable value of server.clock.persist_upper_bound_interval to prevent the 8h badness would be something a few seconds in the future, like 10s or 30s.
Yeah, 10s seems very conservative (i.e., it should not cause issues). Before we roll this out in docs, we should also enable that for nightly roachtest.
Also, @joshimhoff, does CC run with this setting? If we're recommending it to our customers, we should also use it internally. Happy to be the eng partner for this but I'll need someone on the cloud side to sign off on it and to roll out the change.
Note that there are two settings here, and they solve different problems. server.clock.forward_jump_check_enabled is just a boolean, not an interval, and it's the one we need for cockroachdb/cockroach#74909. But it's an incomplete solution: it would prevent one node with a bad clock from messing up the others, and it would prevent problems if any or all nodes experienced forward clock jumps while they're running, but it would still leave the possibility of problems if multiple nodes had their clocks set into the future while they're offline. It's not clear that there's any way to protect against this: the nodes can't tell whether their clocks are erroneously set into the future or whether they've just been offline a long time and time has passed naturally.
All production clusters should have forward_jump_check_enabled. The reason we have not turned it on by default is that on a machine which may get suspended/resumed (e.g. a laptop), this would cause crashes when the process wakes back up. Maybe we should reverse things, though, so we default to the production-ready configuration and then turn off the safety checks for dev usage.
persist_upper_bound_interval is just a performance tradeoff: the lower the better for detecting clock badness, but it has a slight performance cost. Anything in the 1-10 second range should be fine.
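(Putting the two recommendations together, a hedged sketch of the production configuration described in this thread; the 10s value is the one suggested above, not an official default:)

```shell
# Crash the node on a suspiciously large forward clock jump (boolean).
cockroach sql -e "SET CLUSTER SETTING server.clock.forward_jump_check_enabled = true;"

# Persist an upper bound on the clock; anything in the 1-10s range per the
# discussion above, with 10s being the conservative value suggested here.
cockroach sql -e "SET CLUSTER SETTING server.clock.persist_upper_bound_interval = '10s';"
```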
linville (mdlinville) commented: Hey Ben Darnell, did we update the defaults for this? Did we enable the 10s setting in roachtest? Basically, do you know the status of the docs request here?
Matt, I don't think we did anything.
Richard Loveland (rmloveland) commented:
In the Production Checklist under Clock Synchronization, we recommend:
On the Cluster Settings page, these are listed as false and 0s/disabled, respectively.
It's good that we say users should do something here. However, we should really recommend what to set server.clock.persist_upper_bound_interval to when setting server.clock.forward_jump_check_enabled to true. If we cannot make a recommendation, we should tell the user how to come up with the value based on some kind of testing.
Jira Issue: DOC-576