In https://github.com/cockroachdb/cockroach/pull/129123 we're introducing an end-to-end PTP test.
As part of this test, we run CRDB configured to use the ptp device, and wind the system clock back by an hour while CRDB is operating. The assumption is that this has no impact on the running cluster.
However, we do see gRPC connections breaking and queries failing for a period of ~tens of seconds, i.e. the cluster is experiencing a failover. It seems to operate "normally" after that, though the test wouldn't catch subtle problems a customer well might. The test works around this by sleeping for a minute after the clock change, and only then assuming that the cluster "should work".
An audit shows that there are hundreds of non-test callers of timeutil.Now() which is not aware of the PTP clock. It is anyone's guess which problems occur when these callers receive their time off the regular clock source, which may give an entirely different clock signal than the PTP clock.
This is a brittle situation.
To Reproduce
run the PTP roachtest (suitably modified) and peruse the callers to timeutil.Now().
There may be additional callers in our dependencies that simply can't be made aware of the PTP clock.
Expected behavior
Unclear; do we expect everyone to use the PTP clock? Or are we expecting operators to synchronize the system clock to the PTP clock? In that case though, what is the benefit of using the PTP clock directly. One may wonder whether this feature should have been built.
Describe the problem
In https://github.com/cockroachdb/cockroach/pull/129123 we're introducing an end-to-end PTP test. As part of this test, we run CRDB configured to use the ptp device, and wind the system clock back by an hour while CRDB is operating. The assumption is that this has no impact on the running cluster.
However, we do see gRPC connections breaking and queries failing for a period of ~tens of seconds, i.e. the cluster is experiencing a failover. It seems to operate "normally" after that, though the test wouldn't catch subtle problems a customer well might. The test works around this by sleeping for a minute after the clock change, and only then assuming that the cluster "should work".
An audit shows that there are hundreds of non-test callers of
timeutil.Now()
which is not aware of the PTP clock. It is anyone's guess which problems occur when these callers receive their time off the regular clock source, which may give an entirely different clock signal than the PTP clock.This is a brittle situation.
To Reproduce
run the PTP roachtest (suitably modified) and peruse the callers to
timeutil.Now()
. There may be additional callers in our dependencies that simply can't be made aware of the PTP clock.Expected behavior
Unclear; do we expect everyone to use the PTP clock? Or are we expecting operators to synchronize the system clock to the PTP clock? In that case though, what is the benefit of using the PTP clock directly. One may wonder whether this feature should have been built.
Additional data / screenshots https://github.com/cockroachdb/cockroach/pull/129123 has a few example errors
While building the PTP test, I whipped up (but did not merge) a quick and dirty PR to use the correct clock source in Gossip: https://github.com/cockroachdb/cockroach/pull/130496
Environment:
master
at time of writingJira issue: CRDB-41839