cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.88k stars 3.77k forks source link

*: ptp support insufficient, hundreds of references to system time remain #130010

Open tbg opened 1 week ago

tbg commented 1 week ago

Describe the problem

In https://github.com/cockroachdb/cockroach/pull/129123 we're introducing an end-to-end PTP test. As part of this test, we run CRDB configured to use the ptp device, and wind the system clock back by an hour while CRDB is operating. The assumption is that this has no impact on the running cluster.

However, we do see gRPC connections breaking and queries failing for a period of ~tens of seconds, i.e. the cluster is experiencing a failover. It seems to operate "normally" after that, though the test wouldn't catch subtle problems a customer well might. The test works around this by sleeping for a minute after the clock change, and only then assuming that the cluster "should work".

An audit shows that there are hundreds of non-test callers of timeutil.Now() which is not aware of the PTP clock. It is anyone's guess which problems occur when these callers receive their time off the regular clock source, which may give an entirely different clock signal than the PTP clock.

This is a brittle situation.

To Reproduce

run the PTP roachtest (suitably modified) and peruse the callers to timeutil.Now(). There may be additional callers in our dependencies that simply can't be made aware of the PTP clock.

Expected behavior

Unclear; do we expect everyone to use the PTP clock? Or are we expecting operators to synchronize the system clock to the PTP clock? In that case though, what is the benefit of using the PTP clock directly. One may wonder whether this feature should have been built.

Additional data / screenshots https://github.com/cockroachdb/cockroach/pull/129123 has a few example errors

While building the PTP test, I whipped up (but did not merge) a quick and dirty PR to use the correct clock source in Gossip: https://github.com/cockroachdb/cockroach/pull/130496

Environment:

Jira issue: CRDB-41839

blathers-crl[bot] commented 1 week ago

Hi @tbg, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.