cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

changefeedccl: PTS management on Serverless #106608

Open miretskiy opened 11 months ago

miretskiy commented 11 months ago

As reported by a serverless customer, a changefeed failed with the replica GC error:

job failed (batch timestamp 1688843844.427894356,1 must be after replica GC threshold 1688992414.535792633,0) but is being paused because of on_error=pause
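To get a sense of how stale the batch timestamp was, the wall-clock parts of the two timestamps in the error can be compared directly. This is a minimal sketch that assumes the part before the comma is seconds since the Unix epoch (with a nanosecond fraction) and the part after the comma is a logical counter; the helper name `parse_hlc` is made up for illustration and is not a CockroachDB API.

```python
from decimal import Decimal
from typing import Tuple


def parse_hlc(ts: str) -> Tuple[Decimal, int]:
    """Split 'seconds.nanos,logical' into (wall seconds, logical counter)."""
    wall, logical = ts.split(",")
    return Decimal(wall), int(logical)


# Values copied from the error message above.
batch_wall, _ = parse_hlc("1688843844.427894356,1")
gc_wall, _ = parse_hlc("1688992414.535792633,0")

# Decimal keeps the nanosecond digits exact, which float would not.
gap_seconds = gc_wall - batch_wall
gap_hours = gap_seconds / Decimal(3600)
print(f"batch timestamp trails the GC threshold by {gap_seconds} s (~{gap_hours:.1f} h)")
```

Under that reading, the batch timestamp trails the GC threshold by roughly 41 hours, so the changefeed was asking for data at a time well behind what the replica still retained.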

The changefeed seems to be well configured and uses the new webhook sink. The job record does not have MaximumPTSAge set (more on this below). The cluster is running version 23.1.3. The changefeed uses a CDC query (a very simple one: SELECT * WHERE status='hello').

The error could happen if something strange is happening with the protected timestamp (PTS) record. So, the possibilities are:

In addition, releases after 23.1.3 have changefeed.protect_timestamp.max_age, which should expire PTS records older than 4 days... but that functionality will certainly cause problems on serverless.

All of the above needs to be investigated/verified and fixed.

Jira issue: CRDB-29640

Epic CRDB-11783

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/cdc

alyshanjahani-crl commented 11 months ago

Caveat, serverless is actually running https://github.com/cockroachlabs/release-staging/releases/tag/v23.1.3-121-gd8d427b1574

This is a custom build using the candidate SHA for 23.1.4 w/ a single commit cherry-picked.

In the console this shows as v23.1.3, which is why the user reported that version.

miretskiy commented 11 months ago

Looks like we correctly protect both the data and the system.descriptor table. So it is not clear what exactly is going on.

jayshrivastava commented 11 months ago

When a changefeed starts, we create a PTS record in system.protected_ts_records using the PTS store API, but this does not actually apply protections. These protections need to be translated to span configs stored in system.span_configurations before they are actually checked when the KV server attempts GC. This translation is done asynchronously by the auto span reconciliation job, which runs a rangefeed to watch system.protected_ts_records.

Because of this delay, this scenario is possible:

  1. Create a changefeed job in a transaction. Write to system.protected_ts_records in the txn to protect the data and descriptors. The highwater of the changefeed is the statement time.
  2. The SQL pod shuts down, disabling the span reconciliation job before the PTS record is translated to a span config.
  3. The KV server remains running and GCs the data.
  4. The SQL pod starts up and the changefeed starts and attempts to read data using the statement time in (1).
  5. The changefeed gets an error saying that the timestamp is less than the GC threshold and the data is gone.
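The five steps above can be mimicked with a toy model. This is only an illustrative sketch, not CockroachDB code: `ToyCluster`, its fields, the integer clock, and the GC TTL of 10 ticks are all assumptions made up for the example. The one property it tries to capture is that GC consults only the translated span configs, never the PTS records directly.

```python
from dataclasses import dataclass, field
from typing import List


class GCThresholdError(Exception):
    pass


@dataclass
class ToyCluster:
    """Toy model; all names are hypothetical stand-ins."""
    gc_ttl: int = 10
    pts_records: List[int] = field(default_factory=list)   # ~ system.protected_ts_records
    span_configs: List[int] = field(default_factory=list)  # ~ system.span_configurations
    gc_threshold: int = 0
    sql_pod_up: bool = True

    def create_changefeed(self, now: int) -> int:
        # Step 1: write the PTS record; the highwater is the statement time.
        self.pts_records.append(now)
        return now

    def reconcile(self) -> None:
        # Asynchronous translation; it only runs while the SQL pod is up.
        if self.sql_pod_up:
            self.span_configs = list(self.pts_records)

    def run_gc(self, now: int) -> None:
        # GC looks at span configs only, never at pts_records.
        candidate = now - self.gc_ttl
        if self.span_configs:
            candidate = min(candidate, min(self.span_configs) - 1)
        self.gc_threshold = max(self.gc_threshold, candidate)

    def read_at(self, ts: int) -> str:
        if ts <= self.gc_threshold:
            raise GCThresholdError(
                f"batch timestamp {ts} must be after replica GC threshold {self.gc_threshold}"
            )
        return "ok"


def simulate(reconcile_before_shutdown: bool) -> str:
    cluster = ToyCluster()
    highwater = cluster.create_changefeed(now=100)  # step 1
    if reconcile_before_shutdown:
        cluster.reconcile()
    cluster.sql_pod_up = False                      # step 2
    cluster.reconcile()                             # no-op while the pod is down
    cluster.run_gc(now=200)                         # step 3
    cluster.sql_pod_up = True                       # step 4
    try:
        return cluster.read_at(highwater)           # step 5
    except GCThresholdError as err:
        return f"error: {err}"


print(simulate(reconcile_before_shutdown=False))
print(simulate(reconcile_before_shutdown=True))
```

In this model the read fails only when the pod goes down before reconciliation has copied the record into the span configs; if reconciliation runs first, GC stops just below the protected timestamp and the read succeeds.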

I have a repro here https://github.com/jayshrivastava/cockroach/commit/221d43e04c7f89ffde320fc479226202b23c4596 which simulates the above scenario and produces the "replica GC threshold" error.

There are two problems which cause this failure:

  1. Creating a PTS record in a txn does not mean protections are immediately active. There is some delay.
  2. The SQL pod can be shut down after a PTS record is created but before it becomes a span config (notably, changefeeds mark themselves idle too early, which is a bug https://cockroachlabs.slack.com/archives/C01L5LYD401/p1689177635731319?thread_ts=1689111104.143699&cid=C01L5LYD401)

Also note that as long as the first PTS record is actually active before we start the aggregators, the remaining PTS updates are safe to be preempted and resumed later. This is because a changefeed will only bump up the timestamp of the PTS record it makes initially.
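The condition in the note above (the first PTS record must be active before the aggregators start) could be expressed as a gate that polls until the protection is visible. This is a hypothetical sketch and not something the thread says was implemented: `wait_until_protected` and the `is_protected` callback are invented names, and how to actually check that a span config reflects the PTS record is left to the caller.

```python
import time
from typing import Callable


def wait_until_protected(
    is_protected: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.01,
) -> bool:
    """Poll is_protected() until it returns True or timeout_s elapses.

    Returns True if the protection became visible in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_protected():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage with a fake check that reports "active" on the third poll.
answers = iter([False, False, True])
ready = wait_until_protected(lambda: next(answers), timeout_s=1.0)
print("start aggregators" if ready else "do not start; protection not yet active")
```

A caller would start reading at the statement time only when the function returns True; on False it could retry or fail the job rather than risk reading below the GC threshold.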

miretskiy commented 11 months ago

I have confirmation that the information from the PTS table was not propagated down to the system.span_configurations table.

jayshrivastava commented 11 months ago

More discussion here https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1689344991884709?thread_ts=1689343376.821539&cid=C0KB9Q03D