Open miretskiy opened 11 months ago
cc @cockroachdb/cdc
Caveat: serverless is actually running https://github.com/cockroachlabs/release-staging/releases/tag/v23.1.3-121-gd8d427b1574
This is a custom build using the candidate SHA for 23.1.4 with a single commit cherry-picked. In the console this shows as v23.1.3, which is why the user reported that version.
Looks like we correctly protect both the data and the system.descriptor table, so it is not clear what exactly is going on.
When a changefeed starts, we create a PTS record in system.protected_ts_records using the PTS store API, but this does not actually apply any protections. These protections need to be translated to span configs stored in system.span_configurations before they are actually checked when the KV server attempts GC. This translation is done asynchronously by the auto span reconciliation job, which runs a rangefeed to watch system.protected_ts_records.
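To make the lag concrete, here is a minimal toy model of the mechanism described above. It is only a sketch: the `cluster` type, its fields, and the integer timestamps are hypothetical stand-ins and not CockroachDB APIs. The only point is that GC consults the reconciled span configs, not the PTS records themselves.

```go
package main

import "fmt"

// Timestamp is a simplified logical timestamp (assumption: real timestamps are HLC values).
type Timestamp int64

// cluster is a hypothetical toy model of the two tables involved.
type cluster struct {
	ptsRecords  []Timestamp // stands in for system.protected_ts_records
	spanConfigs []Timestamp // stands in for protections in system.span_configurations
	gcThreshold Timestamp
}

// protect writes a PTS record. On its own this does not affect GC.
func (c *cluster) protect(ts Timestamp) {
	c.ptsRecords = append(c.ptsRecords, ts)
}

// reconcile models the asynchronous job that copies PTS records into span configs.
func (c *cluster) reconcile() {
	c.spanConfigs = append([]Timestamp(nil), c.ptsRecords...)
}

// gc tries to advance the GC threshold. It only consults span configs and
// refuses to move the threshold to or past a protected timestamp.
func (c *cluster) gc(newThreshold Timestamp) bool {
	for _, p := range c.spanConfigs {
		if p <= newThreshold {
			return false
		}
	}
	if newThreshold > c.gcThreshold {
		c.gcThreshold = newThreshold
	}
	return true
}

// read fails if the requested timestamp is at or below the GC threshold.
func (c *cluster) read(ts Timestamp) error {
	if ts <= c.gcThreshold {
		return fmt.Errorf("read at %d is not above replica GC threshold %d", ts, c.gcThreshold)
	}
	return nil
}

func main() {
	// Racy ordering: GC runs before the PTS record has been reconciled.
	racy := &cluster{}
	racy.protect(10)
	fmt.Println("racy: gc advanced =", racy.gc(20))
	racy.reconcile()
	fmt.Println("racy: read error =", racy.read(10))

	// Safe ordering: reconciliation happens before GC is attempted.
	safe := &cluster{}
	safe.protect(10)
	safe.reconcile()
	fmt.Println("safe: gc advanced =", safe.gc(20))
	fmt.Println("safe: read error =", safe.read(10))
}
```

In the racy ordering the record exists the whole time, yet GC still advances because nothing has reached the span configs when it runs.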
Because of this delay, this scenario is possible:
system.protected_ts_records in the txn to protect the data and descriptors. The highwater of the changefeed is the statement time.
I have a repro here https://github.com/jayshrivastava/cockroach/commit/221d43e04c7f89ffde320fc479226202b23c4596 which simulates the above scenario and produces the "replica GC threshold" error.
There are two problems which cause this failure:
Also note that as long as the first PTS record is actually active before we start the aggregators, the remaining PTS updates are safe to be preempted and resumed later. This is because a changefeed will only bump up the timestamp of the PTS record it makes initially.
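One way to read the note above is that the changefeed could block until the first protection is known to be active before starting aggregators. The following is a rough sketch of such a wait loop, not a proposal for the actual fix: `waitUntilActive`, the `isActive` callback, and `demo` are hypothetical names, and how "active" would really be detected is left open here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntilActive polls isActive until it reports true or ctx is done.
// isActive is a hypothetical callback, e.g. "has the PTS record been
// reconciled into span configs yet?".
func waitUntilActive(ctx context.Context, isActive func() bool, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if isActive() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("protection not active yet: %w", ctx.Err())
		case <-ticker.C:
			// Poll again on the next tick.
		}
	}
}

// demo simulates a protection that becomes active after pollsNeeded checks,
// and gives up after timeoutMillis milliseconds.
func demo(pollsNeeded int, timeoutMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()

	polls := 0
	isActive := func() bool {
		polls++
		return polls >= pollsNeeded
	}
	return waitUntilActive(ctx, isActive, time.Millisecond)
}

func main() {
	// Protection becomes active on the third check: aggregators may start.
	fmt.Println("fast reconciliation:", demo(3, 1000))
	// Protection never becomes active within the deadline: do not start.
	fmt.Println("stalled reconciliation:", demo(1<<30, 20))
}
```

Only the initial record needs this gate; per the note above, later PTS updates only move the timestamp of that same record forward.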
I have confirmation that the information from the PTS table was not propagated down to the system.span_configurations table.
As reported by a serverless customer, a changefeed failed with the replica GC error:
The changefeed seems to be well configured, and uses the new webhook sink. The job record does not have MaximumPTSAge set (more on this below). The cluster is running version 23.1.3. The changefeed uses a CDC query (a very simple one: SELECT * WHERE status='hello').
The error could happen if something strange is happening with the protected timestamp record. So, the possibilities are:
In addition, releases after 23.1.3 have changefeed.protect_timestamp.max_age, which should expire PTS records older than 4 days... but that functionality will certainly cause problems on serverless.
All of the above needs to be investigated/verified and fixed.
Jira issue: CRDB-29640
Epic CRDB-11783