cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

changefeedccl: on_error_pause with expired replica GC. #77544

Open miretskiy opened 2 years ago

miretskiy commented 2 years ago

Observed in 21.2; may impact master.

When running a changefeed with OptOnErrorPause together with OptProtectDataFromGCOnPause, if replica GC expires, we attempt to pause, and the pause returns an error (replica GC threshold exceeded). As a result, SHOW JOBS indicates that the changefeed is running.

There are at least 2 issues here:

  1. SHOW JOBS should not "lie".
  2. When replica GC expires, we should not pretend that we can handle this error -- the changefeed must fail with a permanent error.
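
The second point can be illustrated with a minimal sketch of the decision being asked for. This is not the CockroachDB implementation (which is written in Go); the names `GCThresholdExceededError`, `ChangefeedOptions`, and `next_job_status` are hypothetical and exist only for illustration. The sketch also simplifies by ignoring retryable errors, so every other error either pauses the job (when pausing on error is enabled) or fails it.

```python
from dataclasses import dataclass


class GCThresholdExceededError(Exception):
    """Hypothetical stand-in for the 'replica GC threshold exceeded' error."""


@dataclass(frozen=True)
class ChangefeedOptions:
    # Hypothetical mirrors of OptOnErrorPause and OptProtectDataFromGCOnPause.
    on_error_pause: bool = False
    protect_data_from_gc_on_pause: bool = False


def next_job_status(err: Exception, opts: ChangefeedOptions) -> str:
    """Return the status a changefeed job should move to after `err`."""
    # Data below the GC threshold is gone, so pausing cannot help:
    # treat this error as permanent regardless of the pause options.
    if isinstance(err, GCThresholdExceededError):
        return "failed"
    # Other errors may be recoverable by an operator, so honor the option.
    if opts.on_error_pause:
        return "paused"
    return "failed"
```

The point of the sketch is the ordering: the GC check runs before the pause option is consulted, so the job can never be left reporting "running" after a pause attempt that was bound to fail.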

Note: https://github.com/cockroachdb/cockroach/pull/76605 modified PTS handling; master (and 22.1) may or may not be affected, but this needs to be verified and investigated.

Jira issue: CRDB-13644

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/cdc

amruss commented 2 years ago

Query run:

    select job_id, status,
           ((high_water_timestamp/1000000000)::int::timestamp) - now() as "changefeed latency",
           created, left(description, 60), high_water_timestamp
    from crdb_internal.jobs
    where job_type = 'CHANGEFEED'
      and status in ('running', 'paused', 'pause-requested')
    order by created desc;
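
For readers unfamiliar with the arithmetic in that query, here is a small sketch of the same conversion done client-side. It assumes, as the division by 1000000000 in the query implies, that the integer part of `high_water_timestamp` is nanoseconds since the Unix epoch. The function names are hypothetical. Note that the query computes the timestamp minus `now()`, so a lagging changefeed shows a negative interval, whereas this sketch returns the lag as a positive number of seconds. It also truncates to whole seconds, which may differ by up to a second from the SQL cast.

```python
from datetime import datetime, timezone
from decimal import Decimal


def high_water_to_datetime(high_water: Decimal) -> datetime:
    """Convert a nanosecond-based high-water value to a UTC datetime."""
    # Drop the fractional part, then reduce nanoseconds to whole seconds.
    whole_seconds = int(high_water) // 1_000_000_000
    return datetime.fromtimestamp(whole_seconds, tz=timezone.utc)


def changefeed_lag_seconds(high_water: Decimal, now: datetime) -> float:
    """Return how far the high-water mark trails `now`, in seconds."""
    return (now - high_water_to_datetime(high_water)).total_seconds()
```

Passing `datetime.now(timezone.utc)` as `now` gives the current lag for a value read from the `high_water_timestamp` column.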

amruss commented 2 years ago

@gh-casper we should explicitly handle the runover GC error so that the jobs table reflects the failed job.

We should also backport this.