cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

streamingccl: batch timestamp error not returned to client #113863

Open stevendanna opened 11 months ago

stevendanna commented 11 months ago

Describe the problem

In a recent replication stream failure, the job went into a paused state with an error that looked like:

pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: timeout: context canceled 

But digging into the logs shows that the real error was:

logs/cockroach.cct-232-dest-0004.ubuntu.2023-11-02T21_18_39Z.012485.log:E231103 03:49:26.987900 2425145 ccl/streamingccl/streamingest/stream_ingestion_processor.go:556 ⋮ [T1,Vsystem,n4,f‹a35bb8dd›,job=913868579942367233,distsql.gateway=9,distsql.appname=‹$ internal-resume-job-913868579942367233›] 41872  error on close(): ‹ERROR: batch timestamp 1698960957.937065386,0 must be after replica GC threshold 1698967771.741796894,0 (SQLSTATE XXUUU)›

This error should have been propagated to the overall job. Further, we should likely make this a permanent error that we don't retry, since no amount of retrying is going to fix it.

Jira issue: CRDB-33238

blathers-crl[bot] commented 11 months ago

Hi @stevendanna, please add branch-* labels to identify which branch(es) this release-blocker affects.


blathers-crl[bot] commented 11 months ago

cc @cockroachdb/disaster-recovery