cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.1k stars, 3.81k forks

backup: Backup jobs failing in drt-chaos #131174

Closed: csgourav closed this 1 month ago

csgourav commented 1 month ago

We saw backup failures on drt-chaos. All backup jobs are failing with the error message shown below:

ExportRequest for span %d timed out after 5m

failed to run backup: running distributed backup to export 13544 ranges: export request timeout: operation "ExportRequest for span /Table/109/182/16{59/"|\x9a\xe7`\vvGr\xbd/\xb8\rz\x16K\xd3"-61/"\t<\x8d\xc93\x94J\x00\x93\xb9\xac\x83\x11\t_3"}" timed out after 5m0.001s (given timeout 5m0s): aborted in DistSender: result is ambiguous: context deadline exceeded

These are hourly backup jobs. Also, the oldest protected timestamp is quite high, around 1d14h.
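The error chain reads from outermost to innermost: the backup job failed because one ExportRequest exceeded its 5-minute per-request timeout, and DistSender reported that request as an ambiguous result caused by the context deadline expiring. The Python sketch below is a hypothetical, scaled-down model of that wrapping (milliseconds instead of minutes, sequential exports, made-up span keys). It is not CockroachDB's implementation; `export_span`, `export_with_timeout`, `run_backup`, and `RequestTimeoutError` are illustrative names only.

```python
import asyncio


class RequestTimeoutError(Exception):
    """Hypothetical error type; its message mimics the shape of the one in the report."""


async def export_span(span: str, work_seconds: float) -> str:
    # Hypothetical stand-in for one ExportRequest; the sleep models time spent
    # doing (or waiting to be admitted for) the export work.
    await asyncio.sleep(work_seconds)
    return f"exported {span}"


async def export_with_timeout(span: str, work_seconds: float, timeout: float) -> str:
    # Each span gets its own deadline, analogous to the per-request "given timeout 5m0s".
    loop = asyncio.get_running_loop()
    start = loop.time()
    try:
        return await asyncio.wait_for(export_span(span, work_seconds), timeout)
    except asyncio.TimeoutError as exc:
        elapsed = loop.time() - start
        raise RequestTimeoutError(
            f'operation "ExportRequest for span {span}" timed out after '
            f"{elapsed:.3f}s (given timeout {timeout}s)"
        ) from exc


async def run_backup(spans: dict[str, float], timeout: float) -> list[str]:
    # Exports spans one at a time for simplicity; a single timed-out span
    # aborts the whole (hypothetical) backup, wrapping the cause with context.
    results = []
    for span, work_seconds in spans.items():
        try:
            results.append(await export_with_timeout(span, work_seconds, timeout))
        except RequestTimeoutError as exc:
            raise RuntimeError(
                "failed to run backup: running distributed backup to export "
                f"{len(spans)} ranges: export request timeout: {exc}"
            ) from exc
    return results


if __name__ == "__main__":
    # Scaled-down timings: 50ms instead of 5m; the second span is too slow.
    spans = {"/Table/109/1": 0.0, "/Table/109/2": 1.0}
    try:
        asyncio.run(run_backup(spans, timeout=0.05))
    except RuntimeError as err:
        print(err)
        # The innermost cause is the deadline expiry itself.
        print(repr(err.__cause__.__cause__))
```

In this model, a single starved span is enough to fail the whole job, so a backup covering many ranges (13544 in the report) fails as soon as any one export request waits past its deadline.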

CPU utilization was around 60-80%, with tpcc and kv workloads running on the cluster.

More details are in the thread.

Jira issue: CRDB-42428

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/disaster-recovery

dt commented 1 month ago

On the hardware dashboard I'm seeing 95%+ CPU utilization, and 0.98s/1s of elastic token exhaustion on the overload dashboard; this looks like AC (admission control) performing as expected.
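Reading "0.98s/1s" as elastic tokens being exhausted for roughly 0.98s of every second, a toy model helps show why that starves low-priority work. The sketch below is a hypothetical token bucket for "elastic" work, not CockroachDB's admission control code; the function name, its parameters, and the tick-based timing are all assumptions made for illustration. When elastic demand outpaces the refill rate, the bucket is empty almost every tick and queued work keeps growing, which is one plausible way long-running export requests could end up hitting a fixed timeout.

```python
def simulate_elastic_bucket(
    refill_per_tick: float, demand_per_tick: float, capacity: float, ticks: int
) -> tuple[float, float]:
    """Toy token bucket for low-priority ("elastic") work (hypothetical).

    Returns (fraction of ticks spent exhausted, work still queued at the end).
    """
    tokens = capacity
    queued = 0.0
    exhausted_ticks = 0
    for _ in range(ticks):
        # Refill once per tick, never beyond the bucket's capacity.
        tokens = min(capacity, tokens + refill_per_tick)
        # New elastic work arrives and joins whatever is still waiting.
        queued += demand_per_tick
        # Admit as much queued work as the available tokens allow.
        granted = min(tokens, queued)
        tokens -= granted
        queued -= granted
        # "Exhausted" here means work is waiting but no tokens are left.
        if tokens == 0 and queued > 0:
            exhausted_ticks += 1
    return exhausted_ticks / ticks, queued


if __name__ == "__main__":
    # Demand outpaces the refill rate: exhausted on almost every tick, queue grows.
    print(simulate_elastic_bucket(refill_per_tick=1, demand_per_tick=2, capacity=5, ticks=100))
    # → (0.96, 96.0)
    # Demand below the refill rate: no exhaustion and no queue.
    print(simulate_elastic_bucket(refill_per_tick=2, demand_per_tick=1, capacity=5, ticks=100))
    # → (0.0, 0.0)
```

The initial bucket capacity absorbs the first few ticks of excess demand; after it drains, the bucket stays empty for the rest of the run, so the exhausted fraction approaches 1 as the run gets longer.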

dt commented 1 month ago
[Screenshots: 2024-09-23 at 14:17:33 and 2024-09-23 at 14:17:26]