cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

backup: elevated tail latencies in SQL workload while backing up to s3 #115190

Open dt opened 9 months ago

dt commented 9 months ago

We have observed that backups to S3 cause increased tail latencies in foreground traffic, sometimes significantly.

We currently see some cases where 600+ of heap is in use by the chunk buffers in the SDK, leading to increased rates of GC (even absent memory pressure, simply because of its size relative to the live heap when a reasonable GOGC is not set, e.g. due to #115164).

These more frequent GC runs appear to also see higher per-run pause times, sometimes much higher.

The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for content checksum and signing respectively. Due in large part to https://github.com/golang/go/issues/64417, this appears to cause long GC pause times: traces show STW pauses overlapping with block hashing.

This is a tracking issue for all related issues.

Jira issue: CRDB-33924

blathers-crl[bot] commented 9 months ago

cc @cockroachdb/disaster-recovery