cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: investigate fsync latency spikes #106231

Open jbowens opened 1 year ago

jbowens commented 1 year ago

We've seen many instances of fsync latency spikes in cloud clusters (including in cockroachlabs/support#2395). These spikes can last 10+ seconds, yet still fall short of the 20 seconds required for disk stall detection to terminate the node.

These fsync latency stalls can be extremely disruptive to the cluster. In cockroachlabs/support#2395, overall throughput tanked as every worker in the bounded worker pool eventually became stuck on an operation waiting for the slow disk. There are already issues (e.g., #88699) tracking the work to reduce the impact of one node's slow disk on overall cluster throughput. But I think there's something additional to investigate with respect to cloud platforms and why these stalls occur.

We should try to reproduce across cloud providers and investigate. For example, write a roachtest that demonstrates the issues mentioned above.
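As a starting point for reproduction outside of a roachtest, a standalone probe can time individual fsync calls and flag the ones that are slow but still under the stall threshold. The following is only a minimal sketch: the 20-second value comes from the description above, while `slowThreshold`, `classify`, `timeSync`, the 4 KiB write size, and the iteration count are hypothetical choices made for illustration and are not CockroachDB or Pebble APIs.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	// slowThreshold is an illustrative cutoff, not a CockroachDB setting.
	slowThreshold = 100 * time.Millisecond
	// stallThreshold mirrors the 20 seconds mentioned in this issue.
	stallThreshold = 20 * time.Second
)

// classify buckets a sync duration: "stall" at or above the stall
// threshold, "slow" at or above the slow threshold, otherwise "ok".
func classify(d, slow, stall time.Duration) string {
	switch {
	case d >= stall:
		return "stall"
	case d >= slow:
		return "slow"
	default:
		return "ok"
	}
}

// timeSync appends buf to f and returns how long the following fsync took.
func timeSync(f *os.File, buf []byte) (time.Duration, error) {
	if _, err := f.Write(buf); err != nil {
		return 0, err
	}
	start := time.Now()
	if err := f.Sync(); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func run() error {
	// The default temp directory may not live on the disk under test;
	// pass a directory on that disk as the first argument if needed.
	f, err := os.CreateTemp("", "fsync-probe-*")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 4096)
	for i := 0; i < 100; i++ {
		d, err := timeSync(f, buf)
		if err != nil {
			return err
		}
		if c := classify(d, slowThreshold, stallThreshold); c != "ok" {
			fmt.Printf("sync %d took %s (%s)\n", i, d, c)
		}
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Running this alongside a workload that saturates the volume's provisioned IOPS would show whether the spikes appear at the filesystem level independent of CockroachDB.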

Informs #107623.

Jira issue: CRDB-29450

jbowens commented 1 year ago

We discussed this during storage triage, and a few other avenues of exploration and possible remedies came up.

[@RaduBerinde]: The metrics that CockroachDB surfaces (e.g., through timeseries) have very coarse granularity: we collect Store metrics every 10 seconds. This makes it very difficult to observe momentary IOPS exhaustion. Surfacing higher-fidelity metrics here could help.
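To illustrate why a 10-second collection interval can hide momentary exhaustion, here is a small sketch that summarizes per-second samples into windows and reports both the mean and the peak. The sample values, the `windowStat` type, and the `summarize` function are made up for illustration and do not reflect how CockroachDB actually aggregates Store metrics.

```go
package main

import "fmt"

// windowStat holds the mean and peak of the samples in one window.
type windowStat struct {
	Mean, Max float64
}

// summarize splits samples into consecutive windows of the given size
// and returns the mean and peak of each. Samples are assumed to be
// non-negative. A non-positive window yields no output.
func summarize(samples []float64, window int) []windowStat {
	if window <= 0 {
		return nil
	}
	var out []windowStat
	for start := 0; start < len(samples); start += window {
		end := start + window
		if end > len(samples) {
			end = len(samples)
		}
		var sum, peak float64
		for _, s := range samples[start:end] {
			sum += s
			if s > peak {
				peak = s
			}
		}
		out = append(out, windowStat{Mean: sum / float64(end-start), Max: peak})
	}
	return out
}

func main() {
	// Hypothetical per-second IOPS: nine quiet seconds and one burst.
	samples := []float64{500, 500, 500, 500, 3000, 500, 500, 500, 500, 500}
	for _, w := range summarize(samples, 10) {
		fmt.Printf("mean=%.0f max=%.0f\n", w.Mean, w.Max)
	}
}
```

With these made-up numbers the 10-second mean is 750 while the one-second peak is 3000, so a dashboard built only on the mean would not show the burst. Recording a per-window maximum or a histogram alongside the mean is one possible way to surface it.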

Should we be momentarily exhausting IOPS, short of implementing a user-level IO scheduler, we could:

jbowens commented 1 year ago

WIP roachtest for inducing IOPS starvation: https://github.com/cockroachdb/cockroach/compare/master...jbowens:cockroach:overload-iops-roachtest?expand=1