
storage: consider adding configurable guardrails against disk exhaustion #134333

Open yuzefovich opened 2 weeks ago

yuzefovich commented 2 weeks ago

Is it possible to extend the vfs API so that new files cannot be created once more than 95% of the underlying disk space is used? Ditto for the pebble API that we use in the row-by-row engine for temp storage.

Jira issue: CRDB-44024
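To make the request concrete, here is a minimal sketch of such a guardrail. It deliberately uses a simplified stand-in interface rather than Pebble's actual vfs.FS (which has many more methods a real wrapper would need to forward); the 95% threshold and the statfs-based usage check are illustrative only, and the code is Unix-only.

```go
// Illustrative sketch only, not CockroachDB or Pebble code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// fileCreator is a stand-in for the small slice of a vfs-like API relevant here.
type fileCreator interface {
	Create(name string) (*os.File, error)
}

type osCreator struct{}

func (osCreator) Create(name string) (*os.File, error) { return os.Create(name) }

// guardedFS refuses to create new files once the filesystem containing dir is
// more than maxUsedFraction full.
type guardedFS struct {
	inner           fileCreator
	dir             string
	maxUsedFraction float64
}

// usedFraction reports what fraction of the filesystem's space is currently in use.
func (g guardedFS) usedFraction() (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(g.dir, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	if total == 0 {
		return 0, fmt.Errorf("statfs reported zero total space for %s", g.dir)
	}
	return 1 - avail/total, nil
}

// Create checks disk fullness before delegating to the wrapped implementation.
func (g guardedFS) Create(name string) (*os.File, error) {
	used, err := g.usedFraction()
	if err != nil {
		return nil, err
	}
	if used > g.maxUsedFraction {
		return nil, fmt.Errorf("refusing to create %s: disk %.0f%% full (limit %.0f%%)",
			name, used*100, g.maxUsedFraction*100)
	}
	return g.inner.Create(name)
}

func main() {
	fs := guardedFS{inner: osCreator{}, dir: os.TempDir(), maxUsedFraction: 0.95}
	if f, err := fs.Create(filepath.Join(os.TempDir(), "spill-example")); err != nil {
		fmt.Println("spill rejected:", err)
	} else {
		f.Close()
	}
}
```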

yuzefovich commented 1 week ago

(Original issue text)

When running TPCH Q1 on Scale Factor 1000 (about 1TB of data) we had a node crash, seemingly due to running out of disk (?). We did see "temp disk storage: disk budget exceeded" errors in the workload. We should investigate why the crash happened as well as consider the following questions:

See this slack thread for more information.


I've looked into this, and I don't think we can do better from the Queries perspective. I set up a 3-node cluster with two 50GB stores on each node and imported the TPCH SF50 dataset. This made each store use about 40GB (80%) of its capacity, and then I ran TPCH Q1 with different concurrencies.

With concurrency 1, via the sql.disk.distsql.spilled.bytes.written metric I confirmed that what we account for against the --max-disk-temp-storage limit (32GiB by default) matches the capacity actually used on disk. One caveat here is that the temporary disk storage usage does not count against the "used" number; instead, it reduces the "usable" number.

[Screenshot from 2024-11-12 showing the capacity numbers discussed above]

This way the used percentage will increase, but slowly - the second store on each node still has the same amount of space available.

Then I issued Q1 with concurrency 10, and it led to disk exhaustion on store 1 on n2 and store 1 on n3, killing the cockroach processes on both nodes and making the cluster unavailable. Our disk space accounting didn't help because we didn't reach the 32GiB limit on either node (we only had about 10GB free on each store).
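The gap here is that the 32GiB --max-disk-temp-storage budget is a purely logical number that is never compared against the bytes actually available on the volume. Below is a hedged sketch of an additional physical-space check that could catch a situation like the one above; the function name, the 10% free-space floor, and the standalone statfs call are invented for illustration and are not how CockroachDB's disk monitors are actually structured.

```go
// Illustrative only: checkSpillReservation and the 10% free-space floor are made up.
package main

import (
	"fmt"
	"syscall"
)

// checkSpillReservation approves a request to spill `request` more bytes only if
// (a) the logical temp-storage budget has room, and (b) the filesystem backing
// tempDir would still keep at least minFreeFraction of its space free.
func checkSpillReservation(budgetRemaining, request int64, tempDir string, minFreeFraction float64) error {
	if request > budgetRemaining {
		return fmt.Errorf("temp disk budget exceeded: need %d bytes, %d remaining", request, budgetRemaining)
	}
	var st syscall.Statfs_t
	if err := syscall.Statfs(tempDir, &st); err != nil {
		return err
	}
	total := int64(st.Blocks) * int64(st.Bsize)
	avail := int64(st.Bavail) * int64(st.Bsize)
	if float64(avail-request) < minFreeFraction*float64(total) {
		return fmt.Errorf("refusing to spill %d bytes: only %d bytes free on %s", request, avail, tempDir)
	}
	return nil
}

func main() {
	// Even a request well within the 32GiB logical budget can be rejected
	// if the volume backing /tmp is low on actual free space.
	fmt.Println(checkSpillReservation(32<<30, 8<<30, "/tmp", 0.10))
}
```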


At this point, I'll re-route this issue to the Storage team to see whether we can put some guardrails in place to prevent this type of crash. In particular, I wonder whether it is possible to extend the vfs API so that new files cannot be created once more than 95% of the underlying disk space is used. Ditto for the pebble API that we use in the row-by-row engine for temp storage.

jbowens commented 3 days ago

Guardrails are tricky for the primary storage engine. Specifically, we can't stop writing to disk because:

  1. Stopping the commit of new raft log entries has a similar effect to a disk stall. This results in unavailability of all ranges for which the node is the leaseholder, even if the disk capacity issue is local to this specific node. If the capacity issue is local to the node, it's preferable for us to crash and allow leases to move than to stay up, incapable of making progress. If the node is the leaseholder of the liveness range, this is 💀 . There are various ways in which we could temporarily stop some writes (a hypothetical sketch of one appears after this list), but they all end in needing to stop all writes very soon, which is no good.
  2. In order to reclaim space, we need to write new files. Tombstones (including range deletion tombstones written during replica removals) must be compacted before the data they delete is actually reclaimed from disk.
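Here is that hypothetical sketch. None of the categories or hooks below correspond to an actual Pebble or CockroachDB API; the point is only that even a selective guardrail must keep admitting raft and compaction writes, which is why it buys time rather than solving the problem.

```go
// Hypothetical only: the write categories below are invented for illustration.
package main

import "fmt"

type writeCategory int

const (
	categoryForeground writeCategory = iota // SQL-driven writes, temp spilling
	categoryRaftLog                         // raft log appends; blocking these looks like a disk stall
	categoryCompaction                      // compactions, which are required to reclaim space
)

// admitWrite decides whether a write of the given category may proceed when the
// disk is usedFraction full. Above the soft limit we shed foreground work, but
// raft and compaction writes must keep flowing, so disk usage can still grow
// until stopping everything (i.e., crashing) is the only remaining option.
func admitWrite(cat writeCategory, usedFraction, softLimit float64) error {
	if usedFraction <= softLimit {
		return nil
	}
	switch cat {
	case categoryRaftLog, categoryCompaction:
		return nil // cannot block these without stalling the node or preventing space reclamation
	default:
		return fmt.Errorf("disk %.0f%% full: rejecting foreground write", usedFraction*100)
	}
}

func main() {
	for _, cat := range []writeCategory{categoryForeground, categoryRaftLog, categoryCompaction} {
		fmt.Println(cat, admitWrite(cat, 0.97, 0.95))
	}
}
```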

#74104 and #79210 discuss the generalized problem of preventing out-of-disk scenarios where storage is underprovisioned relative to the configured zone configurations.

In the context of spilling to disk, it seems more reasonable to forcefully reject writes. Writes to the row-based temp engine, however, are shared across queries, and the LSM would need to be able to write new files in order to delete old ones. For vectorized spilling to disk, which just uses a vfs.FS directly, I think we could reasonably error out if the disk is approaching capacity.
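As a sketch of the desired failure mode for that vectorized path, the snippet below only illustrates error propagation: the guardrail error fails the individual query with a descriptive message while the node and other queries keep running. The error value and helper names are invented.

```go
// Sketch of surfacing a capacity guardrail as a query error instead of a node crash.
package main

import (
	"errors"
	"fmt"
)

var errDiskNearCapacity = errors.New("temp storage disk is more than 95% full")

// writeSpillChunk stands in for the vectorized spilling path writing one chunk
// of a sorted run to its temp file.
func writeSpillChunk(nearCapacity bool) error {
	if nearCapacity {
		return errDiskNearCapacity
	}
	return nil // pretend the chunk was written
}

// runQuery shows the desired failure mode: this query fails with a descriptive
// error, while the node (and every other query) keeps running.
func runQuery() error {
	if err := writeSpillChunk(true); err != nil {
		if errors.Is(err, errDiskNearCapacity) {
			return fmt.Errorf("query requires spilling to disk, but %w; retry after disk space is freed", err)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(runQuery())
}
```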