yuzefovich opened 2 weeks ago
(Original issue text)
When running TPCH Q1 on Scale Factor 1000 (about 1TB of data) we had a node crash, seemingly due to running out of disk (?). We did see a `temp disk storage: disk budget exceeded` error in the workload. We should investigate why the crash happened, as well as consider whether the `disk budget exceeded` error could explain the crash. See this slack thread for more information.
I've looked into this, and I don't think we can do better from the Queries perspective. I set up a 3-node cluster with 2 stores of 50GB each on every node, and imported the TPCH SF50 dataset. This made it so that each store used up about 40GB (80%) of capacity, and then I ran TPCH Q1 with different concurrencies.
With concurrency 1, via the `sql.disk.distsql.spilled.bytes.written` metric I confirmed that what we account for against the `--max-disk-temp-storage` limit (which is 32GiB by default) matches the capacity actually used on disk. One caveat here is that the temporary disk storage usage does not count against the `used` number; instead, it reduces the `usable` number. This way the `used` percentage does increase, but slowly, and we still have the same amount of space available on the second store on each node.
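(As a rough illustration, assuming the used percentage is simply `used / usable`: with 40GB used out of 50GB usable a store sits at 80%; if queries spill 5GB to temp storage, `usable` drops to 45GB and the percentage climbs to roughly 89%, even though `used` itself hasn't moved.)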
Then I issued Q1 with concurrency 10, and it led to disk exhaustion on store 1 of n2 and store 1 of n3, killing the `cockroach` processes on both nodes and making the cluster unavailable. Our disk space accounting didn't help because we didn't reach the 32GiB limit on either node (we only had about 10GB free on each store).
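To make the mismatch concrete, here is a minimal Go sketch (the helper name, directory, and safety margin are all hypothetical, not existing CockroachDB code) of what capping the configured temp-storage budget by the disk space that is actually free might look like:

```go
package main

import (
	"fmt"
	"syscall"
)

// effectiveTempBudget is a hypothetical helper (not existing CockroachDB code):
// it caps the configured --max-disk-temp-storage budget by what the filesystem
// backing tempDir actually has free, keeping a small safety margin so the
// store can never be driven to 100% full. Linux-only sketch via syscall.Statfs.
func effectiveTempBudget(tempDir string, configured uint64, margin float64) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(tempDir, &st); err != nil {
		return 0, err
	}
	// Bytes available to unprivileged processes on the filesystem.
	avail := st.Bavail * uint64(st.Bsize)
	// Leave `margin` (e.g. 0.05 -> 5%) of the free space untouched.
	usable := uint64(float64(avail) * (1 - margin))
	if usable < configured {
		return usable, nil
	}
	return configured, nil
}

func main() {
	// In the repro above: 32GiB configured, but only ~10GB actually free,
	// so the effective budget would be ~9.5GB rather than 32GiB.
	budget, err := effectiveTempBudget("/mnt/data1", 32<<30, 0.05)
	if err != nil {
		panic(err)
	}
	fmt.Printf("effective temp budget: %d bytes\n", budget)
}
```

With the numbers from the repro, such a cap would shrink the 32GiB budget to roughly the ~9.5GB of headroom that actually existed on each store.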
At this point, I'll re-route this issue to the Storage team to see whether we can put some guardrails in place to prevent this type of crash. In particular, I wonder whether it is possible to extend the `vfs` API so that we wouldn't be able to create new files once more than 95% of the underlying disk space is used. Ditto something similar for the Pebble API that we use in the row-by-row engine for temp storage.
Guardrails are tricky for the primary storage engine. Specifically, we can't simply stop writing to disk (the LSM, for one, needs to write new files in order to compact and delete old ones).
In the context of spilling to disk, it seems more reasonable to forcefully reject writes. Writes to the row-based temp engine, however, are shared across queries, and the LSM would need to be able to write new files in order to delete old files. For the vectorized spilling to disk that just uses a `vfs.FS` directly, I think we could reasonably error out if the disk is approaching capacity.
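A minimal sketch of what such a guardrail could look like for the vectorized spilling path, assuming a thin wrapper around the spill directory's filesystem (this is a simplified stand-in, not Pebble's actual `vfs.FS` interface, and the 95% threshold is just the number floated above):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// spillFS is a hypothetical wrapper around the filesystem used for vectorized
// spilling; it is a simplified stand-in, not Pebble's actual vfs.FS interface.
// Before creating a new spill file it checks how full the underlying disk is
// and refuses once utilization crosses maxUtilization (e.g. 0.95).
type spillFS struct {
	dir            string  // directory spill files are created in
	maxUtilization float64 // e.g. 0.95 -> reject new files above 95% full
}

// Create makes a new spill file, or returns an error if the disk backing
// dir is already above the configured utilization threshold. Linux-only
// sketch via syscall.Statfs.
func (fs spillFS) Create(name string) (*os.File, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(fs.dir, &st); err != nil {
		return nil, err
	}
	total := st.Blocks * uint64(st.Bsize)
	avail := st.Bavail * uint64(st.Bsize)
	utilization := 1 - float64(avail)/float64(total)
	if utilization > fs.maxUtilization {
		return nil, fmt.Errorf(
			"temp disk storage: refusing to create %s, disk is %.0f%% full (threshold %.0f%%)",
			name, utilization*100, fs.maxUtilization*100)
	}
	return os.Create(filepath.Join(fs.dir, name))
}

func main() {
	fs := spillFS{dir: os.TempDir(), maxUtilization: 0.95}
	f, err := fs.Create("spill-000001.tmp")
	if err != nil {
		fmt.Println("spill rejected:", err)
		return
	}
	defer f.Close()
	fmt.Println("created spill file:", f.Name())
}
```

With a check like this in front of every new spill file, a runaway query would fail with a temp-storage error instead of driving the store to 100% and taking the node down.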
> Is it possible to extend the `vfs` API so that we wouldn't be able to create new files once more than 95% of the underlying disk space is used? Ditto something similar for the pebble API that we use in the row-by-row engine for temp storage.

Jira issue: CRDB-44024