This was caused by CPU admission control on n10 with the system waiting on the kv-regular-store-queue
for up to 5 seconds during the decommission process. I'm not sure why this kicked in during this test.
Here is a link to the stats at the time.
The CPU was reasonable (~50%), the goroutine count was low, and the goroutine scheduler latency was also low:
Attached is a profile of a slow operation:
2024-11-15T07_15_40Z-UPSERT-5796.155ms.zip
I think this slowdown happened because the system was so "overprovisioned" for this workload.
The config is:
2024/11/15 06:57:22 framework.go:404: test variations are: seed: 6732599062455550972, fillDuration: 10m0s, maxBlockBytes: 1, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 10000, numNodes: 12, numWorkloadNodes: 1, vcpu: 32, disks: 2, memory: standard, leaseType: epoch, cloud: gce, perturbation: {drain:true}
I am not familiar with the test, so I am providing a purely AC overload analysis below. Hopefully this helps in investigating the issue.
Regarding "This was caused by CPU admission control on n10": what evidence did we have for CPU AC kicking in? As for the "system waiting on the kv-regular-store-queue for up to 5 seconds during the decommission process", I can see that in the metrics as well, but note that the store queue is for stores (i.e. IO overload), not CPU.
Now looking into logs.
My take here is that we are seeing IO overload due to the growth of L0 (see metrics above), and the reason for the high growth appears to be replication writes that are bypassing AC (from the logs). Example: requests 125912 (87037 bypassed) with 5.5 MiB acc-write (3.9 MiB bypassed).
I'm not familiar enough with the test to know whether we expect to have so many bypassed writes. But if these writes are bypassing AC, it is either an issue with the RAC integration or the workload is simply overloading the system.
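To put "so many bypassed writes" in perspective, a quick back-of-the-envelope on the log line quoted above (plain arithmetic, not framework code):

```go
package main

import "fmt"

func main() {
	// Figures from the AC log line quoted above:
	// "requests 125912 (87037 bypassed) with 5.5 MiB acc-write (3.9 MiB bypassed)".
	totalReqs, bypassedReqs := 125912.0, 87037.0
	totalWriteMiB, bypassedWriteMiB := 5.5, 3.9

	fmt.Printf("bypassed requests:    %.0f%%\n", 100*bypassedReqs/totalReqs)         // ~69%
	fmt.Printf("bypassed write bytes: %.0f%%\n", 100*bypassedWriteMiB/totalWriteMiB) // ~71%
}
```

So roughly two-thirds of the requests and write bytes hitting this store were not subject to admission control.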
One side thing to rule out would be bandwidth saturation, but I doubt that is the case given this is a GCP cluster (high provisioned bandwidth by default) and we have evidence of high bypassed writes.
@andrewbaptist Let me know if this is helpful. It seems to be the classic case of replicated writes overloading the store.
Thanks for the clarification on the AC metric, I had misread it. You are correct that it is due to IO overload.
To clarify what this test does: it determines what "50% usage" of the cluster is and then runs a constant workload at that rate, expecting consistent throughput and latency while it makes a change. The rate for this cluster was 134,832 requests/sec.
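As a rough sketch of that rate calculation (the maxRate value here is hypothetical, chosen only so the arithmetic lines up with the reported number; the identifiers are not the framework's):

```go
package main

import "fmt"

func main() {
	// The steady rate is a fraction of the maximum rate found during fill.
	maxRate := 269664.0 // req/s; hypothetical maximum for this cluster
	ratioOfMax := 0.5   // from the test variations above (ratioOfMax: 0.500000)
	targetRate := maxRate * ratioOfMax
	fmt.Printf("steady workload rate: %.0f req/s\n", targetRate) // 134832, matching the report
}
```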
It then runs a decommission of the node.
The decommission runs from 07:15:06 to 07:15:43. During this ~37s window, there are a lot of background snapshots sent, but the regular traffic load does not change (or at least should not change).
However, here we see the throughput drop by about a factor of 4, because some requests experience significant AC delay.
What I don't understand is why this test is causing IO overload at all; I expected the snapshots to go straight to L6. If the system can't handle the incoming rate, I would expect it to either: 1) slow down snapshot transfers, or 2) increase the compaction rate to prevent overload.
I will take a bit more of a look this afternoon to see why the LSM is getting so inverted.
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ e83bc46aa42f2476b4b11b9703b8038c660dc980:
(assertions.go:363).Fail:
Error Trace: pkg/cmd/roachtest/tests/perturbation/framework.go:645
pkg/cmd/roachtest/test_runner.go:1307
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/decommission
Messages: FAILURE: follower-read : Increase 11.6611 > 5.0000 BASE: 20.186537ms SCORE: 235.397432ms
FAILURE: read : Increase 11.9632 > 5.0000 BASE: 19.833571ms SCORE: 237.273606ms
FAILURE: write : Increase 14.5329 > 5.0000 BASE: 17.030372ms SCORE: 247.500542ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1
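For context on the FAILURE messages above: the check at framework.go:645 is essentially a ratio of perturbed-phase latency (SCORE) to baseline latency (BASE) compared against a threshold. A minimal sketch with illustrative names, not the framework's actual function:

```go
package main

import "fmt"

// checkLatencyIncrease is an illustrative stand-in for the assertion at
// framework.go:645: the test fails when the perturbed-phase latency (SCORE)
// exceeds the baseline latency (BASE) by more than maxIncrease.
func checkLatencyIncrease(op string, baseMillis, scoreMillis, maxIncrease float64) bool {
	increase := scoreMillis / baseMillis
	if increase > maxIncrease {
		fmt.Printf("FAILURE: %s : Increase %.4f > %.4f\n", op, increase, maxIncrease)
		return false
	}
	return true
}

func main() {
	// Numbers from the "write" failure above: 247.500542ms / 17.030372ms ≈ 14.53 > 5.0.
	checkLatencyIncrease("write", 17.030372, 247.500542, 5.0)
}
```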
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=4
encrypted=false
fs=ext4
localSSD=true
runtimeAssertionsBuild=false
ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Following up on our Slack discussion.
I want to wrap up the open thread here. The TLDR is that we saw L0 sublevel growth simply because bytes into L0 > compacted bytes out of L0. We see from metrics and logs that there were no snapshots being ingested into L0, so pacing snapshots would do nothing to help here. A quick mitigation here is to increase compaction concurrency for this test using the env variable.
Some more details about the above:
- From the logs: L0 growth 32 MiB (write 32 MiB (ignored 0 B)). Note that no ingests are landing in L0, so slowing down snapshot ingest would do nothing to help here.
- Comparing that growth (bytes into L0) with the bytes compacted out of L0 shows a positive delta: compacted 28 MiB [≈21 MiB], i.e. roughly 4 MiB of net L0 growth over the window.
- These writes arrive at regular priority and bypass replication admission control; RACv2 should address that. From the logs: requests 127093 (88146 bypassed), which is a high number of bypassed writes.
- Ideally we would have replication AC for regular traffic. In the meantime, my recommendation is to increase the compaction concurrency for this test, since we have ample CPU.
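A minimal sketch of that mitigation, assuming the test passes the knob through roachprod cluster settings; the env-var name and the exact plumbing are assumptions to verify against the current code, not the actual change:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
)

// settingsWithMoreCompactions sketches how a roachtest could pass an extra
// environment variable to the cockroach nodes it starts. The variable name is
// an assumption: verify the current compaction-concurrency knob (historically
// COCKROACH_ROCKSDB_CONCURRENCY) in pkg/storage before relying on it.
func settingsWithMoreCompactions() install.ClusterSettings {
	settings := install.MakeClusterSettings()
	settings.Env = append(settings.Env, "COCKROACH_COMPACTION_CONCURRENCY=8")
	return settings
}

func main() {
	fmt.Println(settingsWithMoreCompactions().Env)
}
```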
I am going to remove the AC assignment from this issue. There is nothing actionable on the AC side here, other than having RACv2 for regular traffic, which is tracked separately.
Marking as a c-bug/a-testing based on the above comment which suggests increasing the compaction concurrency (configuration related).
Adding a link to #74697 since I don't see a more general story for auto-tuning compaction concurrency. Also a link to the current guidance: https://www.cockroachlabs.com/docs/stable/architecture/storage-layer#compaction
This would be hard to do metamorphically unless there were clearer guidance on how this should be tuned. And if we have that guidance, why wouldn't we encode it in the system rather than in the test framework?
I'll watch for additional failures on this test and try to get a set of workarounds for different hardware configurations.
Re: "And if we have that guidance, why wouldn't we encode it in the system rather than in the test framework?"
It's like manually tuning any rate: if there is ample room, the guidance is to keep increasing until the desired effect is reached without over-utilizing the CPU. We keep the base low to avoid over-utilizing the CPU.
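A purely illustrative sketch of that tuning loop (assumed thresholds and names, not CockroachDB's actual auto-tuning logic, which is what the linked issues are about):

```go
package main

import "fmt"

// nextCompactionConcurrency mirrors the manual guidance above: raise
// compaction concurrency only while there is CPU headroom and L0 is still
// growing; otherwise hold at the current (low) base.
func nextCompactionConcurrency(cur int, cpuUtil, l0NetGrowthMiB float64) int {
	const cpuHeadroomLimit = 0.8 // assumed utilization ceiling
	if l0NetGrowthMiB > 0 && cpuUtil < cpuHeadroomLimit {
		return cur + 1 // compactions are behind and CPU has room: compact harder
	}
	return cur // either caught up or CPU-bound: keep the base low
}

func main() {
	fmt.Println(nextCompactionConcurrency(3, 0.50, 4)) // 4: ~50% CPU and L0 growing
	fmt.Println(nextCompactionConcurrency(4, 0.85, 4)) // 4: CPU too hot, hold steady
}
```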
Adding a link to https://github.com/cockroachdb/cockroach/issues/74697
Thanks! We have actively started working on a possible solution to the problem. @itsbilal you might find this interesting as you are starting to look at a design and prototype for such a case.
I think this is a more current version of it https://github.com/cockroachdb/pebble/issues/1329.
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=32
encrypted=false
fs=ext4
localSSD=true
runtimeAssertionsBuild=false
ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
/cc @cockroachdb/kv-triage
Jira issue: CRDB-44411