cockroachdb / cockroach


roachtest: perturbation/metamorphic/decommission failed #135241

Open cockroach-teamcity opened 1 week ago

cockroach-teamcity commented 1 week ago

roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:

(assertions.go:363).Fail: 
    Error Trace:    pkg/cmd/roachtest/tests/perturbation/framework.go:493
                                pkg/cmd/roachtest/test_runner.go:1307
                                src/runtime/asm_amd64.s:1695
    Error:          Should be true
    Test:           perturbation/metamorphic/decommission
    Messages:       FAILURE: follower-read  : Increase 9.6046 > 5.0000 BASE: 3.850596ms SCORE: 36.983429ms

                    FAILURE: read           : Increase 9.2015 > 5.0000 BASE: 3.917431ms SCORE: 36.046384ms

                    FAILURE: write          : Increase 14.2758 > 5.0000 BASE: 3.817288ms SCORE: 54.494675ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1
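
For context on the failure message: the framework compares each operation's latency during the perturbation (SCORE) against its baseline (BASE) and fails if the ratio (Increase) exceeds the acceptance threshold, 5.0 here. A minimal sketch of that check, using hypothetical names rather than the framework's actual code:

```go
package main

import "fmt"

// checkLatencyRatio is a hypothetical stand-in for the framework's assertion:
// latency during the perturbation (score) may be at most maxIncrease times
// the baseline latency (base), otherwise the test fails.
func checkLatencyRatio(op string, baseMillis, scoreMillis, maxIncrease float64) bool {
	increase := scoreMillis / baseMillis
	if increase > maxIncrease {
		fmt.Printf("FAILURE: %s: Increase %.4f > %.4f BASE: %.3fms SCORE: %.3fms\n",
			op, increase, maxIncrease, baseMillis, scoreMillis)
		return false
	}
	return true
}

func main() {
	// Numbers from the report above: 36.983429 / 3.850596 ≈ 9.6046 > 5.0, so the check fails.
	checkLatencyRatio("follower-read", 3.850596, 36.983429, 5.0)
}
```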

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44411

andrewbaptist commented 1 week ago

This was caused by CPU admission control on n10 with the system waiting on the kv-regular-store-queue for up to 5 seconds during the decommission process. I'm not sure why this kicked in during this test.

Here is a link to the stats at the time.

The CPU was reasonable (~50%), the goroutine count was low and the goroutine scheduler latency was also low:

Attached is a profile of a slow operation:

2024-11-15T07_15_40Z-UPSERT-5796.155ms.zip

I think the reason for this slowdown was that the system was so "overprovisioned" for this workload.

The config is:

2024/11/15 06:57:22 framework.go:404: test variations are: seed: 6732599062455550972, fillDuration: 10m0s, maxBlockBytes: 1, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 10000, numNodes: 12, numWorkloadNodes: 1, vcpu: 32, disks: 2, memory: standard, leaseType: epoch, cloud: gce, perturbation: {drain:true}
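
For readability, the logged variations map onto a parameter set roughly like the following. This is a hypothetical mirror of the log line for illustration, not the framework's actual type:

```go
package main

import (
	"fmt"
	"time"
)

// testVariations mirrors the parameters in the log line above; the field
// names follow the log, not necessarily the framework's own struct.
type testVariations struct {
	seed                 int64
	fillDuration         time.Duration
	maxBlockBytes        int
	perturbationDuration time.Duration
	validationDuration   time.Duration
	ratioOfMax           float64
	splits               int
	numNodes             int
	numWorkloadNodes     int
	vcpu                 int
	disks                int
	memory               string
	leaseType            string
	cloud                string
	perturbation         string
}

func main() {
	v := testVariations{
		seed:                 6732599062455550972,
		fillDuration:         10 * time.Minute,
		maxBlockBytes:        1,
		perturbationDuration: 10 * time.Second,
		validationDuration:   5 * time.Minute,
		ratioOfMax:           0.5,
		splits:               10000,
		numNodes:             12,
		numWorkloadNodes:     1,
		vcpu:                 32,
		disks:                2,
		memory:               "standard",
		leaseType:            "epoch",
		cloud:                "gce",
		perturbation:         "decommission (drain:true)",
	}
	fmt.Printf("%+v\n", v)
}
```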
aadityasondhi commented 1 week ago

I am not familiar with the test, so I am providing a purely AC overload analysis below. Hopefully this helps in investigating the issue.

> This was caused by CPU admission control on n10

What evidence did we have for CPU AC kicking in?

> system waiting on the kv-regular-store-queue for up to 5 seconds during the decommission process

I can see that in the metrics as well. Note that the store work queue is for store overload (i.e. IO overload), not CPU.

- P99 SQL latency spike near the failure [image]
- Correlates well with IO overload:
  - Store work queue has waiting requests [image]
  - Substantial spike in L0 sublevels [image]
  - Compaction queue building up [image]
- Read + write bandwidth, for reference in case it turns out to be useful later [image]

Now looking into logs.

IO load listener
> I241115 07:15:25.876713 654 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n10,s20] 52223 IO overload: compaction score 0.650 (466 ssts, 40 sub-levels), L0 growth 32 MiB (write 32 MiB (ignored 0 B) ingest 118 KiB (ignored 30 MiB)): requests 127093 (88146 bypassed) with 5.6 MiB acc-write (3.9 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 32 MiB adjusted-LSM-writes + 690 MiB adjusted-disk-writes + write-model 3.00x+124 B (smoothed 2.85x+62 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 21.63x+1 B (smoothed 33.12x+1 B) + at-admission-tokens 188 B, compacted 28 MiB [≈21 MiB], flushed 469 MiB [≈0 B] (mult 1.00); admitting 16 MiB (rate 1.1 MiB/s) (elastic 1 B rate 0 B/s) due to L0 growth (used total: 15 MiB elastic 0 B); write stalls 0; diskBandwidthLimiter (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)
> I241115 07:15:25.876561 654 util/admission/io_load_listener.go:665 ⋮ [T1,Vsystem,n10,s19] 52222 IO overload: compaction score 0.600 (304 ssts, 37 sub-levels), L0 growth 30 MiB (write 29 MiB (ignored 14 KiB) ingest 92 KiB (ignored 20 MiB)): requests 125912 (87037 bypassed) with 5.5 MiB acc-write (3.9 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + 29 MiB adjusted-LSM-writes + 741 MiB adjusted-disk-writes + write-model 3.00x+106 B (smoothed 2.84x+53 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 25.18x+1 B (smoothed 34.41x+1 B) + at-admission-tokens 179 B, compacted 26 MiB [≈20 MiB], flushed 531 MiB [≈0 B] (mult 1.00); admitting 16 MiB (rate 1.1 MiB/s) (elastic 1 B rate 0 B/s) due to L0 growth (used total: 15 MiB elastic 0 B); write stalls 0; diskBandwidthLimiter (tokenUtilization 0.00, tokensUsed (elastic 0 B, snapshot 0 B, regular 0 B) tokens (write 0 B (prev 0 B)), writeBW 0 B/s, readBW 0 B/s, provisioned 0 B/s)

My take here is that we are seeing IO overload due to the growth of L0 (see metrics above), and the reason for the high growth seems to be replication writes that are bypassing AC (from the logs). Example: requests 125912 (87037 bypassed) with 5.5 MiB acc-write (3.9 MiB bypassed).
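
To make the mechanism concrete, here is a deliberately simplified model of what those io_load_listener lines describe: per 15s reporting interval, the store admits only a fraction of what compactions removed from L0, so sustained inflow above the compaction rate both grows L0 sublevels and queues foreground writes. The 0.75 factor below is chosen only to match the logged numbers (≈21 MiB smoothed compacted, 16 MiB admitted, i.e. ≈1.1 MiB/s over 15s); it is not claimed to be the real token formula:

```go
package main

import "fmt"

// simplifiedL0Tokens is a simplified model of the io_load_listener behavior
// quoted above, not CockroachDB's actual token math: per 15s interval,
// admission tokens are sized to a fraction of what compactions removed from
// L0, so inflow above the compaction rate grows L0 and throttles writes.
func simplifiedL0Tokens(bytesIntoL0MiB, compactedOutMiB float64) (tokensMiB, netGrowthMiB float64) {
	// 0.75 is an illustrative factor chosen to match the log; it is an assumption.
	tokensMiB = 0.75 * compactedOutMiB
	netGrowthMiB = bytesIntoL0MiB - compactedOutMiB
	return tokensMiB, netGrowthMiB
}

func main() {
	// Approximate numbers from s20: ~32 MiB into L0 vs ~21 MiB (smoothed) compacted out.
	tokens, growth := simplifiedL0Tokens(32, 21)
	fmt.Printf("admitting ~%.0f MiB this interval (≈1.1 MiB/s), L0 grows by ~%.0f MiB\n", tokens, growth)
}
```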

I'm not familiar enough with the test to know whether we expect this many bypassed writes. But if these writes are bypassing AC, then either there is an issue with the RAC integration or the workload is overloading the system.

One side thing to rule out would be disk bandwidth saturation, but I doubt that is the case given this is a GCP cluster (high provisioned bandwidth by default) and we have evidence of high bypassed writes.

@andrewbaptist Let me know if this is helpful. It seems to be the classic case of replicated writes overloading the store.

andrewbaptist commented 1 week ago

Thanks for the clarification on the AC metric, I had misread it. You are correct that it is due to IO overload.

To clarify what this test does: it determines what "50% usage" of the cluster is, then runs a constant workload at that rate, expecting consistent throughput and latency while it makes a change. The rate for this cluster was 134,832 requests/sec.
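
In other words, the framework first measures the maximum sustainable throughput during the fill phase and then runs the rest of the test at ratioOfMax of it. A sketch with assumed numbers (ratioOfMax 0.5 and the 134,832 requests/sec figure come from this run; the measured maximum is back-calculated for illustration):

```go
package main

import "fmt"

// targetRate is a hypothetical illustration of how the constant workload rate
// is derived: measure the cluster's maximum sustainable throughput, then run
// the remainder of the test at a fixed fraction (ratioOfMax) of it.
func targetRate(maxQPS, ratioOfMax float64) float64 {
	return maxQPS * ratioOfMax
}

func main() {
	// With ratioOfMax 0.5, an assumed measured maximum of ~269,664 QPS yields
	// the 134,832 requests/sec mentioned above.
	fmt.Printf("%.0f requests/sec\n", targetRate(269664, 0.5))
}
```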

It then runs a decommission of the node.

The decommission runs from 07:15:06 to 07:15:43. During this ~37s window, a lot of snapshots are sent in the background, but the regular traffic load does not change (or at least should not change).

However, here we see the throughput drop by about a factor of 4, because some requests experience significant AC delay.

What I don't understand is why this test is causing IO overload at all. I expected the snapshots to go straight to L6.

If the system can't handle the incoming rate, the expectation would be to either: 1) slow down snapshot transfers, or 2) increase the compaction rate to prevent overload.

I will take a bit more of a look this afternoon to see why the LSM is getting so inverted.

cockroach-teamcity commented 1 week ago

roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ e83bc46aa42f2476b4b11b9703b8038c660dc980:

(assertions.go:363).Fail: 
    Error Trace:    pkg/cmd/roachtest/tests/perturbation/framework.go:645
                                pkg/cmd/roachtest/test_runner.go:1307
                                src/runtime/asm_amd64.s:1695
    Error:          Should be true
    Test:           perturbation/metamorphic/decommission
    Messages:       FAILURE: follower-read  : Increase 11.6611 > 5.0000 BASE: 20.186537ms SCORE: 235.397432ms

                    FAILURE: read           : Increase 11.9632 > 5.0000 BASE: 19.833571ms SCORE: 237.273606ms

                    FAILURE: write          : Increase 14.5329 > 5.0000 BASE: 17.030372ms SCORE: 247.500542ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/decommission/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

aadityasondhi commented 6 days ago

Following our slack discussion.

I want to wrap up the open thread here. The TLDR is that we saw L0 sublevel growth simply because bytes into L0 > compacted bytes out of L0. We see from metrics and logs that there were no snapshots being ingested into L0, so pacing snapshots would do nothing to help here. A quick mitigation here is to increase compaction concurrency for this test using the env variable.

Some more details about the above:

Ideally, we will have replication AC for regular traffic. In the meantime, my recommendation for you is to increase the compaction concurrency for this test since we have ample CPU.
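
A back-of-the-envelope sketch of why raising compaction concurrency should close the gap, using the per-interval inflow from the earlier log and an assumed, purely illustrative per-slot compaction throughput:

```go
package main

import (
	"fmt"
	"math"
)

// slotsToKeepUp estimates how many concurrent compactions are needed for
// bytes compacted out of L0 to keep up with bytes flowing in. The per-slot
// throughput is an assumed number for illustration, not a measured constant.
func slotsToKeepUp(inflowMiBPerInterval, perSlotMiBPerInterval float64) int {
	return int(math.Ceil(inflowMiBPerInterval / perSlotMiBPerInterval))
}

func main() {
	// Earlier log: ~32 MiB into L0 per 15s interval while only ~21-28 MiB were
	// compacted out at the default concurrency. If each extra slot drains a
	// similar share, a modest bump in concurrency closes the gap.
	fmt.Println("compaction slots needed:", slotsToKeepUp(32, 10))
}
```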

I am going to remove the AC assignment from this issue. There is nothing actionable on the AC side here, other than having RACv2 for regular traffic, which is tracked separately.

kvoli commented 6 days ago

Marking as C-bug/A-testing based on the above comment, which suggests increasing the compaction concurrency (configuration related).

andrewbaptist commented 6 days ago

Adding a link to #74697 since I don't see a more general story for auto-tuning compaction concurrency. Also a link to the current guidance: https://www.cockroachlabs.com/docs/stable/architecture/storage-layer#compaction

This would be hard to do metamorphically unless there were clearer guidance on how this should be tuned. And if we have that guidance, why wouldn't we encode it in the system rather than in the test framework?

I'll watch for additional failures on this test and try to put together a set of workarounds for different hardware configurations.

aadityasondhi commented 5 days ago

> And if we have that guidance, why wouldn't we encode it in the system rather than in the test framework?

It's like manually tuning any rate: if there is ample headroom, the guidance is to keep increasing the concurrency until the desired effect is reached without over-utilizing the CPU. We keep the base value low to avoid over-utilizing the CPU.
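
Sketching that manual procedure as a feedback loop, purely for illustration (the thresholds and names below are assumptions, not a design proposed in this thread):

```go
package main

import "fmt"

// tuneCompactionConcurrency is a hypothetical illustration of the manual
// guidance above (and of what an auto-tuner along the lines of #74697 might
// do): raise concurrency while the LSM is backed up and there is CPU
// headroom, back off when the CPU is saturated. It is not CockroachDB code.
func tuneCompactionConcurrency(current, maxSlots, l0Sublevels int, cpuUtil float64) int {
	const (
		sublevelTarget = 10   // assumed "LSM is healthy" threshold
		cpuCeiling     = 0.80 // assumed CPU utilization ceiling
	)
	switch {
	case l0Sublevels > sublevelTarget && cpuUtil < cpuCeiling && current < maxSlots:
		return current + 1 // backed up with CPU to spare: increase
	case cpuUtil >= cpuCeiling && current > 1:
		return current - 1 // CPU saturated: back off
	default:
		return current
	}
}

func main() {
	// E.g. 40 L0 sublevels at ~50% CPU, as seen in this failure, would call
	// for more compaction concurrency.
	fmt.Println(tuneCompactionConcurrency(3, 16, 40, 0.50))
}
```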

> Adding a link to https://github.com/cockroachdb/cockroach/issues/74697

Thanks! We have actively started working on a possible solution to the problem. @itsbilal you might find this interesting as you are starting to look at a design and prototype for such a case.

aadityasondhi commented 5 days ago

I think this is a more current version of it https://github.com/cockroachdb/pebble/issues/1329.