admission: improve write request size estimation to account for proposed writes and follower writes

sumeerbhola commented 2 years ago

The existing size estimation logic uses the same write bytes adjustment for all requests. As a result, tiny writes that bypass admission control are charged the same per-request estimate as large writes and mistakenly consume all tokens. The following is an example of a kv0 workload using 64KB blocks, where admission control was turned off and then turned back on after the sublevel count exceeded 90. Even though a substantial number of byte tokens are available, none of them are being given out to regular requests. Also, the first estimate of +3.3 MiB/req is too high: the admission control accounting by the WorkQueue was off for much of the interval, so the requests have been undercounted.

I220607 18:00:09.409735 539 util/admission/granter.go:1745 ⋮ [s1] 13163  IO overload: 6147 ssts, 92 sub-levels, L0 growth 192 MiB: 0 B acc-write + 0 B acc-ingest + 192 MiB unacc [≈3.3 MiB/req, n=1732, bypassed=1337], compacted 296 MiB [≈431 MiB]; admitting 252 MiB with L0 penalty: +3.3 MiB/req, *0.50/ingest
I220607 18:00:24.410729 539 util/admission/granter.go:1745 ⋮ [s1] 13164  IO overload: 5991 ssts, 91 sub-levels, L0 growth 26 MiB: 0 B acc-write + 0 B acc-ingest + 26 MiB unacc [≈2.0 MiB/req, n=39, bypassed=39], compacted 254 MiB [≈342 MiB]; admitting 211 MiB with L0 penalty: +2.0 MiB/req, *0.50/ingest
I220607 18:00:39.409691 539 util/admission/granter.go:1745 ⋮ [s1] 13165  IO overload: 5532 ssts, 82 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈2.0 MiB/req, n=37, bypassed=37], compacted 486 MiB [≈414 MiB]; admitting 209 MiB with L0 penalty: +2.0 MiB/req, *0.50/ingest
I220607 18:00:54.409622 539 util/admission/granter.go:1745 ⋮ [s1] 13169  IO overload: 5213 ssts, 77 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈2.0 MiB/req, n=70, bypassed=70], compacted 492 MiB [≈453 MiB]; admitting 218 MiB with L0 penalty: +2.0 MiB/req, *0.50/ingest
I220607 18:01:09.410880 539 util/admission/granter.go:1745 ⋮ [s1] 13175  IO overload: 5087 ssts, 77 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈2.0 MiB/req, n=87, bypassed=87], compacted 182 MiB [≈317 MiB]; admitting 188 MiB with L0 penalty: +2.0 MiB/req, *0.50/ingest
I220607 18:01:24.409546 539 util/admission/granter.go:1745 ⋮ [s1] 13176  IO overload: 4515 ssts, 69 sub-levels, L0 growth 5.4 MiB: 0 B acc-write + 0 B acc-ingest + 5.4 MiB unacc [≈1.0 MiB/req, n=165, bypassed=165], compacted 704 MiB [≈511 MiB]; admitting 222 MiB with L0 penalty: +1.0 MiB/req, *0.50/ingest
I220607 18:01:39.410407 539 util/admission/granter.go:1745 ⋮ [s1] 13178  IO overload: 4306 ssts, 65 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈1.0 MiB/req, n=48, bypassed=48], compacted 311 MiB [≈411 MiB]; admitting 214 MiB with L0 penalty: +1.0 MiB/req, *0.50/ingest
I220607 18:01:54.410339 539 util/admission/granter.go:1745 ⋮ [s1] 13182  IO overload: 3996 ssts, 62 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈1.0 MiB/req, n=210, bypassed=210], compacted 362 MiB [≈387 MiB]; admitting 204 MiB with L0 penalty: +1.0 MiB/req, *0.50/ingest
I220607 18:02:09.413993 539 util/admission/granter.go:1745 ⋮ [s1] 13183  IO overload: 3831 ssts, 62 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈1.0 MiB/req, n=112, bypassed=112], compacted 275 MiB [≈331 MiB]; admitting 184 MiB with L0 penalty: +1.0 MiB/req, *0.50/ingest
I220607 18:02:24.409653 539 util/admission/granter.go:1745 ⋮ [s1] 13184  IO overload: 3349 ssts, 58 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈1.0 MiB/req, n=63, bypassed=63], compacted 479 MiB [≈405 MiB]; admitting 194 MiB with L0 penalty: +1.0 MiB/req, *0.50/ingest
I220607 18:02:39.410757 539 util/admission/granter.go:1745 ⋮ [s1] 13186  IO overload: 3262 ssts, 52 sub-levels, L0 growth 5.7 MiB: 0 B acc-write + 0 B acc-ingest + 5.7 MiB unacc [≈544 KiB/req, n=107, bypassed=107], compacted 152 MiB [≈278 MiB]; admitting 166 MiB with L0 penalty: +544 KiB/req, *0.50/ingest
I220607 18:02:54.410259 539 util/admission/granter.go:1745 ⋮ [s1] 13187  IO overload: 2909 ssts, 52 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈544 KiB/req, n=66, bypassed=66], compacted 582 MiB [≈430 MiB]; admitting 191 MiB with L0 penalty: +544 KiB/req, *0.50/ingest
I220607 18:03:09.410000 539 util/admission/granter.go:1745 ⋮ [s1] 13188  IO overload: 2703 ssts, 47 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈544 KiB/req, n=72, bypassed=72], compacted 272 MiB [≈351 MiB]; admitting 183 MiB with L0 penalty: +544 KiB/req, *0.50/ingest
I220607 18:03:24.409768 539 util/admission/granter.go:1745 ⋮ [s1] 13189  IO overload: 2325 ssts, 41 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈544 KiB/req, n=65, bypassed=65], compacted 403 MiB [≈377 MiB]; admitting 186 MiB with L0 penalty: +544 KiB/req, *0.50/ingest
I220607 18:03:39.419006 539 util/admission/granter.go:1745 ⋮ [s1] 13190  IO overload: 2093 ssts, 37 sub-levels, L0 growth 6.0 MiB: 0 B acc-write + 0 B acc-ingest + 6.0 MiB unacc [≈298 KiB/req, n=116, bypassed=116], compacted 392 MiB [≈385 MiB]; admitting 189 MiB with L0 penalty: +298 KiB/req, *0.50/ingest
I220607 18:03:54.409562 539 util/admission/granter.go:1745 ⋮ [s1] 13194  IO overload: 2093 ssts, 37 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈298 KiB/req, n=419, bypassed=419], compacted 0 B [≈192 MiB]; admitting 143 MiB with L0 penalty: +298 KiB/req, *0.50/ingest
I220607 18:04:09.409528 539 util/admission/granter.go:1745 ⋮ [s1] 13196  IO overload: 2017 ssts, 36 sub-levels, L0 growth 0 B: 0 B acc-write + 0 B acc-ingest + 0 B unacc [≈298 KiB/req, n=114, bypassed=114], compacted 129 MiB [≈161 MiB]; admitting 112 MiB with L0 penalty: +298 KiB/req, *0.50/ingest

Jira issue: CRDB-16503

sumeerbhola commented 2 years ago

Most of the bypassing requests are TruncateLog. These end up synchronized in this workload across the 1000+ ranges because the writes are randomly distributed across all ranges. I suspect this is less of a problem in the real world, and I don't have good simple ideas on how to fix it. One simple idea is to put an arbitrary bound (say 500 bytes) on the tokens consumed by requests that bypass admission control, even if the estimate is higher. But that carries a risk: in situations where 500 bytes is too low and many requests are bypassing admission control, we will give out too many tokens. If we then adjust the estimate based only on requests that did not bypass admission control, it will compensate in the next cycle, but if no bypassing requests arrive in that cycle it will overcompensate. Essentially, it becomes tricky if the number of bypassing requests received per interval fluctuates hugely, which is what happens with these TruncateLog requests.

So I plan to do nothing unless we see this as a problem in real settings.
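For illustration only, the bounding idea discussed above could look roughly like the following. This is a minimal sketch, not code from the admission package: `tokensToDeduct` and `bypassTokenCap` are hypothetical names, and the numbers in `main` are loosely taken from the "+2.0 MiB/req" log lines.

```go
package main

import "fmt"

// bypassTokenCap is the arbitrary bound floated in the discussion (500 bytes).
// Hypothetical constant; not part of the real admission package.
const bypassTokenCap int64 = 500

// tokensToDeduct returns the byte tokens charged for a single request.
// estimate is the current per-request size estimate; bypassed reports
// whether the request skipped admission control.
func tokensToDeduct(estimate int64, bypassed bool) int64 {
	if bypassed && estimate > bypassTokenCap {
		return bypassTokenCap
	}
	return estimate
}

func main() {
	const mib = int64(1) << 20
	estimate := 2 * mib        // per-request estimate, as in "+2.0 MiB/req"
	bypassedCount := int64(39) // bypassing requests seen in one interval

	uncapped := bypassedCount * tokensToDeduct(estimate, false)
	capped := bypassedCount * tokensToDeduct(estimate, true)
	fmt.Printf("uncapped charge: %d MiB\n", uncapped/mib)
	fmt.Printf("capped charge:   %d bytes\n", capped)
}
```

The sketch also makes the risk visible: if the bypassing requests really were large, the capped charge would badly undercount them.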

sumeerbhola commented 2 years ago

Came up with a better idea when working on https://github.com/cockroachdb/cockroach/pull/82813. We would use an estimate at admission time as usual, but when the work is done we would correct it by calling

func (q *StoreWorkQueue) AdmittedWorkDone(h StoreWorkHandle, doneInfo StoreWorkDoneInfo) error

where StoreWorkDoneInfo is defined as

type StoreWorkDoneInfo struct {
    // For ingests, ActualBytes is the size of the sstables. For normal writes,
    // it is the size of the batch. If StoreWriteWorkInfo.WriteBytes > 0, it
    // must be equal to ActualBytes (that is the case where the bytes were known
    // at admission time).
    ActualBytes int64
    // ActualBytesIntoL0 <= ActualBytes. For normal writes this is the equality
    // relationship. For ingests, these are the (approximate) bytes that were
    // ingested into L0.
    ActualBytesIntoL0 int64
}

So we fix the estimation after request evaluation. We would also use this to eliminate the fractionOfIngestIntoL0 estimation.
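A rough sketch of the correction step, under stated assumptions: `bucket`, `handle`, `admit`, and `workDone` are hypothetical stand-ins invented for this example, and only the two `doneInfo` fields mirror the `StoreWorkDoneInfo` definition above. The real `StoreWorkQueue` may well be structured differently.

```go
package main

import "fmt"

// doneInfo mirrors the two fields of StoreWorkDoneInfo quoted above.
type doneInfo struct {
	ActualBytes       int64
	ActualBytesIntoL0 int64
}

// handle remembers what was charged at admission time (hypothetical).
type handle struct {
	estimatedL0Bytes int64
}

// bucket is a hypothetical stand-in for the store's pool of L0 byte tokens.
type bucket struct {
	available int64
}

// admit charges the estimate, since the true size is not yet known.
func (b *bucket) admit(estimate int64) handle {
	b.available -= estimate
	return handle{estimatedL0Bytes: estimate}
}

// workDone replaces the admission-time estimate with the actual bytes
// written into L0, returning or taking the difference.
func (b *bucket) workDone(h handle, info doneInfo) error {
	if info.ActualBytesIntoL0 > info.ActualBytes {
		return fmt.Errorf("ActualBytesIntoL0 %d exceeds ActualBytes %d",
			info.ActualBytesIntoL0, info.ActualBytes)
	}
	b.available += h.estimatedL0Bytes - info.ActualBytesIntoL0
	return nil
}

func main() {
	b := &bucket{available: 1 << 20} // 1 MiB of tokens
	h := b.admit(300 << 10)          // charged a 300 KiB estimate
	// The request turned out to be a tiny 200-byte write.
	if err := b.workDone(h, doneInfo{ActualBytes: 200, ActualBytesIntoL0: 200}); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(b.available) // 1048576 - 200 = 1048376
}
```

With this shape, a tiny write that was overcharged at admission time only holds the excess tokens until it finishes.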

sumeerbhola commented 2 years ago

This issue encompasses estimation of writes at followers (regular writes and ingests), without which both our token estimation and token consumption become flawed.

cockroachdb / cockroach

admission: improve write request size estimation to account for proposed writes and follower writes #82536