cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: lhs range with 0 bytes will never be merged #100443

Open cindyzqtnew opened 1 year ago

cindyzqtnew commented 1 year ago

Describe the problem

There are a lot of 0-byte ranges in the cluster, but they are not contiguous. For example,

R1: 0 bytes
R2: 10 bytes
R3: 0 bytes
R4: 10 bytes
....

If the merge threshold is 5 bytes, then in the merge queue (pkg/kv/kvserver/merge_queue.go),

R1 will never be merged, since R2 is bigger than the threshold.

How should we deal with such a case?
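A minimal Python sketch of the problem (hypothetical byte sizes from the example above; the real check lives in pkg/kv/kvserver/merge_queue.go, in Go):

```python
# Hypothetical range sizes from the example: R1..R4 in bytes.
sizes = [0, 10, 0, 10]
threshold = 5  # assumed merge threshold

def can_merge_today(lhs, rhs):
    # Current behavior: merge only when BOTH sides are below the threshold.
    return lhs < threshold and rhs < threshold

# R1 (0 bytes) can never merge with R2 (10 bytes), and so on:
merges = [can_merge_today(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
print(merges)  # [False, False, False] -- the 0-byte ranges persist
```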


Jira issue: CRDB-26438


knz commented 1 year ago

Hello, thank you for your report.

These ranges are marked to have forced split points and these forced splits prevent them from being automatically merged.

We force splits for a number of system ranges, including most ranges with low IDs.

Some of them can be merged nonetheless using the cluster setting spanconfig.storage_coalesce_adjacent.enabled. Some others will always remain split. In any case, I believe this is expected behavior.

Does this cause a problem in your deployment?

cindyzqtnew commented 1 year ago

Hi @knz ,

I don't mean the system ranges, just the user data ranges.

In my cluster, users insert rows and then delete them later, so there are more and more 0-byte ranges that are not contiguous, like holes in the overall key span. We want to merge them since they contain no data, but maintaining them costs resources, and the constantly growing range count may lead to stability issues.

If I enable the setting spanconfig.storage_coalesce_adjacent.enabled, will the system ranges you mentioned be merged even though they should remain split?

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

knz commented 1 year ago

Indeed here: https://github.com/cockroachdb/cockroach/blob/00ec8d330a18d77e3aaa2e6d02e79f2d03a01404/pkg/kv/kvserver/merge_queue.go#L244-L261

The merge should occur if either side is below the min size threshold, instead of when both are.
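As an illustrative Python sketch (not the actual Go code; the min/max values are made up), the suggested change turns the conjunction into a disjunction, guarded so the merged range would not immediately split again:

```python
def should_merge_suggested(lhs, rhs, min_bytes, max_bytes):
    # Merge if EITHER side is below the min size threshold, as long as
    # the merged range stays under the max size.
    return (lhs < min_bytes or rhs < min_bytes) and lhs + rhs < max_bytes

print(should_merge_suggested(0, 10, 5, 100))   # True: the LHS is empty
print(should_merge_suggested(10, 10, 5, 100))  # False: neither side is small
```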

andrewbaptist commented 1 year ago

@kvoli - we should probably discuss the approach here.

It's not quite as simple as merging when either side is below the minimum. If one side is close to the maximum, we avoid merging so that we don't split again soon afterward. The way this is currently configured is a little wonky, since there are two closely related but different checks that prevent merge-then-split.

A few lines further down, at https://github.com/cockroachdb/cockroach/blob/00ec8d330a18d77e3aaa2e6d02e79f2d03a01404/pkg/kv/kvserver/merge_queue.go#L292, we check whether the merged range would split again. That "split again" check is not great at preventing splits, as it can let us get very close to the split limit, so that the next write causes a split.

Short term, setting the min bytes very high in this cluster (even as high as the max bytes) will force those ranges to merge; however, a bigger refactor of this is probably due.

One proposal would be to keep RangeMaxBytes as we have today and add a second parameter, RangeMinUtilizationRatio, which would default to something like 0.5 and signify that adjacent ranges will be merged if the merged size is less than that ratio of the max. This second parameter could be a system-level configuration rather than a zone configuration. I think this would be less confusing for users and cleaner to implement. Today, RangeMinBytes is a very confusing parameter.

erikgrinaker commented 1 year ago

It's not quite as simple as making it so that either side is below the minimum. If one size is close to the maximum, then we avoid merging so we don't split soon afterward. The way this is currently configured is a little wonky since there are two closely related but different checks to prevent merge-then-split.

That's fine though -- there's no point merging only to split again, and we can't rebalance the ranges without merging them first. But in the common case where the merged range won't exceed the max size, we should merge them.

kvoli commented 1 year ago

If either the LHS or RHS is less than RangeMinBytes and not shouldSplit(LHS + RHS), you're worried we may be too close to the split threshold if we do merge, @andrewbaptist?

I think adding a cluster setting (or perhaps a span config field) like you suggested, adding a conjunct for the min utilization ratio, would be a quick fix, along with changing the min check from a conjunction to a disjunction (from (LHS < min && RHS < min) to (LHS < min || RHS < min)).

So we end up with

(LHS < min || RHS < min) && !shouldSplit(LHS+RHS) && (minUtilizationRatio * maxBytes > (LHS+RHS))
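That predicate can be transcribed into Python for illustration (not the real implementation; shouldSplit is approximated here as "merged size reaches maxBytes"):

```python
def should_merge(lhs, rhs, min_bytes, max_bytes, min_util_ratio):
    def should_split(size):
        # Crude stand-in for the real shouldSplit check.
        return size >= max_bytes
    return ((lhs < min_bytes or rhs < min_bytes)
            and not should_split(lhs + rhs)
            and min_util_ratio * max_bytes > lhs + rhs)

# With values resembling RangeMinBytes = max/4 and a 0.5 ratio:
print(should_merge(0, 200, 128, 512, 0.5))  # True: merged size 200 < 256
print(should_merge(0, 300, 128, 512, 0.5))  # False: the ratio cap blocks it
```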
andrewbaptist commented 1 year ago

Today by default we have RangeMinBytes = RangeMaxBytes/4

There are really 5 scenarios worth considering:

1) LHS and RHS are both less than RangeMinBytes: we merge today and should continue to.
2) One (but not both) of LHS and RHS is less than RangeMinBytes, and the merged range would be less than RangeMaxBytes/2: this should merge but doesn't today.
3) One (but not both) of LHS and RHS is less than RangeMinBytes, but the merged range would be between RangeMaxBytes/2 and RangeMaxBytes: this doesn't merge today, and it's not clear whether it should.
4) Both LHS and RHS are greater than RangeMinBytes, and the merged range would be between RangeMaxBytes/2 and RangeMaxBytes: this doesn't merge today, and it's not clear whether it should.
5) The merged range would be greater than RangeMaxBytes: don't merge.

The proposal is to have a ZONE CONFIG RangeMaxBytes (default 512MB) and a SYSTEM CONFIG RangeMaxMergeRatio, which by default would be 0.5. We could likely set RangeMaxMergeRatio higher (0.75?) if we broke it out this way. The cases then become simpler to think through, both internally and for end users.
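The five scenarios above can be encoded as a small classifier (an illustrative Python sketch, assuming the default RangeMinBytes = RangeMaxBytes/4):

```python
def classify(lhs, rhs, max_bytes=512):
    # Returns the scenario number (1-5) for a pair of adjacent range sizes.
    min_bytes = max_bytes / 4
    merged = lhs + rhs
    if merged > max_bytes:
        return 5  # never merge
    num_small = (lhs < min_bytes) + (rhs < min_bytes)
    if num_small == 2:
        return 1  # merges today
    if num_small == 1:
        return 2 if merged < max_bytes / 2 else 3
    return 4  # neither side small; the merged size lands in (max/2, max)

print([classify(100, 100), classify(100, 150), classify(100, 300),
       classify(200, 200), classify(300, 300)])  # [1, 2, 3, 4, 5]
```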

andrewbaptist commented 1 year ago

along with changing the min check to be a disjunction instead of conjunction

I don't think the min check makes sense anymore if we add this parameter. Especially for case 4 I describe above.

erikgrinaker commented 1 year ago

I'm not sure I understand the motivation for this new parameter. If we have one range that's 5% full and one range that's 90% full, you're right that the combined range might soon reach 100% and get split. But if that happens, we'll end up with two 50% full ranges instead of one 5% full and one 95% full. Surely that's an improvement?

andrewbaptist commented 1 year ago

My motivation was to avoid more frequent splits/merges than we have today. You do bring up an interesting case that I hadn't thought about, which is "replica neighbor rebalancing". Specifically, there are a few cases close to what you described:

1) Replica 1 at 5% of max size, Replica 2 at 90% of max size: join (size of 95%) and might split soon.
2) Replica 1 at 15% of max size, Replica 2 at 90% of max size: join (size of 105%) and then immediately split (?)

We don't do this today and I haven't thought it through, but it does seem useful. By setting the parameter to 0.5 we would get a strict improvement on today's behavior. By setting it to 1.0 we would get the behavior of joining up to the max split size. There wouldn't be a way to get the behavior described by my second case, but that might not be important.

andrewbaptist commented 1 year ago

I'm not sure I understand the motivation for this new parameter. If we have one range that's 5% full and one range that's 90% full, you're right that the combined range might soon reach 100% and get split. But if that happens, we'll end up with two 50% full ranges instead of one 5% full and one 95% full. Surely that's an improvement?

Consider the case where two adjacent ranges have a "mostly static" amount of data in them (512MB +/- 1MB). If a small amount of data is constantly being written to and deleted from such a range, we might be splitting and merging it very aggressively. The purpose of the new parameter is to protect against this kind of split/merge flapping.

erikgrinaker commented 1 year ago

It would only flap if the min size were set to 50% of the max size. By default it's set to 25% (128 MB), which prevents flapping in that case: adding 1 MB to 512 MB results in two 256 MB ranges, but deleting 1 MB will not cause them to merge -- we'd have to delete at least 128 MB. Furthermore, the deletes only take effect once garbage collected, so the GC TTL, MVCC GC queue, and GC heuristics further counteract this flapping.
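Working through those numbers (defaults as stated above: max 512 MB, min 128 MB) as a quick Python check:

```python
max_bytes = 512
min_bytes = max_bytes // 4  # 128, the 25% default

# A 512 MB range that grows by 1 MB splits into two ~256.5 MB halves.
lhs = rhs = (512 + 1) / 2

# Under the both-sides rule, they only re-merge once BOTH halves drop
# below min_bytes, i.e. after at least ~128 MB of deletes per side.
both_below_min = lhs < min_bytes and rhs < min_bytes
print(both_below_min)  # False: no flapping at the 25% default
```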

andrewbaptist commented 1 year ago

Right - but I was proposing getting rid of the min size parameter. Currently it is set to 1/4 of the max size to protect against exactly the case you note. The open question is whether two adjacent ranges of 130MB each should be merged (the combined size would be 260MB).

My proposal would drop MinSize completely and only have MaxSize (zone level) and MergeRatio (system level).

I think it results in 2 improvements:

1) The parameters are confusingly interlinked today. Users of zone configurations shouldn't need to set two parameters that are hard to get right. As an example, our documentation says:

For example, the range_max_bytes and range_min_bytes variables must be set together, so when editing one value, you can use COPY FROM PARENT for the other. 

This guidance is not really ideal, and nowhere do we say to set the min value to 1/4 of the max value, or what happens if it is set very small (there is an example on that same page where it is set to 0).

2) We are somewhat asymmetric in how we decide to merge/split. It seems better to base this on the "merged range" size rather than the "pre-merged" sizes. Consider the case where you have many adjacent small ranges. With a MergeRatio set to 0.75, at the end no adjacent pair will sum to less than 0.75 and no single range will exceed 0.75. But with the current approach (or a change to allow merges if either side is < min) you could end up with adjacent ranges at 0.50001 and some ranges at 0.99999 once merging is done. I haven't simulated what this looks like in practice, but intuitively it seems better to make this more predictable.

andrewbaptist commented 1 year ago

I wrote a small simulator to see what would happen if I started with 10,000 ranges, all between 0 and 0.5, and "ran the merge queue" against them randomly. As expected, both options are better than what we have today, but using the "merged size" as the primary criterion ends with a better-looking distribution and fewer ranges.

[histogram: final range-size distributions and range counts for the three merge policies]

Code below:

 import random
 import matplotlib.pyplot as plt

 # Merge if both sides < 0.25 (current behavior)
 def can_merge_today(a, b):
     return a < 0.25 and b < 0.25

 # Merge if either side < 0.25 and the merged size < 1
 def can_merge_old(a, b):
     return (a < 0.25 or b < 0.25) and a + b < 1

 # Merge if the combined size is < 0.75
 def can_merge_new(a, b):
     return a + b < 0.75

 def run_loop(check):
     # create a list of 10,000 random range sizes between 0 and 0.5
     num_list = [random.random() / 2 for _ in range(10000)]

     # loop until no more merging is possible
     merged_any = True
     while merged_any:
         merged_any = False
         lhs = list(range(len(num_list) - 1))
         # process all LHS indices in random order
         random.shuffle(lhs)
         for i in lhs:
             if i < len(num_list) - 1 and check(num_list[i], num_list[i + 1]):
                 num_list[i] += num_list[i + 1]
                 num_list.pop(i + 1)
                 merged_any = True

     return num_list

 fig, axis = plt.subplots(3)
 fig.tight_layout(h_pad=2)

 # histogram of the remaining range sizes under each policy
 l = run_loop(can_merge_today)
 axis[0].hist(l, range=(0, 1), bins=50)
 axis[0].set_title("Today: Num ranges " + str(len(l)))

 l = run_loop(can_merge_old)
 axis[1].hist(l, range=(0, 1), bins=50)
 axis[1].set_title("Either small enough: Num ranges " + str(len(l)))

 l = run_loop(can_merge_new)
 axis[2].hist(l, range=(0, 1), bins=50)
 axis[2].set_title("Cap merged: Num ranges " + str(len(l)))

 plt.show()
erikgrinaker commented 1 year ago

I see, thanks for running these simulations! Cap merged does seem like an improvement. We can discuss the merits of a new range policy (write up an issue/proposal?), which is timely as we consider larger range sizes anyway, but in the meantime I'd be inclined to make a backportable one-line fix for the immediate problem at hand.

cindyzqtnew commented 1 year ago

hi guys,

can i just simply deal with merging the 0-byte range by calculating if either side is below the merge threshold and the new-merged range is far below the max size of range?

erikgrinaker commented 1 year ago

can i just simply deal with merging the 0-byte range by calculating if either side is below the merge threshold and the new-merged range is far below the max size of range?

Yep. As a one-off, you can also try temporarily increasing the min/max range sizes, wait for the ranges to merge, and then reduce them again.

andrewbaptist commented 1 year ago

After discussing this internally a little. There are fundamentally three options we could take:

1) Leave the logic as it is and update the documentation to state: "The minimum size, in bytes, for a range of data in the zone before it is considered for merging. When two adjacent ranges are both less than this size, CockroachDB will merge them." This has the disadvantage that we might have up to 2x as many ranges as "ideal", but it is easy to understand.

2) Change the logic to merge any range where current_range_size < min and merged_range_size < max. The documentation would then say: "The minimum size, in bytes, for a range of data in the zone before it is considered for merging. When two adjacent ranges can be merged and the result is less than this size, CockroachDB will merge those ranges." This has a number of problems, detailed below.

3) Begin deprecating min_range_size. Specifically, set it to 0 for new clusters/zone configurations, and possibly write a script that zeroes it out if it is ~1/4 of max_range_size. Add a new parameter, kv.range_merge.max_merged_ratio, a ratio of the max size applied to the combined range, to cap when merges are done instead of using min bytes at all. Since the old parameter would be deprecated, the documentation would reflect this, and the cluster settings page would define how the new parameter works. We would still only consider merging ranges with a size < min_range_size, but the guidance would be to set it to 0 on existing clusters. If we designed this from scratch, this is likely the path we would take. Setting the config value to 0.5 gives behavior that is similar to but strictly better than today's (it will merge adjacent ranges that are at 0.05 and 0.4). Setting it to 0.75 makes merges more aggressive.
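Option 3 reduces to a one-line predicate; here is an illustrative Python sketch (the setting name kv.range_merge.max_merged_ratio is the one proposed above, not an existing setting):

```python
def should_merge_option3(lhs, rhs, max_bytes, max_merged_ratio):
    # No min size at all: merge whenever the combined range stays under
    # the ratio cap of the max size.
    return lhs + rhs < max_merged_ratio * max_bytes

# Sizes expressed as fractions of max_bytes = 1.0:
print(should_merge_option3(0.05, 0.40, 1.0, 0.5))   # True: the 0.05/0.4 case
print(should_merge_option3(0.30, 0.40, 1.0, 0.75))  # True: 0.75 merges more aggressively
print(should_merge_option3(0.30, 0.40, 1.0, 0.5))   # False: 0.7 exceeds the 0.5 cap
```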

Drawbacks of 2 and 3:

Additional drawbacks of 2:

Additional drawbacks of 3:

range_count