cindyzqtnew opened this issue 1 year ago
Hello, thank you for your report.
These ranges are marked to have forced split points and these forced splits prevent them from being automatically merged.
We force splits for a number of system ranges, including most ranges with low IDs.
Some of them can be merged nonetheless using the cluster setting spanconfig.storage_coalesce_adjacent.enabled. Some others will always remain split. In any case, I believe this is expected behavior.
Does this cause a problem in your deployment?
Hi @knz ,
I don't mean the system ranges, just the user data ranges.
In my cluster, users insert rows and then delete them later. As a result, there are more and more 0-byte ranges that are not contiguous, like holes in the overall key span. We want to merge them since they hold no data, but maintaining them costs resources, and the constantly growing range count may lead to stability issues.
If I enable the setting spanconfig.storage_coalesce_adjacent.enabled, will the system ranges you mentioned, which should stay split, also be merged?
cc @cockroachdb/replication
The merge should occur if either side is below the min size threshold, instead of when both are.
@kvoli - we should probably discuss the approach here.
It's not quite as simple as making it so that either side is below the minimum. If one size is close to the maximum, then we avoid merging so we don't split soon afterward. The way this is currently configured is a little wonky since there are two closely related but different checks to prevent merge-then-split.
A few lines further down, at https://github.com/cockroachdb/cockroach/blob/00ec8d330a18d77e3aaa2e6d02e79f2d03a01404/pkg/kv/kvserver/merge_queue.go#L292, we check whether the merged range would split again. The "split again" check is not great for preventing splits, as it can let us get very close to the split limit, so that the next write causes a split.
Short term, setting the min bytes high in this cluster (even as high as the max bytes) will force those ranges to merge; however, a bigger refactor of this is probably due.
One proposal would be to keep RangeMaxBytes as we do today and add a second parameter RangeMinUtilizationRatio, which would default to something like 0.5 and signify that if adjacent ranges can be merged and the merged size is less than that ratio, then the adjacent ranges will be merged. This second parameter could be a system-level configuration rather than a zone configuration. I think this would be less confusing for users and cleaner to implement. Today RangeMinBytes is a very confusing parameter.
It's not quite as simple as making it so that either side is below the minimum. If one size is close to the maximum, then we avoid merging so we don't split soon afterward. The way this is currently configured is a little wonky since there are two closely related but different checks to prevent merge-then-split.
That's fine though -- there's no point merging only to split again, and we can't rebalance the ranges without merging them first. But in the common case where the merged range won't exceed the max size, we should merge them.
If either the LHS or RHS is less than RangeMinBytes and not shouldSplit(LHS + RHS), you're worried we may be close to the split threshold if we do merge, @andrewbaptist?
I think adding a cluster setting (or perhaps span config field) like you suggested, which adds an additional conjunct for the min utilization ratio, would be a quick fix, along with changing the min check from a conjunction to a disjunction (from (LHS < min && RHS < min) to (LHS < min || RHS < min)).
So we end up with
(LHS < min || RHS < min) && !shouldSplit(LHS+RHS) && (minUtilizationRatio * maxBytes > (LHS+RHS))
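The combined condition can be sketched as a small Python predicate. This is a sketch with illustrative names (should_merge, min_utilization_ratio), not the actual CockroachDB code, and it assumes a size-only shouldSplit, whereas the real split check also considers load:

```python
# A sketch of the proposed merge predicate; names and defaults are
# illustrative, not the actual CockroachDB implementation.

def should_split(size, max_bytes):
    # Size-only stand-in for the real (load-aware) split check.
    return size >= max_bytes

def should_merge(lhs, rhs, min_bytes, max_bytes, min_utilization_ratio=0.5):
    merged = lhs + rhs
    return ((lhs < min_bytes or rhs < min_bytes)     # disjunction, not conjunction
            and not should_split(merged, max_bytes)  # avoid merge-then-split
            and merged < min_utilization_ratio * max_bytes)

MiB = 1 << 20
# A nearly empty range next to a modest one merges...
assert should_merge(5 * MiB, 100 * MiB, 128 * MiB, 512 * MiB)
# ...but two ranges already above the minimum do not.
assert not should_merge(200 * MiB, 200 * MiB, 128 * MiB, 512 * MiB)
```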
Today by default we have RangeMinBytes = RangeMaxBytes/4
There are really five scenarios worth considering:
1) LHS and RHS are both less than RangeMinBytes - in this case we merge today and should continue to.
2) One (but not both) of LHS and RHS is less than RangeMinBytes, but the merged range will be less than RangeMaxBytes/2 - this should merge but doesn't merge today.
3) One (but not both) of LHS and RHS is less than RangeMinBytes, but the merged range will be between RangeMaxBytes/2 and RangeMaxBytes - this doesn't merge today and it's not clear if it should.
4) Both LHS and RHS are greater than RangeMinBytes, but the merged range will be between RangeMaxBytes/2 and RangeMaxBytes - this doesn't merge today and it's not clear if it should.
5) The merged range will be greater than RangeMaxBytes - don't merge.
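The five cases above can be sketched as a small classifier, assuming the default RangeMinBytes = RangeMaxBytes/4 and expressing sizes as fractions of RangeMaxBytes (the function is hypothetical, for illustration only):

```python
# An illustrative classifier for the five scenarios above, assuming the
# default RangeMinBytes = RangeMaxBytes / 4; sizes are fractions of
# RangeMaxBytes. The function name is hypothetical.

def scenario(lhs, rhs, min_frac=0.25):
    merged = lhs + rhs
    if merged >= 1.0:
        return 5  # merged range would exceed RangeMaxBytes: don't merge
    if lhs < min_frac and rhs < min_frac:
        return 1  # both below RangeMinBytes: merges today
    if lhs < min_frac or rhs < min_frac:
        # one side below RangeMinBytes: split on RangeMaxBytes / 2
        return 2 if merged < 0.5 else 3
    return 4  # both above RangeMinBytes, merged still below RangeMaxBytes

assert scenario(0.10, 0.10) == 1  # merges today
assert scenario(0.10, 0.30) == 2  # should merge but doesn't today
assert scenario(0.20, 0.60) == 3  # unclear whether it should merge
assert scenario(0.30, 0.40) == 4  # unclear whether it should merge
assert scenario(0.60, 0.60) == 5  # don't merge
```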
The proposal is to have a configuration with a ZONE CONFIG RangeMaxBytes (default 512MB) and a SYSTEM CONFIG RangeMaxMergeRatio, which by default would be 0.5. We could likely set RangeMaxMergeRatio higher (0.75?) if we broke it out this way. Then the cases are simpler to think through, both internally and for end users.
along with changing the min check to be a disjunction instead of conjunction
I don't think the min check makes sense anymore if we add this parameter. Especially for case 4 I describe above.
I'm not sure I understand the motivation for this new parameter. If we have one range that's 5% full and one range that's 90% full, you're right that the combined range might soon reach 100% and get split. But if that happens, we'll end up with two 50% full ranges instead of one 5% full and one 95% full. Surely that's an improvement?
My motivation was to try to avoid more frequent splits/merges than we have today. You do bring up an interesting case that I hadn't thought about which is "replica neighbor rebalancing". Specifically, there are a few cases close to what you described:
1) Replica 1 at 5% max size, Replica 2 at 90% max size - join (size of 95%) and might split soon.
2) Replica 1 at 15% max size, Replica 2 at 90% max size - join (size of 105%) and then immediately split (?)
We don't do this today and I haven't thought it through, but it does seem useful. By setting the parameter to 0.5 we would get a strict improvement on today's behavior. By setting it to 1.0 we would get the behavior of joining up to the max split size. There wouldn't be a way to get the behavior described by my second case, but that might not be important.
I'm not sure I understand the motivation for this new parameter. If we have one range that's 5% full and one range that's 90% full, you're right that the combined range might soon reach 100% and get split. But if that happens, we'll end up with two 50% full ranges instead of one 5% full and one 95% full. Surely that's an improvement?
Consider the case where two adjacent ranges have a "mostly static" amount of data in them (512MB +/- 1MB). If a small amount of data is constantly being added to and deleted from such a range, then we might be very aggressively splitting/merging it. The purpose of the new parameter is to protect against this kind of split/merge flapping.
It only would if the min size is set to 50% of the max size. By default it's set to 25% (128 MB), which will prevent flapping in that case: adding 1 MB to 512 MB will result in two 256 MB ranges, but deleting 1 MB will not cause them to merge -- we'll have to delete at least 128 MB. Furthermore, the deletes would only take effect once garbage collected, so the GC TTL, MVCC GC queue, and GC heuristics would further counteract this flapping.
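The arithmetic above can be sanity-checked with a short sketch, using the defaults mentioned (RangeMaxBytes = 512 MB, RangeMinBytes = 128 MB); the variable names are illustrative:

```python
# Checking the flapping arithmetic above, with the defaults
# RangeMaxBytes = 512 MB and RangeMinBytes = RangeMaxBytes / 4 = 128 MB.

max_bytes = 512
min_bytes = max_bytes / 4          # 128 MB

# Adding 1 MB to a full 512 MB range triggers a split into two halves.
half = (max_bytes + 1) / 2         # ~256.5 MB each

# Under today's rule, both halves must drop below RangeMinBytes before
# the merge queue recombines them, so deleting the 1 MB back is not enough:
needed_per_side = half - min_bytes
assert needed_per_side > 128       # each side must shed well over 128 MB
assert half - 1 > min_bytes        # deleting just 1 MB does not re-merge
```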
Right - but I was proposing getting rid of the min size parameter. Currently it is set to 1/4 of the max size to protect against exactly the case you note. The open question is whether two adjacent 130MB ranges should be merged (the combined size would be 260MB).
My proposal would drop MinSize completely and only have MaxSize (zone level) and MergeRatio (system level).
I think it results in 2 improvements:
1) There are confusing, interlinked parameters today. Users of zone configurations would no longer need to set two parameters that are hard to get right. As an example, our documentation says:
For example, the range_max_bytes and range_min_bytes variables must be set together, so when editing one value, you can use COPY FROM PARENT for the other.
This guidance is not really ideal; nowhere do we say to set the min value to 1/4 of the max value, or what happens if it is set very small (there is an example where it is set to 0 on that same page).
2) We are somewhat asymmetric when we decide to merge/split. It seems better to base this on the "merged range" size vs the "pre-merged" size. Consider the case where you have many adjacent small ranges. With a MergeRatio set to 0.75, at the end you won't have any pair of adjacent ranges whose combined size is less than 0.75, and no single range >0.75. But with the current approach (or a change to allow merges if either is < min) you could end up with adjacent ranges at 0.50001 and some ranges at 0.99999 once merging is done. I haven't simulated what this looks like in practice, but intuitively it seems better to make this more predictable.
I wrote a small simulator to see what would happen if I started with 10,000 ranges, all between 0 and 0.5, and "ran the merge queue" against them randomly. As expected, both options are better than what we have today, but using the "merged size" as the primary criterion ends with a better-looking distribution and fewer ranges.
Code below:
```python
import random
import matplotlib.pyplot as plt

# Merge if both sides < 0.25.
def can_merge_today(a, b):
    return a < 0.25 and b < 0.25

# Merge if either side < 0.25 and the merged size < 1.
def can_merge_old(a, b):
    return (a < 0.25 or b < 0.25) and a + b < 1

# Merge if the combined size is < 0.75.
def can_merge_new(a, b):
    return a + b < 0.75

def run_loop(check):
    # Create a list of 10,000 random range sizes between 0 and 0.5.
    num_list = [random.random() / 2 for _ in range(10000)]
    # Loop until no more merging is possible.
    merge_done = True
    while merge_done:
        merge_done = False
        lhs = list(range(len(num_list) - 1))
        # Process all ranges in random order.
        random.shuffle(lhs)
        for i in lhs:
            if i < len(num_list) - 1 and check(num_list[i], num_list[i + 1]):
                num_list[i] += num_list[i + 1]
                num_list.pop(i + 1)
                merge_done = True
    return num_list

fig, axis = plt.subplots(3)
fig.tight_layout(h_pad=2)
# Create a histogram of the remaining range sizes for each policy.
l = run_loop(can_merge_today)
axis[0].hist(l, range=(0, 1), bins=50)
axis[0].set_title("Today: Num ranges " + str(len(l)))
l = run_loop(can_merge_old)
axis[1].hist(l, range=(0, 1), bins=50)
axis[1].set_title("Either small enough: Num ranges " + str(len(l)))
l = run_loop(can_merge_new)
axis[2].hist(l, range=(0, 1), bins=50)
axis[2].set_title("Cap merged: Num ranges " + str(len(l)))
plt.show()
```
I see, thanks for running these simulations! Cap merged does seem like an improvement. We can discuss the merits of a new range policy (write up an issue/proposal?), which is timely as we consider larger range sizes anyway, but in the meanwhile I'd be inclined to make a backportable one-line fix for the immediate problem at hand.
hi guys,
Can I simply deal with merging the 0-byte ranges by checking whether either side is below the merge threshold and the new merged range is far below the max range size?
Can I simply deal with merging the 0-byte ranges by checking whether either side is below the merge threshold and the new merged range is far below the max range size?
Yep. As a one-off, you can also try temporarily increasing the min/max range sizes, wait for the ranges to merge, and then reduce them again.
After discussing this internally a little. There are fundamentally three options we could take:
1) Leave the logic as it is and update the documentation to state: "The minimum size, in bytes, for a range of data in the zone before it is considered for merging. When two adjacent ranges are both less than this size, CockroachDB will merge them." This has the disadvantage that we might have up to 2x as many ranges as "ideal", but it is easy to understand.
2) Change the logic to merge any range where `current_range_size < min` and `merged_range_size < max`. The documentation would then say: "The minimum size, in bytes, for a range of data in the zone before it is considered for merging. When two adjacent ranges can be merged and be less than this size, CockroachDB will merge those ranges." This has a number of problems detailed below.
3) Begin deprecating min_range_size. Specifically, set it to 0 for new clusters/zone configurations, and possibly write a script that will zero it out if it is ~1/4 of max_range_size. Add a new parameter kv.range_merge.max_merged_ratio, which will be a ratio of the max_size of the combined range, to put a cap on when merges are done instead of using min bytes at all. Since the parameter is deprecated, the documentation would reflect this, and the cluster settings page would define how this new parameter works. We would still only consider merging ranges with a size < min_range_size, but the guidance would be to update existing clusters to set it to 0. If we designed this from scratch, this is likely the path we would take. Setting the config value to 0.5 gives behavior that is similar to but strictly better than today's (it will merge adjacent ranges that are at 0.05 and 0.4). Setting it to 0.75 makes merges more aggressive.
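Option 3's check can be sketched in a few lines. This is a sketch only, assuming a single system setting kv.range_merge.max_merged_ratio and no per-range minimum; the function name is hypothetical, not the actual CockroachDB code:

```python
# A sketch of option 3's merge check: only RangeMaxBytes and the proposed
# kv.range_merge.max_merged_ratio setting are consulted. The function name
# is illustrative, not the real implementation.

def can_merge_option3(lhs, rhs, max_bytes, max_merged_ratio=0.5):
    # Merge whenever the combined size stays under the configured fraction
    # of RangeMaxBytes; no per-range minimum is consulted at all.
    return lhs + rhs < max_merged_ratio * max_bytes

# At the default 0.5 this merges the 0.05 + 0.4 case mentioned above...
assert can_merge_option3(0.05 * 512, 0.40 * 512, 512)
# ...but not two adjacent 130 MB ranges (260 MB >= 256 MB).
assert not can_merge_option3(130, 130, 512)
# At 0.75 the merge becomes more aggressive (260 MB < 384 MB).
assert can_merge_option3(130, 130, 512, max_merged_ratio=0.75)
```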
Drawbacks of 2 and 3:
- Reading RangeStats much more frequently than today. In order to compute whether a merge should be done, it is always necessary to read the range stats of the RHS. Today we rarely do that; it is only done when the LHS is less than min_size. We could do something like cache an estimate of the RHS, but we would only do this if this turned out to be an expensive check.
Additional drawbacks of 2:
Additional drawbacks of 3:
Describe the problem
There are a lot of 0-byte ranges in the cluster, but they are not contiguous. For example:
R1: 0 bytes, R2: 10 bytes, R3: 0 bytes, R4: 10 bytes, ...
If the merge threshold is 5 bytes, then in the merge queue (pkg/kv/kvserver/merge_queue.go) R1 will never be merged, since R2 is bigger than the threshold.
How should we deal with such a case?
Jira issue: CRDB-26438