Closed mheinzel closed 2 months ago
Posting this here so I don't forget: I was reading the Monkey paper and found this quote:
If multiple runs that are being sort-merged contain entries with the same key, only the entry from the most recently-created (youngest) run is kept because it is the most up-to-date. Thus, the resulting run may be smaller than the cumulative sizes of the original runs. When a merge operation is finished, the resulting run moves to Level i + 1 if Level i is at capacity.
Specially the last part we do differently: we look only at the size of the run to determine whether it should stay, and we don't look at the overall level capacity. Would it be a reasonable alternative to promote runs based on the level capacity, not just their own size?
Would it be a reasonable alternative to promote runs based on the level capacity, not just their own size?
Ah interesting. Not sure I fully understand the context (maybe it's just talking about levelling, where this is pretty much what we do already?), but should be possible for tiering as well. However, it weakens the invariants on run sizes quite a bit.
Would it be a reasonable alternative to promote runs based on the level capacity, not just their own size?
Ah interesting. Not sure I fully understand the context (maybe it's just talking about levelling, where this is pretty much what we do already?), but should be possible for tiering as well. However, it weakens the invariants on run sizes quite a bit.
Indeed, it seems we already do this for levelling, but for tiering we only look at the run size without looking at the overall level capacity. In theory this can lead to holding back a run even when the level would then be at capacity (the level being the previous one, since the incoming run still counts as being on the previous level technically)
Maybe taking into account level capacity would remove the need for supplying >1
credits in creditsForMerge
, but I haven't thought it through
Maybe, if you can fit in #310 (comment), that would be nice, otherwise could you create an issue for it that links to my comment?
I can open an issue and quickly outline various ideas for changing when to start a merge. Thinking about that again, "if Level i is at capacity" might just mean "if there are 4 runs", it doesn't say whether capacity refers to number of entries or number of runs. But I should have another look at the paper at some point.
I adressed the feedback and rebased. What do you think?
Maybe, if you can fit in #310 (comment), that would be nice, otherwise could you create an issue for it that links to my comment?
I can open an issue and quickly outline various ideas for changing when to start a merge.
Yeah, let's open an issue
Thinking about that again, "if Level i is at capacity" might just mean "if there are 4 runs", it doesn't say whether capacity refers to number of entries or number of runs. But I should have another look at the paper at some point.
No, capacity talks about the entries in the run, not the number of runs
Probably easiest to review commit by commit.
I managed to find a few invariants that too tight, or looser than necessary. However, this required adding handwritten tests, as the lockstep test coverage is not great.
The fix for incorrect assumptions violated by 5-way merges (holding back an underfull run) are simply addressed by providing more merge credits, but there are alternatives we can think about afterwards (e.g. #311)