dgraph-io / badger

Fast key-value DB in Go.
https://dgraph.io/badger
Apache License 2.0

[GC] Value log files under 10 MB are not garbage collected #442

Closed schomatis closed 6 years ago

schomatis commented 6 years ago

Value log file size (ValueLogFileSize) can be set as low as 1 MB, but the GC seems to skip files under 10 MB. It is not clear (to me) what the purpose of that threshold is. Should value log file sizes under 10 MB be avoided?

manishrjain commented 6 years ago

Two things:

  1. We don't touch the latest value log file for GC.
  2. Value log files below 10 MB should be avoided. They're just too small; if you have small values, they'd fit better in the LSM tree.
arctica commented 6 years ago

This is surprising behavior. If Badger doesn't GC files under 10 MB, then it at least shouldn't accept such values in the settings. We also have a growing number of 8 MB value log files, and I only stumbled upon this issue by accident; I had assumed they weren't being cleaned up because of the low write volume in the LSM in our DB.

By the way, I respectfully disagree with the arbitrary limit of 10 MB. A small file doesn't have to mean that the values are small; it can simply mean that few writes are happening. It might be totally reasonable to have 1 MB value log files in order not to waste space unnecessarily. It would also allow the GC to work in smaller, more frequent steps.

schomatis commented 6 years ago

Yes, and more importantly from my perspective, this undocumented, hard-coded value makes it much more difficult to:

  1. Understand how GC works.

  2. Debug GC at a small scale; many of the bugs that can be found (e.g., #495) are logical in nature and are pretty much independent of value log file size (e.g., anywhere from 1 GB down to 10 KB).

arctica commented 6 years ago

The current code requires vlog files to have at least 10k entries and be at least 10 MB in size before they are considered for GC. Instead of these arbitrary numbers, I would suggest changing the logic to something like:

If garbage > gc_threshold1 || (garbage > gc_threshold2 && last_modified > gc_minAge) { collect() }

So one could configure it such that a file is collected if it is either at least 50% garbage, or more than 10% garbage and last modified more than 24h ago.

This would allow files with high write volume to be garbage collected quickly enough, but also allow old files to be eventually trimmed.
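Purely to make the proposal concrete, here is a minimal sketch of that check; all names and thresholds below are hypothetical and not part of Badger:

```go
package main

import (
	"fmt"
	"time"
)

type vlogStats struct {
	garbageRatio float64   // fraction of the file that is stale data
	lastModified time.Time // mtime of the value log file
}

type gcPolicy struct {
	threshold1 float64       // e.g. 0.5: collect regardless of age
	threshold2 float64       // e.g. 0.1: collect only if the file is old enough
	minAge     time.Duration // e.g. 24 * time.Hour
}

// shouldCollect mirrors: garbage > threshold1 || (garbage > threshold2 && last_modified > minAge).
func (p gcPolicy) shouldCollect(s vlogStats) bool {
	oldEnough := time.Since(s.lastModified) > p.minAge
	return s.garbageRatio > p.threshold1 || (s.garbageRatio > p.threshold2 && oldEnough)
}

func main() {
	p := gcPolicy{threshold1: 0.5, threshold2: 0.1, minAge: 24 * time.Hour}
	s := vlogStats{garbageRatio: 0.15, lastModified: time.Now().Add(-48 * time.Hour)}
	fmt.Println(p.shouldCollect(s)) // true: >10% garbage and last modified more than 24h ago
}
```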

If you want to allow users maximum flexibility, which we would prefer, then maybe allow specifying a callback function to the GC method, which would be called with information about the file (size, amount of garbage, last modification time, etc.) and would return a boolean indicating whether the file should be collected. There's no one-size-fits-all, so being able to specify the GC strategy would be ideal. I think it also matches the current idea that Badger asks the user to run the GC whenever they think is ideal. It would take that idea to its logical conclusion.
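As an illustration of what such a hook could look like (GCFileInfo, GCPolicyFunc, and the whole callback shape are made up here, not an existing Badger API):

```go
package main

import (
	"fmt"
	"time"
)

// GCFileInfo is what the GC would hand to the user-supplied policy.
type GCFileInfo struct {
	SizeBytes    int64
	GarbageRatio float64 // estimated fraction of stale entries
	LastModified time.Time
}

// GCPolicyFunc decides, per candidate value log file, whether to collect it.
type GCPolicyFunc func(info GCFileInfo) bool

func main() {
	// Example policy for a queue-like workload: only collect files that are
	// almost entirely garbage, so whole files get dropped instead of rewritten.
	var policy GCPolicyFunc = func(info GCFileInfo) bool {
		return info.GarbageRatio >= 0.95
	}
	fmt.Println(policy(GCFileInfo{SizeBytes: 8 << 20, GarbageRatio: 0.97, LastModified: time.Now()}))
}
```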

schomatis commented 6 years ago

being able to specify the GC strategy would be ideal. I think it also matches the current idea that Badger asks the user to run the GC whenever they think is ideal.

Agreed.

manishrjain commented 6 years ago

So one could configure it such that a file is collected if it is either at least 50% garbage, or more than 10% garbage and last modified more than 24h ago.

How do we know that a value log file has 50% garbage in it? We could iterate over the whole file and validate each KV pair, which is resource-intensive. Or we can sample it, which is what we're doing.

The current minimum sample size is set to 10 MB or 10K entries, for the sample to be considered "valid". We could tweak that sample size if that's all you're looking for. Or, we could enforce that value log files be at least 16 MB in size, which I think might be a better option.
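To illustrate the sampling idea (the entry type, thresholds, and function below are stand-ins for this discussion, not Badger's internals):

```go
package main

import "fmt"

type entry struct {
	sizeBytes int64
	stale     bool // value already superseded or deleted in the LSM tree
}

const (
	minSampleEntries = 10000    // stop once this many entries were seen
	minSampleBytes   = 10 << 20 // or once roughly 10 MB has been read
)

// estimateGarbageRatio reads entries until the sample is considered large
// enough, then returns the observed fraction of stale entries.
func estimateGarbageRatio(entries []entry) float64 {
	var seen, stale int
	var bytesRead int64
	for _, e := range entries {
		seen++
		bytesRead += e.sizeBytes
		if e.stale {
			stale++
		}
		if seen >= minSampleEntries || bytesRead >= minSampleBytes {
			break
		}
	}
	if seen == 0 {
		return 0
	}
	return float64(stale) / float64(seen)
}

func main() {
	sample := []entry{{1024, true}, {1024, false}, {2048, true}, {512, false}}
	fmt.Printf("estimated garbage ratio: %.2f\n", estimateGarbageRatio(sample))
}
```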

What's the reason you're running with value log file sizes so low? Can't you instead just tweak the number of entries per value log threshold?

arctica commented 6 years ago

The current minimum sample size is set to 10 MB or 10K entries, for the sample to be considered "valid". We could tweak that sample size if that's all you're looking for.

If a file is smaller than 10 MB or has fewer than 10K entries, then you effectively scan it in full and get a 100% accurate number for the amount of garbage it contains. It doesn't invalidate the sample; on the contrary, it turns the sample into an exact measurement.

Allowing arbitrarily small value log files would just make the GC more precise and let it compact smaller files, so we could trigger it more often.

Or, we could enforce that value log files be at least 16 MB in size, which I think might be a better option.

I have to disagree with that. Why would 16 MB be a better option than, say, 8 MB, 4 MB, or 32 MB? It's an arbitrary value, and Badger already chose the right approach by letting users specify the file size.

What's the reason you're running with value log file sizes so low? Can't you instead just tweak the number of entries per value log threshold?

Using the max number of entries per value log does the same thing as setting a max size. Not sure what that would change.

The reason for running with low value log file sizes is to run more efficiently with databases whose dataset doesn't necessarily grow much. This is amplified by the fact that there is currently no real way to control LSM compaction at runtime, which in turn is responsible for providing the stats that the value log GC relies upon. Smaller files would also allow the value log GC to do its work in smaller steps that can run more frequently, and therefore not cause huge spikes of activity with a big latency impact.

manishrjain commented 6 years ago

Created a PR to fix the 10MB limit. Can you try this out?

https://github.com/dgraph-io/badger/pull/501

If you want to ensure that your value log GCs are efficient, you should tweak ValueLogMaxEntries instead of ValueLogFileSize. The former is the one that truly controls how efficiently a value log is GCed.
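For reference, a minimal sketch of tuning these options and running the value log GC; the option layout follows the Badger v1.x API of that era (newer releases use a builder-style DefaultOptions(dir).With...(...) API), and the paths and numbers are placeholders:

```go
package main

import (
	"log"

	"github.com/dgraph-io/badger"
)

func main() {
	opts := badger.DefaultOptions // v1.x: a struct value, not a constructor
	opts.Dir = "/tmp/badger"
	opts.ValueDir = "/tmp/badger"
	opts.ValueLogMaxEntries = 100000 // cap entries per value log file
	// Leave ValueLogFileSize at its default rather than shrinking it.

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Reclaim value log files whose sampled garbage exceeds the given ratio.
	if err := db.RunValueLogGC(0.5); err != nil && err != badger.ErrNoRewrite {
		log.Println("value log GC:", err)
	}
}
```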

manishrjain commented 6 years ago

@arctica : Let me know if you intend to try this out (would be good if you do). Otherwise, I'll just submit this in a few days.

arctica commented 6 years ago

@manishrjain Thanks for working on this. As far as I can see, this PR changes the skip condition to "we sampled fewer than 1000 entries AND less than 7.5% of the file". There are three conditions for sampling to stop:

  1. We reached 10k entries.
  2. We've reached 10% of the file.
  3. It took more than 10 seconds.

For the new skip condition to be true, conditions 1 and 2 must not have been the reason sampling stopped; only the 10-second timeout could have been. There are of course cases where sampling takes too long but we still got enough samples to make a meaningful decision. So from that point of view, I think this PR does improve the logic significantly.
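To spell out how I read those conditions (the names, types, and constants below are mine, mirroring the numbers above, not taken from the PR):

```go
package main

import (
	"fmt"
	"time"
)

type sampleState struct {
	entries  int           // entries sampled so far
	sizeFrac float64       // fraction of the file read so far
	elapsed  time.Duration // time spent sampling
}

// stopSampling: stop once we have 10k entries, 10% of the file, or 10s elapsed.
func stopSampling(s sampleState) bool {
	return s.entries >= 10000 || s.sizeFrac >= 0.10 || s.elapsed > 10*time.Second
}

// skipFile: skip GC for this file only if the sample is too small to trust,
// i.e. fewer than 1000 entries AND less than 7.5% of the file.
func skipFile(s sampleState) bool {
	return s.entries < 1000 && s.sizeFrac < 0.075
}

func main() {
	s := sampleState{entries: 800, sizeFrac: 0.05, elapsed: 11 * time.Second}
	fmt.Println(stopSampling(s), skipFile(s)) // true (timeout) and true (sample too small)
}
```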

But at the same time, I think it's not an ideal solution. There are still usage patterns for Badger where this GC strategy won't be ideal. I'll give you one example: a queue. You write entries, and they either get deleted (via popping from the queue) or expire pretty much in order. That means we continuously write to the value log, and the tail of still-valid entries also moves forward. Ideally we'd wait until a value log file is pretty much 100% garbage and throw the whole thing away, instead of collecting parts of the file and writing the rest back to disk. One might think "OK, let's just set the GC threshold to 0.9", but that would still cause the current GC to scan files over and over for no good reason. If you want to discard garbage relatively quickly and in small amounts, then you might run the GC every 5 minutes, which means that unless a value log accumulated 90% garbage during those 5 minutes, you just waste time sampling the files. Especially because the file to be GC'd is the oldest file, yet each time we also sample another, random one.

It's not a huge issue by any means; the most important problem, the arbitrary 10 MB limit, is being addressed. What I wanted to illustrate with my comment is that giving the user the option to specify the GC logic can lead to significant efficiency improvements by avoiding unnecessary work. Of course, Badger should still retain its good-enough-for-most-cases default GC behaviour.

PS: One important piece of information for Badger users would be whether the DB gets locked (reads? writes?) while the GC runs.

PPS: I think that in order to get statistically significant numbers, stopping the scan after 1k instead of 10k entries should be good enough. In the same vein, I would put the lower threshold on entries scanned at 500 instead of 1k.

manishrjain commented 6 years ago

This should be fixed by #501 .