basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Leveldb compaction trigger #1016

Closed jogoncalves closed 3 years ago

jogoncalves commented 4 years ago

I've been running Riak 1.4 in production for several years now and it's the most stable cluster we have. However, as data piled up over the years, being unable to delete keys and reclaim disk space within an acceptable time has become a serious problem. After we issue deletes on old data, it takes about a month before leveldb starts freeing up the disk space. In the last supported Basho docs, for version 2.2.3 (https://docs.riak.com/riak/kv/latest/setup/planning/backend/leveldb/index.html), it is stated that:

Delete operations now receive priority handling in compaction selection, which means more aggressive reclaiming of disk space than in previous versions of Riak’s LevelDB backend.

Although an improvement, it still does not allow immediate space reclamation. We have started testing Riak 2.9.0 and have found that this behaviour continues. Since I can't seem to find documentation on the subject in this version, I ask that you clarify the following:

martinsumner commented 4 years ago

The grooming compaction feature that was added to leveldb is described in detail here - https://github.com/basho/leveldb/wiki/Mv-aggressive-delete.

It has only one configurable setting -

https://github.com/basho/eleveldb/blob/develop-2.9/priv/eleveldb.schema#L186-L193

It can only be made more aggressive by reducing this number, but I've never experimented with this setting.

It should happen automatically in the background, so there's no command to trigger a grooming compaction. They should go on all the time. It might be some time after upgrading before the feature takes effect, as initially none of your SST files will have a tombstone count.
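To make the selection idea concrete, here is a minimal Python sketch of threshold-based grooming. It is not the leveldb implementation: the `SstFile` class, the `grooming_candidates` function, the file names and the threshold value are all hypothetical, chosen only to show why files written before the upgrade (which carry no tombstone count) are not picked up.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold for illustration only; the real setting is the one
# linked in eleveldb.schema above.
TOMBSTONE_THRESHOLD = 1000


@dataclass
class SstFile:
    name: str
    # None models a file written before the upgrade: no tombstone count recorded.
    tombstone_count: Optional[int]


def grooming_candidates(files: List[SstFile],
                        threshold: int = TOMBSTONE_THRESHOLD) -> List[str]:
    """Return the names of files whose recorded tombstone count exceeds the threshold."""
    return [
        f.name
        for f in files
        if f.tombstone_count is not None and f.tombstone_count > threshold
    ]


files = [
    SstFile("000012.sst", None),   # pre-upgrade file, never selected on this basis
    SstFile("000340.sst", 250),
    SstFile("000341.sst", 4200),
]

print(grooming_candidates(files))                 # ['000341.sst']
print(grooming_candidates(files, threshold=100))  # ['000340.sst', '000341.sst']
```

Lowering the threshold in the second call selects more files, which is the sense in which reducing the number makes grooming more aggressive.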

Immediate space reclamation can never happen in an LSM-tree, and none of the backends in Riak support it. They all rely on some form of compaction/merge activity to recover space.
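A toy model may help show why. In the sketch below (a deliberate simplification, with the hypothetical names `ToyLsm` and `TOMBSTONE`, and no relation to the actual leveldb code), a delete is just another write: it adds a tombstone entry, so stored data grows until a compaction merges the runs and drops the dead keys.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class ToyLsm:
    """Toy LSM store: a list of immutable runs, oldest first."""

    def __init__(self):
        self.runs = []

    def flush(self, writes):
        # Each flush creates a new immutable run; older runs are never edited.
        self.runs.append(dict(writes))

    def delete(self, key):
        # A delete cannot touch older runs, so it records a tombstone instead.
        self.flush({key: TOMBSTONE})

    def get(self, key):
        # The newest run containing the key wins.
        for run in reversed(self.runs):
            if key in run:
                value = run[key]
                return None if value is TOMBSTONE else value
        return None

    def stored_entries(self):
        return sum(len(run) for run in self.runs)

    def compact(self):
        # Merge oldest to newest so newer entries overwrite older ones,
        # then drop tombstones: only now is the space actually reclaimed.
        merged = {}
        for run in self.runs:
            merged.update(run)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.runs = [live] if live else []


db = ToyLsm()
db.flush({"a": "1", "b": "2"})
db.delete("a")
after_delete = db.stored_entries()   # 3: the delete added an entry
db.compact()
after_compact = db.stored_entries()  # 1: only "b" remains
print(after_delete, after_compact)   # 3 1
```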

Leveled is more tuneable in terms of compaction. Leveled only puts keys and metadata into the LSM tree, so the LSM tree part is much smaller. The actual objects are kept in a separate journal, and that journal has its own compaction process to reclaim space.
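As a rough illustration of that split, the following Python sketch keeps a small key index separate from an append-only journal of values. It is a simplification under stated assumptions, not Leveled's design in detail: `ToyKeyValueStore` and its methods are hypothetical names, and a delete here simply removes the key from the index rather than writing a tombstone.

```python
class ToyKeyValueStore:
    """Toy key/value separation: small index of keys, large journal of values."""

    def __init__(self):
        self.journal = []   # append-only list of (sqn, key, value)
        self.ledger = {}    # key -> sequence number of its current value
        self.next_sqn = 0

    def put(self, key, value):
        self.journal.append((self.next_sqn, key, value))
        self.ledger[key] = self.next_sqn
        self.next_sqn += 1

    def delete(self, key):
        # Only the small index changes; the value stays in the journal for now.
        self.ledger.pop(key, None)

    def journal_bytes(self):
        return sum(len(value) for _, _, value in self.journal)

    def compact_journal(self):
        # Keep only journal entries the index still points at.
        self.journal = [
            (sqn, key, value)
            for sqn, key, value in self.journal
            if self.ledger.get(key) == sqn
        ]


store = ToyKeyValueStore()
store.put("k1", b"x" * 4096)
store.put("k2", b"y" * 4096)
store.put("k1", b"z" * 4096)   # supersedes the first k1 value
store.delete("k2")

before = store.journal_bytes()  # 12288: nothing reclaimed yet
store.compact_journal()
after = store.journal_bytes()   # 4096: only the live k1 value remains
print(before, after)            # 12288 4096
```

Because the bulk of the bytes sit in the journal, how often and how eagerly the journal is compacted largely determines how quickly space comes back.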

These are the configuration options the journal compaction has:

https://github.com/martinsumner/leveled/blob/master/priv/leveled.schema#L95-L149

Leveled is optimised to maximise throughput rather than to minimise latency, and is best suited to relatively large objects (> 4KB). So although it has more flexible compaction, which you can tune to free space more quickly, it tends to give Riak higher median latency under light loads.

It has configurable compression - either LZ4 (like leveldb) or zlib (the default, which can lead to less disk space occupied).

As for the final question, how to deal with never-ending space requirements: it depends. Most customers continue to scale horizontally as their cluster grows. Some have been able to run their clusters with acceptable performance on very low-cost disks, so the relative cost of storage becomes less relevant. leveldb has an additional feature here which may help - tiered storage. This allows the lower levels of the LSM tree (which tend to be accessed less frequently) to be kept on cheaper disks - https://docs.riak.com/riak/kv/latest/setup/planning/backend/leveldb/index.html#tiered-storage.