facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

Merge operation performance relative to scan #12045

Open patelprateek opened 11 months ago

patelprateek commented 11 months ago

I am trying to build an inverted index on top of RocksDB. Two implementations come to mind:

1. For a given term, store the sorted posting list in a single key-value pair: the key is the term, the value is the sorted posting list. This requires one lookup to get the posting list, and the idea is that as I stream through data I use the merge operation.
2. Create multiple keys: `term#doc_id` as the key, with a (probably empty) value. This way I would need to scan a given prefix.

For the first implementation, is there a particular merge operator I would need to implement? Basically I want a sorted list, though an unordered list would also be fine. Are there any existing operators? For the second implementation, my concern is that the scan operation might take too long.

Are there any other alternatives in general for modeling an inverted index on top of RocksDB? (I assume this is a common use case.)
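For what it's worth, one detail that matters a lot for approach 2 is how the composite key is encoded: if the doc id is appended as a fixed-width big-endian integer, RocksDB's default bytewise comparator will order postings by term and then by numeric doc id. The helpers below are a sketch of that idea only; the `#` separator, the names, and the 8-byte layout are my assumptions, not any RocksDB API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical key encoding for approach 2 ("term#doc_id" keys). A
// fixed-width big-endian doc id makes bytewise key order match numeric
// doc-id order within a term.
std::string EncodePostingKey(const std::string& term, uint64_t doc_id) {
  std::string key = term;
  key.push_back('#');
  for (int shift = 56; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((doc_id >> shift) & 0xFF));
  }
  return key;
}

// The prefix shared by every posting for `term`; used to seek and to
// decide when a prefix scan is done.
std::string TermPrefix(const std::string& term) { return term + '#'; }

uint64_t DecodeDocId(const std::string& key) {
  uint64_t doc_id = 0;
  size_t start = key.size() - 8;  // last 8 bytes are the big-endian id
  for (size_t i = 0; i < 8; ++i) {
    doc_id = (doc_id << 8) | static_cast<uint8_t>(key[start + i]);
  }
  return doc_id;
}
```

A variable-length decimal suffix would not work here, since `"term#9"` would sort after `"term#10"` bytewise.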

ajkr commented 11 months ago

Those are the two implementations I have seen.

In one use case, we tried 1. before and are trying 2. now. In our implementation of 1., a merge operand was an ordered list of posting additions and deletions. Merge operands written by users via Merge() would typically be single-element lists, while merge operands written by compaction via PartialMergeMulti() would typically be multiple-element lists. A base value was an ordered list of postings. Base values were produced by FullMergeV2(), either during bottommost compaction or a query like Get().
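To make the shape of that concrete, here is a rough sketch of what the full-merge step could look like: a sorted base posting list plus ordered add/delete operands applied oldest-to-newest. The struct and function names are illustrative only; a real `MergeOperator` would implement `FullMergeV2` and (de)serialize these lists from `Slice`s:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative merge operand entry: one posting addition or deletion.
struct PostingOp {
  uint64_t doc_id;
  bool is_delete;
};

// Apply operand lists oldest-to-newest to the base posting list,
// keeping it sorted. This mirrors the full-merge logic described above.
std::vector<uint64_t> FullMergePostings(
    std::vector<uint64_t> base,
    const std::vector<std::vector<PostingOp>>& operands) {
  for (const auto& operand : operands) {
    for (const auto& op : operand) {
      auto it = std::lower_bound(base.begin(), base.end(), op.doc_id);
      bool present = (it != base.end() && *it == op.doc_id);
      if (op.is_delete) {
        if (present) base.erase(it);
      } else if (!present) {
        base.insert(it, op.doc_id);
      }
    }
  }
  return base;
}
```

The partial-merge counterpart would just concatenate operand lists in order without touching a base value, which is why operands written by compaction end up as multiple-element lists.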

For us, calling Get() was too inefficient because it involved merging across all levels into a single result buffer. So we stopped using Get(). Instead, we used GetMergeOperands() and made the application deal with raw add/delete lists. But it was still too slow and memory-inefficient, and I/O parallelism was limited. We didn't want to spend a lot of time optimizing such a rarely used API as GetMergeOperands() and introducing related APIs like MultiGetMergeOperands(), so we changed the approach to 2.

Approach 2. definitely faces performance challenges as you mentioned. It is much worse than 1. on dimensions like memory efficiency. But, something as generic as querying a prefix feels worthwhile to optimize. I don't know how much better we will end up making it, but that is what we are trying now.
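For readers following along, the query in approach 2 amounts to seeking to the term's prefix and iterating while keys still carry it. The sketch below uses a `std::map` as a stand-in for a sorted key-value store; with RocksDB you would `Seek()` an `Iterator` to the prefix (ideally with `iterate_upper_bound` set) rather than use this container:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for a prefix scan over a sorted KV store: collect all keys
// that start with `prefix`, beginning at the first key >= prefix
// (the equivalent of Iterator::Seek(prefix)).
std::vector<std::string> ScanPrefix(
    const std::map<std::string, std::string>& store,
    const std::string& prefix) {
  std::vector<std::string> keys;
  for (auto it = store.lower_bound(prefix);
       it != store.end() &&
       it->first.compare(0, prefix.size(), prefix) == 0;
       ++it) {
    keys.push_back(it->first);
  }
  return keys;
}
```

The cost is proportional to the number of postings touched, with per-key metadata overhead on every one of them, which is exactly the dimension where approach 1's single packed value wins.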

axnsan12 commented 9 months ago

IMO MultiGetMergeOperands or some other related optimisations would clearly be the best way to improve RocksDB for posting lists specifically. The prefix scan approach has no hope of competing in cases with many documents per term, since it has to duplicate keys and incur metadata overhead for every key-value pair.

The merge operator method also has the advantage of allowing integer compression schemes like streamvbyte or bitpacking.
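To illustrate why the single-value layout compresses so well: sorted doc ids have small gaps, so delta-encoding them and storing each gap as a variable-length integer shrinks the list dramatically. The sketch below uses simple LEB128-style varints rather than streamvbyte or bitpacking (which are faster in practice); it only shows the idea, and none of it is RocksDB code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append v as a LEB128-style varint: 7 payload bits per byte, high bit
// set on all but the last byte.
void PutVarint(std::string* out, uint64_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Delta-encode a sorted posting list: store gaps between consecutive
// doc ids, which are typically small and take 1-2 bytes each.
std::string EncodePostings(const std::vector<uint64_t>& sorted_ids) {
  std::string out;
  uint64_t prev = 0;
  for (uint64_t id : sorted_ids) {
    PutVarint(&out, id - prev);
    prev = id;
  }
  return out;
}

std::vector<uint64_t> DecodePostings(const std::string& in) {
  std::vector<uint64_t> ids;
  uint64_t prev = 0;
  size_t i = 0;
  while (i < in.size()) {
    uint64_t delta = 0;
    int shift = 0;
    while (true) {
      uint8_t byte = static_cast<uint8_t>(in[i++]);
      delta |= static_cast<uint64_t>(byte & 0x7F) << shift;
      if (!(byte & 0x80)) break;
      shift += 7;
    }
    prev += delta;
    ids.push_back(prev);
  }
  return ids;
}
```

With the `term#doc_id` layout, by contrast, every doc id is stored as a full-width key plus per-key metadata, so no such scheme applies.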