BurntSushi / fst

Represent large sets and maps compactly with finite state transducers.
The Unlicense
1.77k stars 125 forks source link

key positions in fst #48

Closed alkis closed 4 years ago

alkis commented 6 years ago

Is it possible to:

If the above is possible then more possibilities open up:

BurntSushi commented 6 years ago

I think it would be much more constructive to say more about the problem you're trying to solve. You hinted at it with "this is very common when building search indices," but I'd appreciate more elaboration. For example, AFAIK, Lucene uses FSTs and doesn't use/need this feature. If you don't elaborate on the problem you're trying to solve, then we risk falling victim to the XY problem.

Have you read my blog post on the topic? http://blog.burntsushi.net/transducers/ --- Do you see a way to add positions to the FSM? I don't. The only thing I can really think of is if you layer positions into your key structure. e.g., if your key is {key bytes}{sentinel}{position}, then naively, looking up a key translates to a prefix query that should contain exactly one hit. You can then decode the position from that hit. For performance reasons, you'll probably want to implement this in terms of the fst::raw::Node structure, since in general, a prefix query will need to allocate space for a stack on the heap. As for returning a key given its position, no, I don't see a way to do that without a separate indexing structure. You might be able to store the position as the mapped value for a specific key, see #16.

alkis commented 6 years ago

One of the problems I am solving is building a file format that maps string keys to string values, ordered by lexicographic keys (think of https://github.com/google/leveldb/tree/master/table). The file should support fast iteration and fast lookup. Iteration after lookup should be possible. Typical usecases are iteration from start to end, random lookups, small forward jumps followed by iterating over some kv pairs. Many files are opened in parallel and merged together. The files are usually huge (tens/hundreds of GB) so it is imperative they can be accessed without loading them in memory. An index of mapping keys to offsets of values can be made resident in memory and it is always better if it is small. Also it would be nice if the index can also be served directly from disk (like FST does) in case memory is at a premium.

I know how to make a very efficient file format mapping positions to string values (think of it as a file backed Vec<String>). Based on that file format adding a mapping of key to position can work well. A binary tree is one option but FSTs seem to fit the usecase better because they compress better (which is desirable). They are also more powerful since they support a richer set of operations than what binary trees support.

The above is one of the usecases. Others include bimaps of String <-> int where int is on a dense domain. Examples are: term <-> id, doc <-> id.

BurntSushi commented 6 years ago

@alkis Could you say more about how your use case requires key position information? What specific operations do you want to support that require key positions?

alkis commented 6 years ago

Assume the file format described above. We have millions of (key, value) pairs where keys are ordered lexicographically. Keys are ~100 bytes and values are several KB with upper bound of 2GB. This in itself cannot be described by FST (value cannot be an arbitrary length string) unless we do something smarter.

Assume we want the power of FST to search over the key space but still get to our value. I can think of a few different ways of achieving this:

The reason I think that getting position information could be more efficient than storing the offset itself, is because there is already some kind of ordering implicit in the fst. In theory it could be leveraged to get the position but I didn't spend enough time thinking about it to find out how (if at all possible). Hence I am asking for your opinion :-)

fulmicoton commented 6 years ago

@alkis Have you solved your problem? It is similar to #16

If your function f(key) ->value is monotonic, then you can use the snippet in #16 to get your key from your value.

That means you can either store the position as you suggested (as it is monotonic by definition), or offsets if your values are stored one after the other in a file, and ordered by their keys.

jhdub23 commented 5 years ago

I've started using this excellent library, and I also have the need for getting the reverse ordinal mapping. The code snippet in #16 is useful for a monotonic f(key)->value. Unfortunately, our application cannot guarantee this.

It would be useful to be able to encode the ordinal position on each edge as an additional u64, similar to how the u64 value portion of the key-value pair is encoded. This would increase the overall size but give the reverse mapping/position capability.

BurntSushi commented 4 years ago

I'm going to close this. I think I understand the desire here, but I am personally not willing to further complicate the implementation (and API) with this. I am definitely willing to provide the reverse mapping in the monotonic case---and I think that probably covers most things---but anything more probably should just be a new library. I could potentially be accepting of something like this if someone could come up with a simple patch that were easy to maintain, but I'm skeptical of that.

alkis commented 4 years ago

@fulmicoton yes in the end it was solved with a binary tree/trie hybrid. It doesn't support the full power of an FST but only queries you can do on binary trees.