Closed: alkis closed this issue 4 years ago.
I think it would be much more constructive to say more about the problem you're trying to solve. You hinted at it with "this is very common when building search indices," but I'd appreciate more elaboration. For example, AFAIK, Lucene uses FSTs and doesn't use/need this feature. If you don't elaborate on the problem you're trying to solve, then we risk falling victim to the XY problem.
Have you read my blog post on the topic? http://blog.burntsushi.net/transducers/ --- Do you see a way to add positions to the FSM? I don't. The only thing I can really think of is if you layer positions into your key structure. e.g., if your key is {key bytes}{sentinel}{position}, then naively, looking up a key translates to a prefix query that should contain exactly one hit. You can then decode the position from that hit. For performance reasons, you'll probably want to implement this in terms of the fst::raw::Node structure, since in general, a prefix query will need to allocate space for a stack on the heap. As for returning a key given its position, no, I don't see a way to do that without a separate indexing structure. You might be able to store the position as the mapped value for a specific key; see #16.
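To make the layering idea concrete, here is a minimal sketch of the {key bytes}{sentinel}{position} encoding, using a plain sorted Vec<Vec<u8>> as a stand-in for the fst. The 0x00 sentinel and big-endian position encoding are assumptions, and all names are illustrative:

```rust
use std::convert::TryInto;

// Layer the position into the key itself: {key bytes}{sentinel}{position}.
// Assumes raw keys never contain the 0x00 sentinel byte; a real scheme
// would need an escaping rule instead.
fn encode(key: &[u8], position: u64) -> Vec<u8> {
    let mut buf = key.to_vec();
    buf.push(0x00);
    buf.extend_from_slice(&position.to_be_bytes()); // big-endian preserves order
    buf
}

// Looking up a key becomes a prefix query that should contain exactly one hit.
fn lookup(index: &[Vec<u8>], key: &[u8]) -> Option<u64> {
    let mut prefix = key.to_vec();
    prefix.push(0x00);
    // Binary search for the first entry >= prefix (the fst would do a range query).
    let start = index.partition_point(|e| e.as_slice() < prefix.as_slice());
    let entry = index.get(start)?;
    if !entry.starts_with(&prefix) {
        return None;
    }
    // Decode the position from the single hit.
    let pos_bytes: [u8; 8] = entry[prefix.len()..].try_into().ok()?;
    Some(u64::from_be_bytes(pos_bytes))
}

fn main() {
    let mut index = vec![encode(b"cherry", 2), encode(b"apple", 0), encode(b"banana", 1)];
    index.sort();
    assert_eq!(lookup(&index, b"banana"), Some(1));
    assert_eq!(lookup(&index, b"durian"), None);
}
```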
One of the problems I am solving is building a file format that maps string keys to string values, ordered lexicographically by key (think of https://github.com/google/leveldb/tree/master/table). The file should support fast iteration and fast lookup, and iteration after lookup should be possible. Typical use cases are iteration from start to end, random lookups, and small forward jumps followed by iterating over some key-value pairs. Many files are opened in parallel and merged together. The files are usually huge (tens or hundreds of GB), so it is imperative that they can be accessed without loading them into memory. An index mapping keys to value offsets can be made resident in memory, and the smaller it is the better. It would also be nice if the index could be served directly from disk (as the FST can), in case memory is at a premium.
I know how to make a very efficient file format mapping positions to string values (think of it as a file-backed Vec<String>). Given that file format, adding a mapping of key to position can work well. A binary tree is one option, but FSTs seem to fit the use case better because they compress better (which is desirable). They are also more powerful, since they support a richer set of operations than binary trees do.
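A minimal sketch of such a position-to-string file format, held in memory here for simplicity; on disk the same two-section layout (offset table plus data blob) could be mmap'd. The StringTable name and layout are illustrative, not a real format:

```rust
// A "file-backed Vec<String>" sketch: an offset table plus a concatenated
// data blob. Value i occupies data[offsets[i]..offsets[i + 1]].
struct StringTable {
    offsets: Vec<u64>, // offsets.len() == number of values + 1
    data: Vec<u8>,
}

impl StringTable {
    fn build<'a, I: IntoIterator<Item = &'a str>>(values: I) -> Self {
        let mut offsets = vec![0u64];
        let mut data = Vec::new();
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len() as u64);
        }
        StringTable { offsets, data }
    }

    // Random access by position is two offset reads plus one slice.
    fn get(&self, position: usize) -> Option<&str> {
        let start = *self.offsets.get(position)? as usize;
        let end = *self.offsets.get(position + 1)? as usize;
        std::str::from_utf8(&self.data[start..end]).ok()
    }
}

fn main() {
    let t = StringTable::build(["foo", "bar", "quux"]);
    assert_eq!(t.get(1), Some("bar"));
    assert_eq!(t.get(3), None); // out of range
}
```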
The above is one of the use cases. Others include bimaps of String <-> int, where the int is drawn from a dense domain. Examples are: term <-> id, doc <-> id.
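As a sketch of the dense-domain bimap case: if ids are assigned densely in insertion order, the id of a term is just its index in a Vec, so one direction is a hash lookup and the reverse is an array index. All names are illustrative:

```rust
use std::collections::HashMap;

// String <-> id bimap over a dense id domain.
struct Bimap {
    terms: Vec<String>,        // id -> term (the id is the index)
    ids: HashMap<String, u32>, // term -> id
}

impl Bimap {
    fn new() -> Self {
        Bimap { terms: Vec::new(), ids: HashMap::new() }
    }

    // Returns the existing id, or assigns the next dense id.
    fn get_or_insert(&mut self, term: &str) -> u32 {
        if let Some(&id) = self.ids.get(term) {
            return id;
        }
        let id = self.terms.len() as u32;
        self.terms.push(term.to_string());
        self.ids.insert(term.to_string(), id);
        id
    }

    fn term(&self, id: u32) -> Option<&str> {
        self.terms.get(id as usize).map(|s| s.as_str())
    }
}

fn main() {
    let mut m = Bimap::new();
    assert_eq!(m.get_or_insert("doc1"), 0);
    assert_eq!(m.get_or_insert("doc2"), 1);
    assert_eq!(m.get_or_insert("doc1"), 0); // already present
    assert_eq!(m.term(1), Some("doc2"));
}
```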
@alkis Could you say more about how your use case requires key position information? What specific operations do you want to support that require key positions?
Assume the file format described above. We have millions of (key, value) pairs where the keys are ordered lexicographically. Keys are ~100 bytes and values are several KB, with an upper bound of 2GB. This in itself cannot be represented by an FST (the value cannot be an arbitrary-length string) unless we do something smarter.
Assume we want the power of FST to search over the key space but still get to our value. I can think of a few different ways of achieving this:
1. Store the [start, end) offsets as the value (assuming we extend the fst to support 2x u64 as the value).
2. Store the start offset, and at that offset store the length of the value, from which we can derive [start, end).
3. Store the start offset of each value, plus some gymnastics to get the offset of the next key, from which we can derive [start, end).
4. Store the position (assuming this is much cheaper than encoding start itself), plus an offset map which maps position to start offset. The latter is thoroughly researched and there are many high-performance, quasi-optimal implementations of such representations, for example: https://github.com/ot/ds2i.

The reason I think that getting position information could be more efficient than storing the offset itself is that there is already some kind of ordering implicit in the fst. In theory it could be leveraged to get the position, but I didn't spend enough time thinking about it to find out how (if it is possible at all). Hence I am asking for your opinion :-)
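As an illustration of option 2, here is a minimal sketch in which a Vec<u8> stands in for the data file and a u32 length prefix is assumed (2GB values fit in a u32; all names are illustrative):

```rust
use std::convert::TryInto;

// Option 2: the index stores only the start offset; the first four bytes at
// that offset hold the value's length, so [start, end) is derivable on read.
fn append_value(file: &mut Vec<u8>, value: &[u8]) -> u64 {
    let start = file.len() as u64;
    file.extend_from_slice(&(value.len() as u32).to_le_bytes()); // length prefix
    file.extend_from_slice(value);
    start
}

fn read_value(file: &[u8], start: u64) -> Option<&[u8]> {
    let start = start as usize;
    let len_bytes: [u8; 4] = file.get(start..start + 4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    file.get(start + 4..start + 4 + len)
}

fn main() {
    let mut file = Vec::new();
    let a = append_value(&mut file, b"hello");
    let b = append_value(&mut file, b"world!");
    assert_eq!(read_value(&file, a), Some(&b"hello"[..]));
    assert_eq!(read_value(&file, b), Some(&b"world!"[..]));
}
```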
@alkis Have you solved your problem? It is similar to #16.
If your function f(key) -> value is monotonic, then you can use the snippet in #16 to get your key from your value. That means you can store either the position as you suggested (since it is monotonic by definition), or offsets, if your values are stored one after the other in a file and ordered by their keys.
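A minimal sketch of that monotonic reverse lookup, with parallel sorted slices standing in for the fst map and its key stream (the names and offset values are illustrative): because the values increase along the lexicographic key order, the position of a value can be recovered by binary search, and the key read back at that position.

```rust
// Reverse lookup for a monotonic f(key) -> value: binary search the values
// for the target, then read the key at the resulting position.
fn key_for_value<'a>(keys: &'a [&str], values: &[u64], v: u64) -> Option<&'a str> {
    debug_assert!(values.windows(2).all(|w| w[0] <= w[1])); // monotonicity assumption
    let i = values.partition_point(|&x| x < v);
    if values.get(i) == Some(&v) {
        keys.get(i).copied()
    } else {
        None
    }
}

fn main() {
    let keys = ["apple", "banana", "cherry"]; // lexicographic order
    let values = [10, 25, 40]; // monotone in key order, e.g. file offsets
    assert_eq!(key_for_value(&keys, &values, 25), Some("banana"));
    assert_eq!(key_for_value(&keys, &values, 26), None);
}
```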
I've started using this excellent library, and I also have the need for getting the reverse ordinal mapping. The code snippet in #16 is useful for a monotonic f(key)->value. Unfortunately, our application cannot guarantee this.
It would be useful to be able to encode the ordinal position on each edge as an additional u64, similar to how the u64 value portion of the key-value pair is encoded. This would increase the overall size but give the reverse mapping/position capability.
I'm going to close this. I think I understand the desire here, but I am personally not willing to further complicate the implementation (and API) with this. I am definitely willing to provide the reverse mapping in the monotonic case---and I think that probably covers most things---but anything more probably should just be a new library. I could potentially be accepting of something like this if someone could come up with a simple patch that were easy to maintain, but I'm skeptical of that.
@fulmicoton Yes, in the end it was solved with a binary tree/trie hybrid. It doesn't support the full power of an FST, only the queries you can do on binary trees.
Is it possible to:

1. Implement a lookup like contains that, instead of a bool, returns an Option<usize>, where None means not found and Some(pos) is the position of the key in the lexicographic order of the keys represented by the automaton?
2. Implement ops::Index for Set, such that myset[pos] returns the key at pos?

If the above is possible then more possibilities open up:
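For reference, the two requested operations, rank (key to position) and select (position to key), can be sketched over a plain sorted slice standing in for the set. This ignores the fst entirely and is only meant to pin down the intended semantics:

```rust
// Rank: like contains, but returning the key's position in lexicographic
// order instead of a bool.
fn rank(keys: &[&str], key: &str) -> Option<usize> {
    keys.binary_search(&key).ok()
}

// Select: the ops::Index-style direction, returning the key at `pos`.
fn select<'a>(keys: &'a [&str], pos: usize) -> Option<&'a str> {
    keys.get(pos).copied()
}

fn main() {
    let keys = ["apple", "banana", "cherry"]; // must be sorted
    assert_eq!(rank(&keys, "banana"), Some(1));
    assert_eq!(rank(&keys, "durian"), None);
    assert_eq!(select(&keys, 2), Some("cherry"));
    assert_eq!(select(&keys, 3), None);
}
```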