Open ccollie opened 1 month ago
p.s. I realize i could use the prefix
iterator to gather all metrics with a given label. I also wanted a separate key to account for labels with high cardinality values (which admittedly is not a good idea)
Hey @ccollie, since ART is a prefix tree when you try to insert a key that is prefix of other this will result in a error during insertion. For example "hell" is prefix of "hello", so when inserting there is not enough bytes to distinguish between this two.
If you know your labels are ascii only use a CString
(you can convert from String
<-> CString
), this allows you to use the insert
method, since CString
have a null byte at the end it's impossible for them to be prefix of each other. If the strings need to be utf-8, the paper recommends using sort strings ucol_getSortKey
, unfortunatly icu4x doesn't implement somthing similar to this, so you need to use the C bindings.
The trait tha dictates if a type is safe to not use the try_*
methods is NonPrefixBytes. As you can see this trait is implement for types that are sure to not generate prefixes.
If you somehow find a way to encode your utf-8 strings you can use BytesMapping to map your keys. Or create a type and implement AsBytes for this type
Metric names and values are utf-8. I'm hesitant to bring in something like ICU for this though. What do you think of my idea of using prefix searching to aggregate values ? In that case, keys would be like
region=us-east-1 region=us-west-1 environment=staging | production | test etc
Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring
IDK if you are going to have more keys than this, but it's obvious that this is ascii only so CString
would work. But yeah using prefix iterator is a good idea, recently @declanvk reimplemented it and it's even faster, it's not yet published, but it will at some point.
Using a sentinel byte is a good idea and to be type safe I would recommend using BytesMapping
.
Also in the new version we will have a range
iterator, we also have a fuzzy
iterator, they may help you with your task. Also happy to see someone giving this crate a shot.
If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile with target-cpu=native
to use simd
Also in the new version we will have a
range
iterator, we also have afuzzy
iterator, they may help you with your task. Also happy to see someone giving this crate a shot. If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile withtaget-cpu=native
to use simd
Very nice ! I'm also a nightly
user, and since it's a metric library, I'm targeting simd as well.
Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.
Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring
Yeah typically if you're going to have variable length keys, you're going to need some sort of suffix value that is not present in any other part of the key. I like your idea of using non-UTF8 sentinel byte, or you could append a \0
if you're confident that character will never be present in your inputs.
This does make the string a sorta subset of UTF-8.
Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.
At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators
it's not yet published, but it will at some point.
I'm working on this this week, I'll try to cut a release quickly!
@declanvk was faster than me haha, but yeah we don't keep track of this number only at the root, but with the new implementation doing a count
at the prefix iter should be way faster, since it would be basically iterating over a linked list
Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.
At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators
I wasn't expecting tracking descendents, only that we iterate nodes and sum its children. I'll put in a request...
I released version 0.3.0, let me know what you think! Any other problems or questions are welcome. I was also thinking about some usability improvements on the tree APIs have the where K: Borrow<Q> + AsBytes, Q: AsBytes
bounds, if either of you had thoughts about that
I have a library that needs to index Prometheus like metrics (e.g. call_latency(instance="101", region="us-east1", env="staging"} Each metric is indexed by name as well as labels (region="us-east1"), with each entry being a roaring bitmap of the metric ids containing the label.
One reason for looking at an ART is because of path compression, since there is great redundancy in label names. The proposed schema is "${label}_${value}" -> bitmap # all metrics with the key value pair "${label} -> bitmap # all metrics with a given label, irrespective of value
I've tried to implement this in BLART but it seems like this is not supported as I get Prefix errors when adding "region" then "region=us-east1" (or the inverse).
My assumption was that it functioned as a trie, which means my use case would be possible. Is this a valid assumption or am I using the API incorrectly (
try_insert
)