declanvk / blart

https://declanvk.github.io/blart/
Apache License 2.0
28 stars 2 forks source link

Expected Usage Question #37

Open ccollie opened 1 month ago

ccollie commented 1 month ago

I have a library that needs to index Prometheus like metrics (e.g. call_latency(instance="101", region="us-east1", env="staging"} Each metric is indexed by name as well as labels (region="us-east1"), with each entry being a roaring bitmap of the metric ids containing the label.

One reason for looking at an ART is because of path compression, since there is great redundancy in label names. The proposed schema is "${label}_${value}" -> bitmap # all metrics with the key value pair "${label} -> bitmap # all metrics with a given label, irrespective of value

I've tried to implement this in BLART but it seems like this is not supported as I get Prefix errors when adding "region" then "region=us-east1" (or the inverse).

My assumption was that it functioned as a trie, which means my use case would be possible. Is this a valid assumption or am I using the API incorrectly (try_insert)

ccollie commented 1 month ago

p.s. I realize i could use the prefix iterator to gather all metrics with a given label. I also wanted a separate key to account for labels with high cardinality values (which admittedly is not a good idea)

Gab-Menezes commented 1 month ago

Hey @ccollie, since ART is a prefix tree when you try to insert a key that is prefix of other this will result in a error during insertion. For example "hell" is prefix of "hello", so when inserting there is not enough bytes to distinguish between this two.

If you know your labels are ascii only use a CString (you can convert from String <-> CString), this allows you to use the insert method, since CString have a null byte at the end it's impossible for them to be prefix of each other. If the strings need to be utf-8, the paper recommends using sort strings ucol_getSortKey, unfortunatly icu4x doesn't implement somthing similar to this, so you need to use the C bindings.

Gab-Menezes commented 1 month ago

The trait tha dictates if a type is safe to not use the try_* methods is NonPrefixBytes. As you can see this trait is implement for types that are sure to not generate prefixes. If you somehow find a way to encode your utf-8 strings you can use BytesMapping to map your keys. Or create a type and implement AsBytes for this type

ccollie commented 1 month ago

Metric names and values are utf-8. I'm hesitant to bring in something like ICU for this though. What do you think of my idea of using prefix searching to aggregate values ? In that case, keys would be like

region=us-east-1 region=us-west-1 environment=staging | production | test etc

ccollie commented 1 month ago

Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring

Gab-Menezes commented 1 month ago

IDK if you are going to have more keys than this, but it's obvious that this is ascii only so CString would work. But yeah using prefix iterator is a good idea, recently @declanvk reimplemented it and it's even faster, it's not yet published, but it will at some point. Using a sentinel byte is a good idea and to be type safe I would recommend using BytesMapping.

Gab-Menezes commented 1 month ago

Also in the new version we will have a range iterator, we also have a fuzzy iterator, they may help you with your task. Also happy to see someone giving this crate a shot. If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile with target-cpu=native to use simd

ccollie commented 1 month ago

Also in the new version we will have a range iterator, we also have a fuzzy iterator, they may help you with your task. Also happy to see someone giving this crate a shot. If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile with taget-cpu=native to use simd

Very nice ! I'm also a nightly user, and since it's a metric library, I'm targeting simd as well.

ccollie commented 1 month ago

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

declanvk commented 1 month ago

Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring

Yeah typically if you're going to have variable length keys, you're going to need some sort of suffix value that is not present in any other part of the key. I like your idea of using non-UTF8 sentinel byte, or you could append a \0 if you're confident that character will never be present in your inputs.

This does make the string a sorta subset of UTF-8.

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators

it's not yet published, but it will at some point.

I'm working on this this week, I'll try to cut a release quickly!

Gab-Menezes commented 1 month ago

@declanvk was faster than me haha, but yeah we don't keep track of this number only at the root, but with the new implementation doing a count at the prefix iter should be way faster, since it would be basically iterating over a linked list

ccollie commented 1 month ago

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators

I wasn't expecting tracking descendents, only that we iterate nodes and sum its children. I'll put in a request...

declanvk commented 1 month ago

I released version 0.3.0, let me know what you think! Any other problems or questions are welcome. I was also thinking about some usability improvements on the tree APIs have the where K: Borrow<Q> + AsBytes, Q: AsBytes bounds, if either of you had thoughts about that