declanvk / blart

https://declanvk.github.io/blart/
Apache License 2.0
26 stars 2 forks source link

[Feature Request] Method to count entries starting with a prefix #38

Open ccollie opened 1 week ago

ccollie commented 1 week ago

I'd like to be able to get a count of entries starting with a given prefix. Currently its possible iterating through a prefix iterator and counting elements, however we can optimize this by visiting nodes and summing children.

Gab-Menezes commented 1 week ago

I might be wrong but just summing the number o children from each node would give the wrong result, imagine a case where we have N4 (Leaf, N4 (Leaf, Leaf)), summing the number o children of each node would result in 4, but it's clear that the number of children is 3. Adding a new field to header would increase memory consumption, since the header already uses 22 bytes with no padding between fields. So adding a new u32 would mean a 18.18% increase in memory consumption of the header, and also would make insertions/deletions harder, since this value is recursively accumulated up to the root. Also given how the new prefix iterator works, we basically don't iterate over the tree it self, we just find the min and max leafs with the prefix and do a range iterator over it, which is basically a linked list iteration.

Also iterating over the tree it self would probably be slower than over the leafs with the next pointer, since we need to keep a vec as a stack.

declanvk commented 1 week ago

Whoops, I meant to post a comment here a day ago and lost it.


I agree with @Gab-Menezes that the most likely implementation of this would be to add a new field to the header of internal nodes to track the number of leaf node descendants. This would imply some changes to the insert and delete operations, so that the counter would be incremented or decremented on a successful operation.

On the pros side, then we would have the ability to cheaply implement ExactSizeIterator for a larger number of iterators, like for fuzzy_*, prefix_*, or range_*. I think this would be beneficial for cases where:

  1. We want to know the number of entries in a given iterator (this feature request)
  2. May improve performance when collecting these iterators into a new collection (for collections that support pre-allocating with specific amount of capacity, like Vec::with_capacity)
ccollie commented 1 week ago

The other performance related reason is that it allows me to decide on the most efficient way to iterate the collection. In my case, I run matching functions (including regexes) against items matching a prefix.

Above a certain count, it's worth it to switch to optimized predicates, but those require a compilation step which can then be amortized over the length of longer iterators.

Gab-Menezes commented 1 week ago

IDK how I feel about this, this will definitely make insertions/deletions way slower, since right now we just find the writing point and apply the changes. With this we somehow need to find a way to keep track of all the parents while inserting/deleting (basically creating a stack, or doing recursion and both have performance implications) to increment this number if the insertion was successful.

ccollie commented 1 week ago

If this makes general case insertions/deletions measurably slower, I'd be Ok without it.