KaiserKarel / set-trie

Fast subset and superset queries based on tries.
MIT License

Idea: Speed up some queries significantly by maintaining copies of the trie with different orderings over keys? #8

Open makoConstruct opened 1 year ago

makoConstruct commented 1 year ago

For superset queries of small sets whose keys are high in the ordering, I notice that the algorithm has to explore every branch of the trie that sits below the query's lowest key in the ordering, which in that case is most of the branches. So I'm wondering if it would be worth maintaining three copies of the set trie internally, each using a different hashing of the keys (each hashing imposing a different, random order). A key that's high in the natural ordering is unlikely to be high in both of the other orderings as well: the smallest of three random values will usually be below average, and often quite small.

Each key has three hashings. The new algorithm would only search in the trie corresponding to the lowest of the key's hashings.
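If anyone wants a starting point, here's a minimal Python sketch of the routing idea (all names hypothetical; each "hashing" is modeled as a random permutation of the key space, and a query is routed to the copy where its worst key sits lowest, which for a single-key query is just the lowest of that key's hashings):

```python
import random

def make_orderings(keys, num_copies=3, seed=0):
    """Assign every key a position in `num_copies` independent random orderings,
    one ordering per trie copy."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(num_copies):
        shuffled = list(keys)
        rng.shuffle(shuffled)
        orderings.append({key: pos for pos, key in enumerate(shuffled)})
    return orderings

def pick_trie(query_keys, orderings):
    """Return the index of the trie copy in which the query's highest key
    (under that copy's ordering) is as low as possible."""
    def worst_position(ordering):
        return max(ordering[k] for k in query_keys)
    return min(range(len(orderings)), key=lambda i: worst_position(orderings[i]))
```

The superset query would then run only against the copy that `pick_trie` selects, using that copy's ordering.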

I wrote a little deno script to find out how much of an impact this could make. Here's its output. For example, once you have 6 copies (6 hashings), 84% of keys will have a minimum hashing that's lower than MAX_KEY/4 (i.e., in the lowest 25% of all keys).

The impact of keeping multiple copies on avoiding *high* positions is much greater: with 9 hashings, none of the 500 trials lacked a hash that was lower than MAX_KEY/2. This feels like an off-by-one bug??
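For what it's worth, an analytic check (assuming each hashing places a key approximately uniformly at random) is consistent with both observations: the chance that the minimum of k uniform positions falls below a fraction f of MAX_KEY is 1 - (1-f)^k, so 6 hashings give about an 82% chance of landing in the lowest quarter, and with 9 hashings each key misses MAX_KEY/2 with probability 2^-9 ≈ 0.2%, so seeing zero misses in 500 trials is quite plausible rather than an off-by-one:

```python
def prob_min_below(k, frac):
    """P(min of k i.i.d. uniform positions < frac * MAX_KEY)."""
    return 1 - (1 - frac) ** k

print(round(prob_min_below(6, 0.25), 3))  # 0.822, close to the observed 84%
print(round(prob_min_below(9, 0.5), 4))   # 0.998 per key
```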

Regardless, that's my thought budget for this expended for the day. I might come back and implement this and think more about whether this would be practical, but I might not. (It'll depend on whether it turns out that I can really use set tries in what I'm doing.)

makoConstruct commented 1 year ago

If someone does decide to do this, remember to give each leaf node (which represents one set) links to the leaf nodes representing the same set in the other copies. From a leaf node, these pointers can be used to facilitate efficient removals.
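The cross-linking could look something like this (a hedged sketch; `Leaf` and the removal callback are hypothetical, standing in for whatever node type and per-copy removal routine the real implementation uses):

```python
class Leaf:
    """Leaf node representing one stored set in one trie copy."""
    def __init__(self, key_set):
        self.key_set = frozenset(key_set)
        self.siblings = []  # leaves for the same set in the other copies

def link_siblings(leaves):
    """Cross-link the leaves that represent the same set across all copies."""
    for leaf in leaves:
        leaf.siblings = [other for other in leaves if other is not leaf]

def remove_everywhere(leaf, remove_from_copy):
    """Remove the set from every copy by following the sibling links,
    so a removal found via one ordering propagates to the others."""
    remove_from_copy(leaf)
    for sibling in leaf.siblings:
        remove_from_copy(sibling)
```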

It just occurred to me that the gains I modeled above only apply to queries over single-element sets (a type of query that set tries aren't especially well suited to in the first place). As soon as your query set has more than one element, the relevant quantity becomes the lowest, across hashings, of the query's *largest* key, which makes it rare for a query set to sit especially low under any hashing. This makes the technique much less interesting, to the extent that I wouldn't strongly recommend trying this, and the maintainer is welcome to close this issue if they don't see potential here either.
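A quick Monte Carlo sketch backs this up (idealizing each hashed position as uniform in [0,1)): with 3 copies, a single-key query's routed position averages about 0.25, but for a 4-key query the min-over-copies of the largest position only drops from about 0.8 (one copy) to roughly 0.66, a much smaller win:

```python
import random

def min_of_max(num_copies, query_size, trials=10_000, seed=0):
    """Estimate E[min over copies of (max hashed position in the query set)],
    with positions modeled as i.i.d. uniform in [0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(
            max(rng.random() for _ in range(query_size))
            for _ in range(num_copies)
        )
    return total / trials
```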