Tessil / hat-trie

C++ implementation of a fast and memory efficient HAT-trie
MIT License
792 stars 114 forks source link

[feature request] htrie_map keep values' order #12

Closed tqtifnypmb closed 6 years ago

tqtifnypmb commented 6 years ago

There's a use cases that all values are in specific order. And I wan to extract values by prefix , while keep values in correct order.

Tessil commented 6 years ago

Hello,

I'm not sure to understand your request. What is the specific order? You want to retrieve the values matching a prefix in lexicographical order?

tqtifnypmb commented 6 years ago

Hi. Basically I want to retrieve values in the same order I put them. I don't want to sort them again after they're retrieved, since I guess sorting should happen along with retrieving for sake of performance.

It seems that this requirement can be fulfilled by allowing user to specify the underly container of Trie leaf node. But it will certainly hurt performance too.

Tessil commented 6 years ago

Sorry, I'm still a bit confused. You want to extract all the values matching a prefix, but you want them to be sorted by order of insertion?

If this is right, it'll be quite complicate. The HAT-trie is a partially ordered structure (by lexicographical order) so that doing a prefix search is not too slow. You thus need to sort the values in their order of insertion when you retrieve them, you could have a tsl::htrie_map<char, std::size_t> where std:size_t represents a counter for the order of insertion (that you manage). You can then use equal_prefix_range, put them in a vector, and sort them on this counter.

Allowing to specify a custom container like tsl::ordered_map/set in the leaf node could give you a boost in performance, but if the prefix matches items in multiple hash nodes, you'd still need to do some sort (and maintain a counter). It's a quite convoluted solution, I'm not sure the HAT-trie is the best structure, Boost multi-index container may be more suited.

tqtifnypmb commented 6 years ago

Thanks for your time!

To make it clear you can think about word frequency, I want to extract words by prefix and want them to be sorted by frequency. Using tsl::htrie_map<char, std::size_t> or something like that involves a extract-sort procedure which will be too slow to be useful, because the number of words sharing same prefix is usually large.

I agree that general HAT-trie maybe not suit to address this problem, a custom trie maybe more suited

Tessil commented 6 years ago

So you would need something like std::map as leaf node (or boost::flat_map which could be better if you don't perform a lot of modifications). If the results of a prefix spans multiple leafs node (or intermediate nodes with value), you would still need to do some merge sorts of multiple nodes to get all the matches sorted by frequency. It would also become more complicate if you increment the frequency counter during the life of the container as you would have to move the value in the binary tree/array.

As this solution is too specific I will not modify the current HAT-trie implementation, but you could fork and make the modifications. If something is unclear in the code, warn me.

But I wonder if Boost MultiIndex could provide a solution. One index ordered by the std::string and one by the frequency. Otherwise a custom trie for your problem may be the way.