Tessil / hat-trie

C++ implementation of a fast and memory efficient HAT-trie
MIT License
795 stars 114 forks source link

Question: efficient set intersect of multiple HAT-tries #25

Open bigerl opened 4 years ago

bigerl commented 4 years ago

This is more a question, than a real issue.

I am wondering, what is the most efficient way to intersect the keys of two or more hat-tries.

My naive approach would be to iterate over the keys of the HAT-trie with the smallest size and probe the others for the key. e.g.:

std::set<std::string> intersect(std::vector<tsl::htrie_set<char>> operands) {
    // Argument shouldn't be copied or changed. Only for demonstration purposes.
    sort(operands.begin(), operands.end(), 
        [&](const auto & a, const auto & b) { 
            return a.size() < b.size(); 
    });
    auto iterated_operand = operands[0];
    operands.erase(0);
    std::set<std::string> result{};
    for(auto it = iterated_operand.begin(); it != iterated_operand.end(); ++it) {
        bool skip = false;
        for(const auto &probe_operand : operands)
            if (not probe_operand.count(it.key())){
                skip = true;
                break;
            }
        if (not skip) 
            result.insert(it.key());
    }
    return result;
}
Tessil commented 4 years ago

Hi,

It would be possible to optimize this by skipping whole prefix branches. If you currently are on a branch where every key has the prefix 'foo' and the another trie has no key with this prefix, you can then skip the whole branch immediately.

Unfortunately the current interface doesn't provide enough access to the internal of the iterator to do so. I have to check how to properly provide a more flexible iterator

Thibaut

bigerl commented 4 years ago

The approach sounds interesting. With the prefix-skipping it would be very fast for a bunch of cases. Actually, it allows to calculate the empty cut of two sets without common prefixes in O(1). Do you have any plans to implement such an extended iterator or optimized set intersection in the near future?

Tessil commented 4 years ago

I plan to extend the API of the iterator but it may take time before I start to work on it, 2 to 3 months, as I'm currently on holiday and I'll be a bit busy when I get back home.