efficient / libcuckoo

A high-performance, concurrent hash table
Other
1.61k stars 275 forks source link

Possible performance improvement (use hash-only comparisons) #82

Closed rayrapetyan closed 7 years ago

rayrapetyan commented 7 years ago

I would like to ask something before suggesting a patch.

There are multiple places where key_eq() is being used. That means libcuckoo always stores a full copy of the original key and performs a full key comparison to find an entry. At the same time, it uses hash and partial hash to speed-up searches.

My question is: could we drop original key (and key_eq()) completely and always use a hash-comparison only? I believe amount of collisions for 64-bit hashes is really insignificant and should drop to zero for 128-bit hashes...

This should significantly improve performance and free-up a massive amount of memory (especially when users are using strings as keys).

What are your thoughts on that? Thanks!

manugoyal commented 7 years ago

Hey Robert, great question! libcuckoo is intended to be a general-purpose hash table, that provides all the functionality you'd see in a normal hash table (like std::unordered_map), but with high performance under concurrent workloads.

To that end, dropping the keys and only storing hashes may not be in the best interest of the table, as it removes users' ability to iterate over key-value pairs and inspect them, and also introduces some degree of probability into the various queries that could be made. These guarantees of success may fluctuate a lot on different workloads, especially since users are allowed to specify their own hash functions that could be less than optimal.

However, there might be ways to achieve what you're looking for without resorting to modifying the table. One option could be to simply hash the string before you insert it into the table, so that you're storing a mapping from hash to value. This would have the fast lookup guarantees you're looking for, as well as the memory savings. Another option could be to check out https://github.com/efficient/cuckoofilter, if you're use-case fits more that of a bloom filter.

Let me know if any of these alternatives sound good. Thanks! -Manu

rayrapetyan commented 7 years ago

Thanks, I'm already using keys like:

 struct BaseHashKey {
    inline bool operator==(const BaseHashKey &other) const {
        return hash == other.hash;
    }
    struct Hasher {
        xxHash operator()(const BaseHashKey &k) const {
            return k.hash;
        }
    };
    xxHash hash;
};

struct FeedIpKey : public BaseHashKey {
    FeedIpKey(const FeedId &fid, const std::string &ip) {
        hash = XXH64(ip.c_str(), ip.size(), fid);
    }
};

in my code.

Yes, inability to restore the original key may be a fatal problem (I can think about only one real-world scenario - dump a table to restore it in some other place where full keys may have some sense, e.g. in SQL DB). Other popular use-cases (e.g. dump table to restore it later in memory) still doesn't require full keys.

But you are right, giving users more freedom (even with a possibility of a very ineffective usage) is better than cut some (though rare) use-cases , I agree. Thanks for your quick response. Closing ticket!

dnbaker commented 7 years ago

If you're not interested in resolving collisions, then a probabilistic data structure would be a more appropriate solution. I've known of unconventional hash tables which only store signatures of the keys, accepting some some fraction of false positives which are specialized for very low memory overhead.

rayrapetyan commented 7 years ago

"signatures of the keys" - means a smaller hash of a bigger hash ?)