efficient / libcuckoo

A high-performance, concurrent hash table
Other
1.62k stars 275 forks source link

able to allow duplicated keys #96

Closed zhouqingqing closed 6 years ago

zhouqingqing commented 6 years ago

Hi,

Do you think it is possible to allow insertion with duplicated keys? This will enable much broader usage. UPSERT might be affected, but I think it is ok to disable it in this case.

Thanks, Qingqing

dnbaker commented 6 years ago

I don’t think that’s how a hash table works, but you could use a container for values, like a vector or set.

zhouqingqing commented 6 years ago

Agree that's not by definition. But if I use a vector/set as the value, I need to have a concurrency protection on them, which is sub-optimal.

Implementation wise, is there any fundamental reason makes it difficult to allow duplicates? Usage wise, we can introduce a template parameter enable/disable the behavior.

manugoyal-nuro commented 6 years ago

I think using a vector/set as the value is probably the easiest thing to do. There is no need for concurrency protection within the vector, because any thread that is doing an insert on that key will have a lock on the key/value pair while it's being modified. So multiple writers to the same key will execute their transactions in sequence.

Native duplicates are hard to support (in the way that std::unordered_multiset might), because we have fixed-sized buckets, so if you were to insert more than say 8 of the same key into the table, we would never be able to find a place to store them because they all hash to the same set of buckets (unless maybe we augmented each key with some sort of sequence id somehow, but that's not trivial to implement).

In any case, I think the only disadvantage to using a vector/set is that you can't perform find/insert to the same key concurrently (one will necessarily execute after the other). But this seems like sensible behavior for most applications, and I think supporting concurrent find/inserts on the same key would require an entirely different table design to truly support.

zhouqingqing commented 6 years ago

Thanks for the answer and sorry not get back earlier.

If I use vector/set as the value, I understand insert() is safe (or upsert() more accurately), but how about find()? Once I am out of find(), there is no lock protection, thus not safe to traverse the vector/set. If there is a function like find_lockheld() meaning the lock is still hold when it is out, then that shall work.

dnbaker commented 6 years ago

find is fine if you're just accessing it as a const std::set<T> &key since it's unmodifiable. You'd have to do atomic operations on the object to make it safe if not by const reference.

zhouqingqing commented 6 years ago

Sorry, not sure if I got it:

Say we have a cuckoohash_map<int, std::vector *>:

How to make thread 2 vector traverse safe? One way is to introduce find_hockheld(), or extends find() to find(functor F) like upsert(F) so I can do traverse with lock hold. The other way is to use some thread-safe vector container to replace std::vector AFAICS.

manugoyal-nuro commented 6 years ago

Hi Qingqing, I think the function you're looking for is find_fn (documented here http://efficient.github.io/libcuckoo/classcuckoohash__map.html#ad04cb720c850e6feda25a4b99fefba28). The function you provide will be run while under a lock, so traversals and modifications to the vector will be thread safe.

zhouqingqing commented 6 years ago

Thanks - surely I overlooked find_fn()!