efficient / libcuckoo

A high-performance, concurrent hash table

High contention under many reader threads on hot keys #159

Open serkanturgut opened 1 year ago

serkanturgut commented 1 year ago

Hi,

In our application, we are observing high spin times inside the spinlock. The typical case where we see the contention is many threads (64+ threads on a machine with 32 physical cores and hyper-threading enabled; the contention starts to show up after 16 threads) all contending in the find API, trying to acquire the spinlock for the same keys in the hashtable.
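
For reference, here is a minimal sketch of the access pattern that shows the problem (the key, thread count, and iteration count are illustrative, and it assumes a recent libcuckoo where cuckoohash_map lives in the libcuckoo namespace):

    #include <libcuckoo/cuckoohash_map.hh>
    #include <thread>
    #include <vector>

    int main() {
      libcuckoo::cuckoohash_map<int, int> map;
      map.insert(42, 0);  // a single hot key

      // 64+ reader threads on a 32-physical-core machine, all calling find()
      // on the same key, so they all contend on the same spinlock.
      std::vector<std::thread> readers;
      for (int t = 0; t < 64; ++t) {
        readers.emplace_back([&map] {
          int out;
          for (int i = 0; i < 1000000; ++i) {
            map.find(42, out);
          }
        });
      }
      for (auto &r : readers) {
        r.join();
      }
      return 0;
    }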

We tried adding exclusive/shared locking semantics to the spinlock implementation (which we have tested internally and confirmed that it reduces spinlock wait times significantly under high concurrency with mostly read-only threads), and we'd like to get your opinion on this.
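
For context, the shared/exclusive semantics we are after are roughly along these lines (a bare-bones sketch of the idea only, not our actual implementation, which also has to preserve the extra per-lock bookkeeping the real spinlock carries):

    #include <atomic>
    #include <cstdint>

    // Bare-bones reader/writer spinlock: readers add kReader to a shared
    // counter, a writer CASes the whole word from 0 to kWriter. This sketch
    // ignores writer fairness/starvation, which a real implementation would
    // need to consider.
    class rw_spinlock {
     public:
      void lock_shared() {
        for (;;) {
          uint32_t cur = state_.load(std::memory_order_relaxed);
          if (!(cur & kWriter) &&
              state_.compare_exchange_weak(cur, cur + kReader,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            return;
          }
        }
      }

      void unlock_shared() {
        state_.fetch_sub(kReader, std::memory_order_release);
      }

      void lock() {  // exclusive
        uint32_t expected = 0;
        while (!state_.compare_exchange_weak(expected, kWriter,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
          expected = 0;  // CAS updates expected on failure; reset it
        }
      }

      void unlock() {
        state_.store(0, std::memory_order_release);
      }

     private:
      static constexpr uint32_t kWriter = 1u;  // low bit: writer held
      static constexpr uint32_t kReader = 2u;  // readers counted above bit 0
      std::atomic<uint32_t> state_{0};
    };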

I created a PR for your assessment https://github.com/efficient/libcuckoo/pull/158/commits/a67ae05047f48995fae691603f499a42dbcae561

We are not currently using this change in production (so it's not battle-tested, so to speak). We'd like to understand the original reasoning behind the single-mode spinlock implementation, and whether the implementation in the PR violates any assumptions of the hashtable.

Please also note that there were two lines I had to remove from the lock_two function for the normal_mode_shared specialization:

    rehash_lock<kIsLazy>(l1);
    rehash_lock<kIsLazy>(l2);

These two lines were calling mutating functions downstream even when the find API was used. To be honest, I couldn't understand why they were exercised even on the read path, so please advise whether it was OK to remove them for the read-only find API. (If it was intentional for the find APIs to make these calls when calling lock_two, can you help us understand why?)

Note that I ran the existing unit tests and stress tests, and they seem to be passing. If you think the PR is reasonable, I can add more stress testing focused on the locking-mode aspect of the different API calls.

A similar issue I found was https://github.com/efficient/libcuckoo/issues/146, but the proposal there was different, so I wanted to post this separately.

Thank you in advance, Serkan

thompsonbry commented 1 year ago

This is a big problem for us too. Any feedback on this topic would be appreciated.

manugoyal commented 1 year ago

Hi @serkanturgut and @thompsonbry, thank you for bringing this back up, and for contributing an implementation! I will try to take a closer look over the next few weeks. To answer a few of your questions:

Please also note that there were two lines I had to remove from the lock_two function for the normal_mode_shared specialization:

    rehash_lock<kIsLazy>(l1);
    rehash_lock<kIsLazy>(l2);

These two lines were calling mutating functions downstream even when the find API was used. To be honest, I couldn't understand why they were exercised even on the read path, so please advise whether it was OK to remove them for the read-only find API. (If it was intentional for the find APIs to make these calls when calling lock_two, can you help us understand why?)

The reason we might mutate the table during a find is the "lazy rehashing" feature of the table: when we resize the table due to a too-high load factor, we don't immediately move all the elements from the old array to the new one. Instead, the first acquirer of each lock is responsible for completing the transfer for all buckets covered by that lock. This amortizes the cost of resizing across more operations and unlocks the table more quickly.
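
To make that concrete, the mechanism has roughly this shape (a toy sketch, not the actual libcuckoo code; the std::mutex and the member names stand in for the real per-lock state):

    #include <mutex>
    #include <vector>

    // Toy sketch of lazy rehashing: one lock guards a group of buckets, and
    // whoever acquires it first after a resize moves that group's entries
    // from the old array to the new one before doing its own operation.
    struct BucketGroup {
      std::mutex lock;
      bool migrated = false;         // plays the role of lock.is_migrated()
      std::vector<int> old_entries;  // entries still sitting in the old array
      std::vector<int> new_entries;  // entries already transferred
    };

    void acquire_and_maybe_migrate(BucketGroup &group) {
      std::lock_guard<std::mutex> guard(group.lock);
      if (!group.migrated) {
        // The first acquirer after the resize completes the transfer, even
        // if it is only doing a find(); this is why the read path can mutate.
        group.new_entries.insert(group.new_entries.end(),
                                 group.old_entries.begin(),
                                 group.old_entries.end());
        group.old_entries.clear();
        group.migrated = true;
      }
      // ... perform the actual read or write on new_entries ...
    }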

I guess in theory we could defer migrating a particular lock to only the write operations on the table, but it would complicate the implementation a fair bit, since now the read operations on unmigrated locks would have to look in old_buckets using hashes appropriate to the old buckets. Since this whole resizing thing should only matter when doing a lot of inserts, I think it'd be simpler to migrate the lock regardless of operation type, and maybe have the reader take an exclusive lock if the lock hasn't been migrated yet. But I'm open to trying things out.

A similar issue I found was https://github.com/efficient/libcuckoo/issues/146, but the proposal there was different, so I wanted to post this separately.

Indeed, this was a similar issue, and my only hesitation was about allowing users to plug in arbitrary locking implementations. Since the lock implementation is so important to performance, I was worried that giving up control of it would mean losing the ability to keep track of performance.

I'm definitely open to adding another lock implementation that favors read-mostly workloads. It would be great to see if we can show the new lock improving performance on the universal_benchmark with a sufficiently high percentage of reads.

serkanturgut commented 1 year ago

Thank you for the clarifications @manugoyal

I guess in theory we could defer migrating a particular lock to only the write operations on the table, but it would complicate the implementation a fair bit

Instead, we could perhaps preserve the same design, but start the read-only APIs with a shared lock and only upgrade to exclusive access if the read path decides a mutation is needed. It should be easy to add an upgrade_to_x_lock() API to my spinlock implementation in the PR. If we can detect the point where a mutation will be needed on the read path, we can just switch the locking mode. I'm not sure about the complexity of this integration, though.
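
For what it's worth, on an atomic reader/writer spinlock the upgrade would probably have to be a try-style operation that only succeeds when the caller is the sole reader, something like the following (illustrative only, reusing the kReader/kWriter encoding from the sketch in my first comment, not actual PR code):

    #include <atomic>
    #include <cstdint>

    // Illustrative try-upgrade for an atomic reader/writer spinlock. The
    // upgrade can only succeed when the caller is the *only* reader; if two
    // readers both wait for an unconditional upgrade they block each other
    // forever, so the operation has to be able to fail and fall back (e.g.
    // release the shared lock and reacquire exclusively).
    class rw_spinlock {
     public:
      // lock_shared()/unlock_shared()/lock()/unlock() as in the earlier sketch.
      bool try_upgrade_to_x_lock() {
        uint32_t expected = kReader;  // exactly one reader: the caller itself
        return state_.compare_exchange_strong(expected, kWriter,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
      }

     private:
      static constexpr uint32_t kWriter = 1u;
      static constexpr uint32_t kReader = 2u;
      std::atomic<uint32_t> state_{0};
    };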

It would be great to see if we can show the new lock improving performance on the universal_benchmark with a sufficiently high percentage of reads.

Thank you for the pointer, I'll try this soon and get back with the results.

manugoyal commented 1 year ago

I guess in theory we could defer migrating a particular lock to only the write operations on the table, but it would complicate the implementation a fair bit

Instead, we could perhaps preserve the same design, but start the read-only APIs with a shared lock and only upgrade to exclusive access if the read path decides a mutation is needed. It should be easy to add an upgrade_to_x_lock() API to my spinlock implementation in the PR. If we can detect the point where a mutation will be needed on the read path, we can just switch the locking mode. I'm not sure about the complexity of this integration, though.

Yeah, that sounds great. The only thing is I'd worry about deadlock in case multiple readers acquire the lock and then one tries to upgrade. Another approach would be to peek at the lock.is_migrated() boolean and grab an exclusive lock only if it's false (i.e., the lock hasn't been migrated yet). We probably don't need to worry about downgrading the lock after migration, or about a stale read of the flag, since it should be a one-off thing.
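
Roughly what I have in mind (purely illustrative; std::shared_mutex and an atomic bool stand in for the table's actual spinlock and its is_migrated() state):

    #include <atomic>
    #include <mutex>
    #include <shared_mutex>

    // Illustrative only: decide the lock mode up front by peeking at the
    // migration flag, instead of upgrading a shared lock mid-operation.
    struct LockSlot {
      std::shared_mutex rw_lock;          // stands in for the table's spinlock
      std::atomic<bool> migrated{false};  // stands in for lock.is_migrated()
    };

    void reader_path(LockSlot &slot) {
      if (slot.migrated.load(std::memory_order_acquire)) {
        // Common case: already migrated, a shared lock suffices for find().
        std::shared_lock<std::shared_mutex> guard(slot.rw_lock);
        // ... do the read ...
      } else {
        // Not migrated yet (or a stale read of the flag): take the lock
        // exclusively so the lazy rehash can run before the read. A stale
        // read just costs one extra exclusive acquisition, a one-off.
        std::unique_lock<std::shared_mutex> guard(slot.rw_lock);
        // ... migrate the buckets covered by this lock if still needed,
        //     then do the read ...
      }
    }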