boundary / high-scale-lib

A fork of Cliff Click's High Scale Library. Improved with bug fixes and a real build system.

Typo in get_impl? #6

Open rescrv opened 10 years ago

rescrv commented 10 years ago

I believe there is a typo in get_impl here: https://github.com/boundary/high-scale-lib/blob/master/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMap.java#L540

The line should instead read K == TOMBSTONE.

You'll note that key is what the user passed in, and users should never try to retrieve a TOMBSTONE. In fact, I think Java's type safety prevents them from even getting a reference to the TOMBSTONE.

This typo can affect both the safety and the efficiency of the get operation, as the hash table is no longer linearizable. A write that is then marked with a TOMBSTONE and copied to the new table will be set to TOMBSTONE. If the copying and the get race, the get could see a null and return it, even though it should instead begin looking in the next table. It's a small race, but it's there.

It's also less efficient to reprobe up to reprobe_limit on larger tables, but what's a few extra cycles among friends ;-).
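
To make the difference concrete, here is a minimal single-threaded sketch of a probe loop over a chain of tables. It is a toy model and not the library's code: `ProbeSketch`, `Table`, and the `checkSlotKey` flag are names invented for illustration, and the real `get_impl` also handles Prime-boxed values and a reprobe limit. With `checkSlotKey` set, the loop compares the key stored in the slot (`K`) against `TOMBSTONE` and moves on to the next table; without it, the loop compares the caller's key, which is never `TOMBSTONE`, so it probes onward and can stop at an empty slot.

```java
final class ProbeSketch {
    // Sentinel stored in a key slot that the copier has closed (assumed model).
    static final Object TOMBSTONE = new Object();

    /** One table in the chain: flat key/value slots plus a link to the newer table. */
    static final class Table {
        final Object[] keys;
        final Object[] vals;
        Table next; // newer table, or null when no resize is in flight

        Table(int capacity) { // capacity must be a power of two
            keys = new Object[capacity];
            vals = new Object[capacity];
        }
    }

    /**
     * Linear-probe lookup. checkSlotKey == true tests the slot's key K
     * against TOMBSTONE; false tests the caller's key, as in the reported typo.
     */
    static Object get(Table t, Object key, boolean checkSlotKey) {
        while (t != null) {
            int mask = t.keys.length - 1;
            int idx = key.hashCode() & mask;
            for (int probes = 0; probes < t.keys.length; probes++) {
                Object K = t.keys[idx];          // key stored in the slot
                if (K == null) return null;      // clear miss: slot never claimed
                if (K != TOMBSTONE && K.equals(key)) return t.vals[idx];
                Object compared = checkSlotKey ? K : key;
                if (compared == TOMBSTONE) break; // slot closed: continue in the next table
                idx = (idx + 1) & mask;
            }
            t = t.next; // closed slot or probes exhausted: try the newer table
        }
        return null;
    }

    public static void main(String[] args) {
        Table oldTable = new Table(4);
        Table newTable = new Table(4);
        oldTable.next = newTable;

        int idx = "a".hashCode() & 3;
        oldTable.keys[idx] = TOMBSTONE; // copier closed the slot in the old table
        newTable.keys[idx] = "a";       // the key now lives in the new table
        newTable.vals[idx] = 1;

        System.out.println(get(oldTable, "a", true));  // follows the chain to the new table
        System.out.println(get(oldTable, "a", false)); // stops at an empty slot in the old table
    }
}
```

In this model the second lookup misses a key that is present in the newer table, which is the kind of result the race described above would produce.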

rescrv commented 10 years ago

Ditto for putIfMatch.

rescrv commented 10 years ago

There are a couple other race conditions as well. If this lib is actively used, I'm happy to report them, but I'd like to avoid typing them up if the effort would be wasted.

moonpolysoft commented 10 years ago

Yes please do, it's in active use in a number of different places.

rescrv commented 10 years ago

Here are the other major "gotcha" cases I found. For reference, my C++ implementation is here and is what we're using in HyperDex now.

The resize method makes a chain of inner tables. Although it's extremely unlikely, it's possible for the recursive putIfMatch call to overrun the stack. I saw this in an application with more threads than cores, where one thread was forced to wait to run. By the time it ran, the other threads had constructed many new tables that the global table had promoted past. These intermediary tables were necessarily filled with tombstones, but the straggler thread would still attempt to resize them using the copy helper. Of course, this copy helper would step down to the next table and repeat. Eventually it overran the stack.

Tuning the table resize rate can significantly decrease the likelihood of this race condition. A more solid fix, which I use in my implementation, is to record the resize number at which each inner table was established. Upon entry to the putIfMatch call, I skip ahead to the top-most table accessible from the outer hash map. This allows a straggler to always work on a copy of the inner table where it can do useful work, without scanning tables that are definitely fully copied.
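
The skip-ahead idea can be sketched roughly as follows. This is an assumption about the general shape, not the C++ implementation mentioned above: `EpochSkipSketch`, `Table.epoch`, `resizeAndPromote`, and `startingTable` are hypothetical names, and the copying work itself is left out. Each table records the resize number at which it was created, and a thread holding a stale table jumps straight to the outer map's current table instead of stepping through every intermediate one.

```java
final class EpochSkipSketch {
    /** Inner table tagged with the resize number at which it was created. */
    static final class Table {
        final long epoch;    // resize number (hypothetical field)
        volatile Table next; // newer table in the chain, or null

        Table(long epoch) { this.epoch = epoch; }
    }

    // Top-most table reachable from the outer map.
    private volatile Table top = new Table(0);

    Table top() { return top; }

    /** Simulates one completed resize: chain a fresh table and promote it. */
    Table resizeAndPromote() {
        Table old = top;
        Table fresh = new Table(old.epoch + 1);
        old.next = fresh;
        top = fresh;
        return fresh;
    }

    /**
     * Called on entry by a thread holding a possibly stale table.
     * If the outer map has moved past it, start from the current top
     * rather than walking (or recursing through) the whole chain.
     */
    Table startingTable(Table stale) {
        Table current = top;
        return stale.epoch < current.epoch ? current : stale;
    }

    public static void main(String[] args) {
        EpochSkipSketch map = new EpochSkipSketch();
        Table stale = map.top(); // a straggler captures the table, then stalls

        for (int i = 0; i < 100_000; i++) {
            map.resizeAndPromote(); // other threads keep resizing meanwhile
        }

        Table start = map.startingTable(stale);
        System.out.println(start == map.top()); // true
        System.out.println(start.epoch);        // 100000
    }
}
```

The straggler reaches the live table in one comparison, so the depth of the chain behind it no longer matters for stack usage.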

I also thought the counter implementation was racy during a resize, but it looks like it's doing the right thing.

rescrv commented 10 years ago

The other issue I forgot about and didn't include was the "clear" call. It doesn't behave well with resizes, especially stacked resizes. I opted to remove it completely.