crossbeam-rs / crossbeam

Tools for concurrent programming in Rust

ArcCell not lock-free #160

Open jethrogb opened 6 years ago

jethrogb commented 6 years ago

The current implementation of ArcCell is not lock-free. It uses a spin lock. I very much want to use this functionality, but I don't think this implementation should be included in crossbeam in its current form (and frankly, the fact that it is included is quite misleading).
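For context, here is a rough sketch of the pattern at issue (a paraphrase, not the exact crossbeam source): get() claims the slot by swapping in a zero sentinel and spins until that succeeds, so one stalled reader blocks every other reader and writer.

```rust
use std::marker::PhantomData;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Spin-lock-style cell: 0 marks the slot as "taken", so every get()
// briefly excludes all other readers and writers.
struct SpinArcCell<T> {
    ptr: AtomicUsize, // Arc<T> stored as raw bits, or 0 while held
    _marker: PhantomData<Arc<T>>,
}

impl<T> SpinArcCell<T> {
    fn new(value: Arc<T>) -> Self {
        SpinArcCell {
            ptr: AtomicUsize::new(Arc::into_raw(value) as usize),
            _marker: PhantomData,
        }
    }

    fn get(&self) -> Arc<T> {
        // Spin until we win the slot; a descheduled holder stalls everyone.
        let raw = loop {
            match self.ptr.swap(0, Ordering::Acquire) {
                0 => std::hint::spin_loop(),
                n => break n,
            }
        };
        let arc = unsafe { Arc::from_raw(raw as *const T) };
        let result = Arc::clone(&arc);
        // Restore the pointer, releasing the "lock". (A real implementation
        // also needs set() and a Drop impl; both are omitted here.)
        self.ptr.store(Arc::into_raw(arc) as usize, Ordering::Release);
        result
    }
}

fn main() {
    let cell = SpinArcCell::new(Arc::new(7u32));
    assert_eq!(*cell.get(), 7);
}
```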

ghost commented 6 years ago

You're right. We should probably implement ArcCell<T> using differential reference counting, where a transient count packed alongside the pointer absorbs reader increments so the shared strong count is touched only rarely. That would also require implementing a custom Arc<T> type, though; the one from the standard library wouldn't be compatible.

Do you perhaps have an example of how you want to use ArcCell<T>? I'm asking to see what exactly you need so that we can accommodate your use case as well as possible.

jethrogb commented 6 years ago

I've needed something like this over and over again to store program-global configuration: gets are frequent and sets are extremely rare. For example, https://docs.rs/log4rs/0.7.0/src/log4rs/lib.rs.html#446-449
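Concretely, the shape of that use case might look like this (a sketch; ArcCell's module path and exact signatures varied across crossbeam versions):

```rust
use crossbeam::sync::ArcCell; // module path may differ by crossbeam version
use std::sync::Arc;

struct Config {
    log_level: u8,
}

fn main() {
    // Read-mostly configuration: many concurrent get()s, rare set()s.
    let config = ArcCell::new(Arc::new(Config { log_level: 3 }));

    // Hot path: every reader takes a cheap immutable snapshot...
    let snapshot = config.get();
    assert_eq!(snapshot.log_level, 3);

    // ...while an occasional writer publishes a whole new Config.
    config.set(Arc::new(Config { log_level: 4 }));
    assert_eq!(config.get().log_level, 4);
}
```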

Alternatively, I think you can easily implement this using 128-bit atomics on platforms that support them. But I don't think those are available on stable Rust.
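A rough sketch of that idea, assuming nightly's AtomicU128; the layout and names here are hypothetical. A double-word CAS lets a reader bump a transient count and simultaneously verify which pointer the count applies to:

```rust
#![feature(integer_atomics)] // 128-bit atomics are nightly-only, as noted above

use std::sync::atomic::{AtomicU128, Ordering};

// Hypothetical layout: low 64 bits = pointer, high 64 bits = count of
// readers that have grabbed the pointer but not yet charged the Arc's
// real strong count (the "differential" part of differential counting).
struct PackedCell {
    word: AtomicU128,
}

impl PackedCell {
    // Reader fast path: one CAS both bumps the transient count and proves
    // the pointer is still the one the count applies to.
    fn acquire(&self) -> u64 {
        let mut cur = self.word.load(Ordering::Acquire);
        loop {
            let bumped = cur.wrapping_add(1u128 << 64);
            match self.word.compare_exchange_weak(
                cur,
                bumped,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return cur as u64, // raw pointer bits for the caller
                Err(actual) => cur = actual,
            }
        }
    }
}
```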

Vtec234 commented 6 years ago

I've attempted a partially wait-free version of ArcCell in #81, but note that it is currently slightly incorrect. Changing the faulty ordering (see PR thread) should fix it, so you could try it out.

That said, crossbeam is currently undergoing a major reorganisation, and new types with similar functionality might appear in the future (and will, if people need them).

ghost commented 6 years ago

@Vtec234 I'm slightly worried that wait-free get() operations might starve set() operations forever. Wouldn't it be better to simply use parking_lot::RwLock<Arc<T>> instead?
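A minimal sketch of that alternative, assuming the parking_lot crate as a dependency:

```rust
use parking_lot::RwLock;
use std::sync::Arc;

// Reads take a brief shared lock, just long enough to clone the Arc;
// writes take the exclusive lock to swap in a replacement. parking_lot's
// RwLock is designed to avoid writer starvation.
struct RwArcCell<T> {
    inner: RwLock<Arc<T>>,
}

impl<T> RwArcCell<T> {
    fn new(value: T) -> Self {
        RwArcCell { inner: RwLock::new(Arc::new(value)) }
    }

    fn get(&self) -> Arc<T> {
        self.inner.read().clone()
    }

    fn set(&self, value: T) {
        *self.inner.write() = Arc::new(value);
    }
}

fn main() {
    let cell = RwArcCell::new(1u32);
    assert_eq!(*cell.get(), 1);
    cell.set(2);
    assert_eq!(*cell.get(), 2);
}
```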

@jethrogb Actually, I believe a primitive tentatively called HazardBox<T> might be a better fit for your use case. But it doesn't exist yet. :) The idea is to protect the shared object by thread-local hazard pointers to avoid cache invalidation (instead of updating a globally shared value on every read).
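The read protocol behind that idea, in a deliberately minimal sketch (a real HazardBox would also manage per-thread slot registration and deferred reclamation):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// One thread-local hazard slot; writers must scan all slots before freeing.
struct HazardSlot {
    protected: AtomicPtr<()>,
}

// Publish-then-validate: after advertising the pointer in our slot, re-read
// the source. If it still matches, no writer can have freed it (the writer
// would have seen our slot during its scan); otherwise retry.
fn protected_load<T>(src: &AtomicPtr<T>, slot: &HazardSlot) -> *mut T {
    loop {
        let p = src.load(Ordering::Acquire);
        slot.protected.store(p.cast(), Ordering::SeqCst);
        if src.load(Ordering::SeqCst) == p {
            return p; // safe to dereference until the slot is cleared
        }
    }
}

fn main() {
    let value = Box::into_raw(Box::new(5u32));
    let src = AtomicPtr::new(value);
    let slot = HazardSlot { protected: AtomicPtr::new(std::ptr::null_mut()) };
    let p = protected_load(&src, &slot);
    assert_eq!(unsafe { *p }, 5);
    slot.protected.store(std::ptr::null_mut(), Ordering::Release); // done reading
    unsafe { drop(Box::from_raw(value)) };
}
```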

Vtec234 commented 6 years ago

@stjepang That should work, yes. My implementation is essentially a very simple rw-lock, and I'm sure much better implementations are possible (maybe based on HP, like you said).

vorner commented 6 years ago

Hello

I played with this a bit and I have an implementation that is lock-free (wait-free for the readers), and it shouldn't be possible to starve the writers. It also works with the standard Arc. I could release it as a separate crate, but I think the code would be better cared for somewhere in crossbeam (I'm not sure where exactly it fits, or whether there's a specific sub-crate for it).

So, if you'd like to accept it, I can create a pull request and place it somewhere (and rename it to ArcCell, to keep the consistency with the current implementation).

What do you think about it?

http://github.com/vorner/arc-swap
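For reference, the use pattern it exposes looks roughly like this (a sketch; arc-swap's exact API surface has shifted over versions):

```rust
use arc_swap::ArcSwap;
use std::sync::Arc;

fn main() {
    // Read-mostly cell: readers take a cheap guard, writers atomically
    // publish a replacement Arc; neither can block the other indefinitely.
    let config = ArcSwap::from(Arc::new(String::from("initial")));

    let snapshot = config.load(); // guard, dereferences through to the value
    assert_eq!(snapshot.as_str(), "initial");
    drop(snapshot);

    config.store(Arc::new(String::from("updated")));
    assert_eq!(config.load().as_str(), "updated");
}
```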

ghost commented 6 years ago

@vorner Hey, that looks neat! Designing an atomic shared pointer (like AtomicReference in Java) in a language without GC is a surprisingly difficult problem, and your approach is quite interesting.

I think the biggest drawback of ArcSwap (and of atomic_shared_ptr in C++, for that matter) is that readers all update the same atomic counters and therefore suffer contention. In my benchmarks, loading from an uncontended ArcSwap takes 28 ns, but that climbs to 180 ns when another thread is loading from the same ArcSwap at the same time. The numbers would probably be similar with atomic_shared_ptr.

Consider the fact that loading from AtomicReference in Java takes less than 10 ns (it's just a typical load instruction on x86 machines), and our numbers quickly pale in comparison.

The solution is to not update any shared counters when loading from ArcSwap. In other words, loading shouldn't return an Arc<T>, but something like ArcSwapGuard<'a, T> instead, which can be dereferenced into a &T for reading or "upgraded" into a real Arc<T> if the user so wishes. I'm prototyping something similar here, if you're interested. Loading takes around 25 ns and doesn't degrade at all in the presence of concurrent readers. And there is no locking whatsoever: reading and updating are always lock-free.
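A toy model of that API shape, with every name hypothetical and a plain borrow standing in for the hazard-protected pointer:

```rust
use std::ops::Deref;
use std::sync::Arc;

// Hypothetical guard: dereferences to &T for free, and only touches the
// shared strong count if the caller explicitly upgrades to a real Arc.
struct ArcSwapGuard<'a, T> {
    inner: &'a Arc<T>, // stand-in for a hazard-protected raw pointer
}

impl<'a, T> Deref for ArcSwapGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &**self.inner
    }
}

impl<'a, T> ArcSwapGuard<'a, T> {
    // The refcount is bumped only here, on demand.
    fn upgrade(&self) -> Arc<T> {
        Arc::clone(self.inner)
    }
}

fn main() {
    let data = Arc::new(42u32);
    let guard = ArcSwapGuard { inner: &data };
    assert_eq!(*guard, 42); // read with no counter traffic
    let owned: Arc<u32> = guard.upgrade(); // pay for an Arc only when needed
    assert_eq!(*owned, 42);
}
```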

vorner commented 6 years ago

Reading about the hazard-pointer approach, it does look interesting. But I'm a bit nervous about two things: first, that an update may force a reader to loop, so the slow-path operation hurts the hot-path one, possibly leaving a reader stuck for a long time (or forever, under a steady inflow of changes); and second, that the chains of registries and thread entries can grow long, and I don't think they ever shrink back in the code. Anyway, it is well possible this will be faster than my approach.

Is there any estimate on when your approach could be ready? I ask because I want to know if it makes sense for me to clean it up, add all the license files, and publish it, or (in case hazard-cell will be ready soon in crossbeam) if it's not worth it.

ghost commented 6 years ago

the fact that an update may force a reader to loop (therefore the slow-path operation hurting the hot-path one, possibly making it stuck for a long time/forever with a steady inflow of changes)

If updates are so frequent that they manage to force reads to spin in a loop, then updates kind of are the hot-path. :) But I see what you're saying, it's a fair concern that readers aren't wait-free.

that the chains of registries and thread entries can grow to long ones and I don't think they ever shrink back in the code.

Yes, the current prototype isn't tackling this problem yet. Shrinking the linked lists back down is in the plan.

Is there any estimate on when your approach could be ready?

There are other things on my schedule at the moment, but I'd say in a few weeks.

I ask because I want to know if it makes sense for me to clean it up, add all the license files, and publish it, or (in case hazard-cell will be ready soon in crossbeam) if it's not worth it.

If you have a use case for ArcSwap, go ahead! In any case, we'll definitely want to benchmark our approaches and see how they compare. I don't expect either one to be a win in all scenarios (for example, ArcSwap will almost surely have faster uncontended updates).

acsearle commented 6 years ago

I'll throw my hat in the ring too. I have a reasonably complete AtomicArc now using differential reference counting, inspired by Folly::AtomicSharedPtr. I suspect it is close to what @stjepang was envisioning in the second comment. It has the same problem of reads degrading under contention, has to use its own reimplementation of Arc, and will only work on x86_64 and (hopefully) AArch64 using their spare pointer bits. On the brighter side, it also provides AtomicWeak, has no thread_local or global structures, and no requirements for the user to call anything outside the std::sync::atomic interface.

https://github.com/acsearle/barc

Might be an interesting comparison point if nothing else.

ghost commented 6 years ago

@acsearle Thanks for sharing -- that looks really good, I'm impressed!

So I've just whipped up a simple set of benchmarks comparing the performance of the load operation across barc, a hazard-pointer-based Arc, and a mutex-protected Arc. Results:

running 6 tests
test conc_load_barc   ... bench:         141 ns/iter (+/- 28)
test conc_load_hazard ... bench:          19 ns/iter (+/- 1)
test conc_load_mutex  ... bench:          39 ns/iter (+/- 25)
test load_barc        ... bench:          17 ns/iter (+/- 1)
test load_hazard      ... bench:          19 ns/iter (+/- 2)
test load_mutex       ... bench:          13 ns/iter (+/- 2)

I wonder what you think about using spare pointer bits in light of near-future migrations to 56-bit address spaces on x86-64 (and maybe even 64-bit). That seems like an unfortunate portability nightmare. :( There was an interesting discussion about this topic in the LuaJIT repo.

jethrogb commented 6 years ago

@acsearle have you considered an implementation using double-word atomics? You should be able to test with 64-bit atomics on 32-bit targets while we wait for https://github.com/rust-lang/rust/pull/53514 to land

acsearle commented 6 years ago

Wow, the performance of the parking_lot Mutex is phenomenal! Is it enough to just .lock() the mutex, or would the better comparison be .lock().clone(), so that we get and destroy a usable object per loop, as returned by AtomicWeightedArc::load() and (I think) hazard's .get()? The disheartening result on contested load is consistent with what I saw in my own jury-rigged benchmarks: inherently slow, since each load has to win a compare_exchange.
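The two candidate baselines, side by side (a sketch, assuming the parking_lot Mutex guessed at above):

```rust
use parking_lot::Mutex;
use std::sync::Arc;

fn main() {
    let cell = Mutex::new(Arc::new(0u64));

    // Baseline A: only take and release the lock; the Arc never leaves the
    // critical section, so there is no refcount traffic at all.
    {
        let guard = cell.lock();
        let _peek: &u64 = &guard; // read through the guard in place
    }

    // Baseline B: clone an Arc out per iteration, matching what
    // AtomicWeightedArc::load() must produce and later destroy.
    let owned: Arc<u64> = cell.lock().clone();
    drop(owned); // the matching decrement happens here, outside the lock
}
```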

I wasn't aware that losing those bits was imminent (there's a relatively recent talk by Chandler Carruth where he discusses how LLVM uses them). That's unfortunate. Regarding other architectures, I have thought about it a little. For both of these questions it should be fairly easy to experiment, since it requires only a platform-specific reimplementation of CountedPtr and CountedNonNull. Another option is using alignment bits, perhaps over-aligning the ArcInner allocation to get more of them (if the Rust allocator supports this yet). But yes, as we lose bits, we have to replenish more frequently, and the number of threads that can remain lock-free degrades until we become a spinlock.
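A sketch of the spare-bits packing under discussion (the layout is hypothetical). On today's x86-64, 48-bit virtual addressing leaves the top 16 bits of a canonical user pointer free; that is exactly the budget that shrinks as wider address spaces arrive:

```rust
// Hypothetical CountedPtr-style layout: pointer in the low 48 bits, a small
// "local" count riding in the top 16. As address spaces widen, COUNT_BITS
// shrinks and the count saturates (forcing a replenish) sooner.
const COUNT_BITS: u32 = 16;
const COUNT_SHIFT: u32 = 64 - COUNT_BITS;
const PTR_MASK: u64 = (1u64 << COUNT_SHIFT) - 1;

fn pack(ptr: u64, count: u16) -> u64 {
    debug_assert_eq!(ptr & !PTR_MASK, 0, "pointer intrudes on count bits");
    ptr | ((count as u64) << COUNT_SHIFT)
}

fn unpack(word: u64) -> (u64, u16) {
    (word & PTR_MASK, (word >> COUNT_SHIFT) as u16)
}

fn main() {
    let fake_ptr = 0x0000_7fff_dead_b000_u64;
    let word = pack(fake_ptr, 3);
    assert_eq!(unpack(word), (fake_ptr, 3));
}
```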

Anyway, my concern (having learned a lot from implementing it) is that this style of implementation, independent of whether the count bits are in the pointer or a second word, can never be competitive on contested loads with implementations that manage to use an atomic read for that purpose, and so is a bit of a dead end :(

jonhoo commented 4 years ago

ArcCell was removed in #301, so I assume this should be closed?

jethrogb commented 4 years ago

From #301:

We should really implement a better version of this primitive

Alternatively, this issue could become the feature request?

acsearle commented 4 years ago

I've put in a draft pull request of an AtomicArc using deferred coalesced reference counting, for discussion, but I seem to have screwed up linking it to this issue. It should be considerably faster than my 2018 attempt: atomic loads are pure reads, it uses guards and scoping to keep temporaries from manipulating the counts at all, and it uses a thread-local ledger to coalesce (and often cancel out) the pending increments and decrements before finally applying the net increments at the end of the guard's life and deferring the decrements to the collector.
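A toy model of the ledger idea (all names hypothetical; raw addresses stand in for the managed objects): paired +1/-1 entries cancel inside the thread-local map, and only the surviving net deltas ever reach shared memory.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Thread-local ledger of pending refcount deltas, keyed by object address.
thread_local! {
    static LEDGER: RefCell<HashMap<usize, isize>> = RefCell::new(HashMap::new());
}

// Record a pending increment (+1) or decrement (-1). A +1 meeting a -1
// for the same object cancels out with zero shared-memory traffic.
fn record(addr: usize, delta: isize) {
    LEDGER.with(|l| {
        let mut map = l.borrow_mut();
        let entry = map.entry(addr).or_insert(0);
        *entry += delta;
        if *entry == 0 {
            map.remove(&addr);
        }
    });
}

// At the end of a guard's life: apply net increments eagerly, but hand net
// decrements to a collector so a reader never frees an object in-line.
fn flush(apply_inc: impl Fn(usize, isize), defer_dec: impl Fn(usize, isize)) {
    LEDGER.with(|l| {
        for (addr, delta) in l.borrow_mut().drain() {
            if delta > 0 {
                apply_inc(addr, delta);
            } else {
                defer_dec(addr, delta);
            }
        }
    });
}

fn main() {
    record(0x1000, 1);
    record(0x1000, -1); // cancels: the ledger is empty again
    record(0x2000, 1);
    flush(
        |a, d| println!("increment {a:#x} by {d}"),
        |a, d| println!("defer decrement of {a:#x} by {d}"),
    );
}
```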

jethrogb commented 4 years ago

https://github.com/crossbeam-rs/crossbeam/pull/564