crossbeam-rs / rfcs

RFCs for changes to Crossbeam
Apache License 2.0
147 stars 14 forks source link

Lock-free skip list map/set #27

Closed ghost closed 6 years ago

ghost commented 6 years ago

This is a proposal to introduce a new crate named crossbeam-skiplist, which would provide a concurrent map and a concurrent set similar to BTreeMap and BTreeSet.

Rendered Implementation

cc @withoutboats

Vtec234 commented 6 years ago

Relevant discussion: crossbeam-rs/crossbeam#109

Amanieu commented 6 years ago

I really like this! A concurrent ordered collection would be great to have!

I would suggest generalizing the seek function into lower_bound and upper_bound function. This uses the same Bound parameter as range, which allows you to specify whether you want an inclusive or exclusive bound (>= vs >).

Minor bikeshed: the Entry type should probably be renamed to Cursor since it matches the API of a cursor more than that of the standard Entry type.

Finally, it is not clear whether a SkipMap<T> requires T: Ord or not. Some methods are supported without Ord, but you can't do anything useful with such a SkipMap since there's no way to insert an element.

Amanieu commented 6 years ago

Is the reference counting absolutely necessary? I suspect this may have a more noticeable impact on ARM platforms where reference count updates involve memory barrier instructions. It would be ideal if read-only access to the SkipMap could be done without any barriers at all (except consume/acquire when dereferencing pointers).

withoutboats commented 6 years ago

I'm very excited about this!

The skiplist I wrote deadlocks and I haven't been able to investigate why. Assuming its just some bug in my implementation, I think there's an interesting trade off: my skiplist does not support remove, allowing it to have no reliance of ref counting or the crossbeam epoch.

Since it currently doesn't work I haven't benchmarked it, but if it turns out to be significantly faster, it would be great to see crossbeam support both the map/set proposed by this RFC and a monotonic version, so that users who don't need the ability to remove or update keys without synchronization can have better performance.

ghost commented 6 years ago

@Amanieu

I would suggest generalizing the seek function into lower_bound and upper_bound function.

Sounds good! And yes, we should also add the range method.

Minor bikeshed: the Entry type should probably be renamed to Cursor since it matches the API of a cursor more than that of the standard Entry type.

Maybe. I think of a Cursor as a pointer that may point to an entry in the map (i.e. cursors are nullable). On the other hand, an Entry sounds to me like a concrete entry in the map (not nullable).

Initially I used nullable Cursors in SkipMap just like you did in the intrusive_collections crate, but unwrapping those Option<&K> and Option<&V> coming from methods key and value got old fast. :( In the current Entry API, methods key and value return &K and &V, which is often more ergonomic and robust.

Perhaps we should have both Entry and Cursor, with an API like the following?

impl<'a, K, V> Cursor<'a, K, V> {
    fn is_null() -> bool;
    fn unwrap(self) -> Entry<'a, K, V>;
    fn into_entry(self) -> Option<Entry<'a, K, V>>;

    fn key(&self) -> Option<&K>;
    fn value(&self) -> Option<&V>;

    fn next(&mut self);
    fn prev(&mut self);
}

impl<'a, K, V> Entry<'a, K, V> {
    fn into_cursor(self) -> Cursor<'a, K, V>;

    fn key(&self) -> &K;
    fn value(&self) -> &V;

    fn get_next(&self) -> Cursor<'a, K, V>;
    fn get_prev(&self) -> Cursor<'a, K, V>;
}

Finally, it is not clear whether a SkipMap requires T: Ord or not. Some methods are supported without Ord, but you can't do anything useful with such a SkipMap since there's no way to insert an element.

Inserting anything into the map requires T: Ord, but methods like is_empty and len don't require the bound. The same is true for RBTree in intrusive_collections and BTreeMap, no?

Is the reference counting absolutely necessary? I suspect this may have a more noticeable impact on ARM platforms where reference count updates involve memory barrier instructions. It would be ideal if read-only access to the SkipMap could be done without any barriers at all (except consume/acquire when dereferencing pointers).

Absolutely necessary? No. But is it a good default? I think yes.

Methods like insert, get, and remove already perform a bunch of atomic operations, so an additional refcount increment/decrement for the returned Entry shouldn't matter much. But pinning and refcounting during iteration definitely has an effect on performance, as is mentioned in the RFC. The difference is 113 ms (with pinning and refcounting) versus 95 ms (without pinning and refcounting). The overhead is there, and on ARM machines it may be even larger. Let's assume that this difference is important for your use case.

In order to minimize the performance penalty of pinning and refcounting, we might add methods like the following:

impl<'a, K, V> Entry<'a, K, V> {
    // A single call to `scan_forward_while` or `scan_backward_while` pins the current
    // thread and updates refcounts every 100 steps only (or some other number).
    fn scan_forward_while(&self, f: impl FnMut(&K, &V) -> bool) -> Option<Entry<'a, K, V>>;
    fn scan_backward_while(&self, f: impl FnMut(&K, &V) -> bool) -> Option<Entry<'a, K, V>>;
}

The caveat is that calling the passed function f must not take too long, or else the current thread might be pinned for an unacceptably long time and thus stall epoch advancement. I'm not opposed to adding these methods, or something similar to them.

Refcounting is nice because it's easy to use and you really can't get it wrong. There is surely room for an advanced API providing faster iteration, and we can always add it later. I'm not yet sure what it should look like exactly - it's a tough nut to crack. Does that sound like a reasonable plan? Any other major problems with refcounting I'm missing?

I should also add - if you need really fast iteration, then perhaps skip list is the wrong data structure to begin with? Switching to a B-tree or a radix tree should offer much larger gains in performance than what can be achieved by avoiding pinning and reference counting in a skip list. In the future, Crossbeam should provide a variety of different kinds of maps and sets to cover situations like these.

Finally, note that there is a possibility to make iteration faster in the future by relying on hazard pointers rather than the current combination of pinning and refcounting. The idea is that instead of pinning the current thread and incrementing the refcount before moving to the next node we just update an internal hazard pointer (load + store + fence + load + store, I believe) Right now crossbeam-hazard doesn't exist so pinning and refcounting is the best we got.

@withoutboats

my skiplist does not support remove, allowing it to have no reliance of ref counting or the crossbeam epoch.

Not only that, methods like insert and get need to do fewer checks if there is no possibility of concurrent removes. :) While I don't have any hard numbers, my intuition is that by forbiding removal we can make insert/get/iteration 10-30% faster.

Any suggestions how we might expose both a skip list that allows and a skip list that forbids removal? How about the following?

trait RemovePolicy {}

struct RemoveAllow;
impl RemovePolicy for RemoveAllow {}
struct RemoveForbid;
impl RemovePolicy for RemoveForbid {}

struct SkipMap<K, V, RP: RemovePolicy = RemoveAllow> {
    // ...
}

impl<K, V, RP: RemovePolicy> SkipMap<K, V, RP> {
    // ...
}

impl<K, V> SkipMap<K, V, RemoveAllow> {
    fn remove(&self, k: &K) -> Option<Entry<K, V>> {
        // ...
    }
}

Or should we rely on const generics once they get implemented?

enum RemovePolicy {
    Allow,
    Forbid,
}

struct SkipMap<K, V, const RP: RemovePolicy: RemovePolicy::Allow> {
    // ...
}

Or just provide a completely different type for the no-remove skip list, perhaps GrowSkipMap<K, V>?

withoutboats commented 6 years ago

Or just provide a completely different type for the no-remove skip list, perhaps GrowSkipMap<K, V>?

I think this is the better choice, similar to how there's currently MsQueue and SegQueue.

Also its worth remember that the grow-only types are really only unsynchronized grow-only. They can support remove and get_mut and so on with &mut self methods. (So you could build the map in parallel and then pull items from it on a single reducer thread.)

Amanieu commented 6 years ago

Perhaps we should have both Entry and Cursor, with an API like the following?

No, this would duplicate the API too much. I'm OK with the current non-nullable entry API, I was just commenting on the naming.

The reason I used nullable cursors in intrusive-collections is because CursorMut::next can only be implemented by modifying the cursor in-place (returning a new CursorMut would tie the lifetime of the new cursor to the existing one, which is very inconvenient in practice). You don't have this problem since you don't have a CursorMut type.

Inserting anything into the map requires T: Ord, but methods like is_empty and len don't require the bound. The same is true for RBTree in intrusive_collections and BTreeMap, no?

This is not quite the case for RBTree: you can use it with non-Ord types, you just need to insert elements using CursorMut::insert_before and CursorMut::insert_after. The tree guarantees that it will preserve the order of the inserted elements. Ord is only needed when an actual binary search is required (find and insert without a predefined cursor position). Of course, if you insert elements "manually" then you are responsible for ensuring that the element ordering follows Ord (it's not unsafe if you don't, but the binary search will return garbage).

In order to minimize the performance penalty of pinning and refcounting, we might add methods like the following:

Actually what I had in mind was to simply have Entry hold a Guard, which will guarantee that elements won't disappear while the Entry exists. I understand that this can cause issues when Entry object are left alive for long periods, but this is what would work best for my use case since I will be holding Guards for relatively long periods at a time. Maybe have an API where you provide your own Guard to avoid ref-counting?

I should also add - if you need really fast iteration, then perhaps skip list is the wrong data structure to begin with? Switching to a B-tree or a radix tree should offer much larger gains in performance than what can be achieved by avoiding pinning and reference counting in a skip list. In the future, Crossbeam should provide a variety of different kinds of maps and sets to cover situations like these.

Yes definitely! I think that there are many use cases which are read-heavy and rarely modify data structures. In such cases I would even accept a data structure that requires a lock for writing if it can allow lock-free reads. I currently use a hash table with such a design (the implementation is very specialized though, and probably not suitable for crossbeam).

Vtec234 commented 6 years ago

We might want to tweak the API, bikeshed names, add new methods, etc. Let's discuss those less important matters in the new repo's issue tracker

I wouldn't call deciding on the API a less important matter, because the functionality that we want to support effectively decides the implementation and limits the performance, but I agree that we should decide on that in structure-specific threads.

For example, I am not so enthusiastic about reference-counting by default in the map that I would want, but I see that for other use cases it might be useful. The discussion here seems to be going in the direction of fairly detailed API design, with different people expressing the different requirements that they have.

In my opinion it would be better to provide multiple versions of the structure, either using different types or through const generics as you mentioned. These versions could include one which is a drop-in replacement for BTreeMap including the Entry API and the (albeit apparently small) performance penalties that come with it, as well as one which uses only the minimum of synchronisation required to provide a concurrent implementation of an abstract dictionary, having only a function that returns Clones of values.

I propose to make this RFC a kind of meta-design RFC instead. For that purpose, the API snippet as well as the detailed design section should be removed, preserving just sections describing the motivation and overall idea of a concurrent skiplist. Based on such an RFC we could create a repository like crossbeam-skiplist, and then decide on specific implementations in subsequent RFCs. The reason I would like to do it in this order is that the overall organisation of crates is much less mutable than the design details, so it should be done first.

Related is the question of whether we want a crate like crossbeam-skiplist with Sets and Maps based on a skiplist or crates like crossbeam-map putting together all maps, crossbeam-set with sets, etc. Basically, whether to organise based on implementations or on interfaces. Personally I prefer the former, since crossbeam-map would have to depend both on a skiplist and a hash array, which are quite different, and we can expose structures in a neat way based on the interface through crossbeam.

What do you think?

ghost commented 6 years ago

@Amanieu

This is not quite the case for RBTree: you can use it with non-Ord types, you just need to insert elements using CursorMut::insert_before and CursorMut::insert_after

I missed that - sounds interesting! Well, I don't see a promising way to implement insert_before and insert_after in SkipMap without requiring T: Ord. The problem is that the state of an in-progress insert operation could be messed up by another concurrent insert or remove. In such cases, we have to restart the search (searching requires T: Ord) and try inserting again.

@Vtec234

I wouldn't call deciding on the API a less important matter, because the functionality that we want to support effectively decides the implementation and limits the performance, but I agree that we should decide on that in structure-specific threads.

Agreed - if there are any concerning problems in the current design that might preclude us from harnessing more performance or extending the API in the future, we should try to resolve them now.

In my opinion it would be better to provide multiple versions of the structure, either using different types or through const generics as you mentioned.

Agreed. So if I understood correctly, you're a fan of Clone-based API. The idea is that one can always use Arc<V> in place of V, thus getting the benefits of reference counting by simply cloning those Arc<V>s. There's the downside of Arc adding one more level of indirection, but otherwise seems like a reasonable idea.

I believe there shouldn't be much contention between my refcounting, Amanieu's pinning, and your cloning. We can make everyone happy at the same time. :) Here's how we might do that (not a real proposal, just a quick idea):

impl<K: Ord, V> SkipMap<K, V> {
    fn insert(&self, k: K, v: V);
    fn insert_and_get(&self, k: K, v: V) -> Entry<K, V>;
    fn insert_and_pin(&self, k: K, v: V) -> Guard<K, V>;

    fn get(&self, k: &K) -> Option<Entry<K, V>>;
    fn get_clone(&self, k: &K) -> Option<V> where V: Clone;
    fn get_pin(&self, k: &K) -> Option<Guard<K, V>>;

    fn remove(&self, k: &K) -> Option<Entry<K, V>>;
    fn remove_clone(&self, k: &K) -> Option<V> where V: Clone;
    fn get_pin(&self, k: &K) -> Option<Guard<K, V>>;

    fn pin(&self) -> PinnedSkipMap<K, V>;
}

impl<'a, K: Ord, V> PinnedSkipMap<'a, K, V> {
    fn insert(&self, k: K, v: V) -> &V;
    fn get(&self, k: &K) -> Option<&V>;
    fn remove(&self, k: &K) -> Option<&V>;

    fn repin(&mut self);
}

There's a problem with the current design in that how it manages the life of values in the map. The way it works now is - before reading the value in a node, you must increment the refcount (and incrementing is possible only if the count is non-zero). I should abandon this mechanism and make it possible to read values even if the refcount is zero. That means values would be dropped by the epoch GC rather than by refcounting.

Note that some refcounting is still necessary during insert and remove - we have to count in how many levels is each node installed into the skip list. Without that, a broken Ord implementation might cause memory unsafety. But keep in mind that this kind of refcounting has a negligible cost during insertion and removal, while searching and iteration are not affected by it at all. We could eliminate even that cost by introducing TrustedOrd trait, but that's probably pushing it too far for questionable benefits... :)

Ok, so if we get rid of pervasive refcounting when reading values and expand the API by accommodating pinning and cloning, would that be it? Any other concerns with the current design?

I propose to make this RFC a kind of meta-design RFC instead.

Yes - that's exactly how I think about the RFC, too. Tomorrow I'll add to the text the fruits of this discussion and try to reword it to make the scope of the RFC clearer.

Related is the question of whether we want a crate like crossbeam-skiplist with Sets and Maps based on a skiplist or a crate like crossbeam-map putting together all maps and crossbeam-set, putting together sets.

I'm for grouping by data structures rather than interfaces. Agreed with your reasoning.

jeehoonkang commented 6 years ago

Finally, note that there is a possibility to make iteration faster in the future by relying on hazard pointers rather than the current combination of pinning and refcounting. The idea is that instead of pinning the current thread and incrementing the refcount before moving to the next node we just update an internal hazard pointer (load + store + fence + load + store, I believe) Right now crossbeam-hazard doesn't exist so pinning and refcounting is the best we got.

FYI, I'm working on a proof-of-concept implementation of EBR + hazards: https://github.com/jeehoonkang/crossbeam-epoch/tree/hazard-pointer


I agree with @Vtec234 on characterizing this RFC: let's regard is as a meta one. But I think we can start crossbeam-skiplist with @stjepang's implementation. Let's regard it as a tentative one.

Vtec234 commented 6 years ago

Note that some refcounting is still necessary during insert and remove - we have to count in how many levels is each node installed into the skip list.

This "refcount" is just the dynamic height of the tower, right? Although it has been some time since I wrote it, I vaguely recall that in my own implementation it was enough to have a value for the tower height which is constant across its lifetime, and enforce the order of unlinking tower levels to go from top to the bottom. Then, we know that if the bottom level is unlinked (marked with tag 1), the whole node has been unlinked and so it can be removed, no refcount needed.

before reading the value in a node, you must increment the refcount (and incrementing is possible only if the count is non-zero). I should abandon this mechanism and make it possible to read values even if the refcount is zero. That means values would be dropped by the epoch GC rather than by refcounting.

Is my understanding correct that this would eliminate the overhead of refcounting almost completely while still enabling the Entry API? If so, this would be a really nice implementation (although it's very nice either way :)

I believe there shouldn't be much contention between my refcounting, Amanieu's pinning, and your cloning. We can make everyone happy at the same time. :)

I have to admit I don't like the idea of having three different ways of accessing elements in one structure - it seems like a mess that would lead to lots of confusion. So for now..

But I think we can start crossbeam-skiplist with @stjepang's implementation. Let's regard it as a tentative one.

Agreed. Let's first publish this implementation, providing a drop-in API for BTreeMap. If it turns out that the refcount overhead is unavoidable, we could then provide another struct implementing the Clone and with_element(f: Fn(&V)) API without refcounting. Thoughts?

ghost commented 6 years ago

This "refcount" is just the dynamic height of the tower, right? Although it has been some time since I wrote it, I vaguely recall that in my own implementation it was enough to have a value for the tower height which is constant across its lifetime, and enforce the order of unlinking tower levels to go from top to the bottom. Then, we know that if the bottom level is unlinked (marked with tag 1), the whole node has been unlinked and so it can be removed, no refcount needed.

I'm not sure I understand. That sounds correct... provided that the Ord implementation is correct.

I'll try to elaborate. There are two numbers associated with each node. The first is height, which is always constant. The second is refcount, which is the number of levels the node is linked in + the number of Entrys (i.e. cursors) pointing to the node.

In order to remove a node, we have to go through two stages: marking pointers in its tower and then unlinking the node in each level. Both stages go from the top level to the bottom level (level 0). Only after the node has been unlinked from each level we can add it to the GC.

When unlinking a node at a specific level, we perform a CAS in the tower of the predecessor to change its pointer to point to the successor. If the CAS fails, we have to search again for the node that is being removed so that the search function unlinks the node along the way. While moving through a level, the search function unlinks every encountered node with a marked pointer at the current level.

Now, the problem arises from the fact that the Ord implementation might be incorrect. When you call the search function to unlink the current node, you cannot really know if the node was unlinked or not. Consider: the search function might result with "the node was not found", to which you might say "great, that means we can finally add the node to the GC". But! - the search function might have lied to you since the Ord implementation could be incorrect. We cannot trust it! :)

The only way to really know whether a node is linked into the skip list at a some level is by having the search function do something after unlinking a node. When the search function unlinks a node, it should either mark the pointer at the current level in the unlinked node, or decrement the node's refcount, or do something else. But in any case, it has to do something - some atomic value in the unlinked node has to be updated. In my implementation, that is what refcount is for. When the refcount reaches zero, that means the node has been unlinked from all levels and there are no Entrys pointing to it.

Is my understanding correct that this would eliminate the overhead of refcounting almost completely while still enabling the Entry API?

Yes, it would. :)

You can think of refcounting as some minimal, but necessary bookkeeping required to ensure memory safety (because we cannot trust Ord). Incidentally, this refcounting mechanism can also used for keeping track of the number of Entrys, so it's just a neat trick that lets us have one counter for both purposes.

If it turns out that the refcount overhead is unavoidable, we could then provide another struct implementing the Clone and with_element(f: Fn(&V)) API without refcounting.

Sure, why not! I've just tweaked the destruction mechanism so that you don't have to increment the refcount before reading values anymore. In other words, now we can easily add with_element. I hope SkipList (in base.rs) is now general enough for experimenting with all kinds of APIs. Note that SkipMap and SkipSet are just simple wrappers around SkipList.

jeehoonkang commented 6 years ago

I have a few suggestions and questions on minor details in the implementation:

ghost commented 6 years ago

On the order of fields in Node: Why not key, ref_and_height, value, and then pointers? This way it is guaranteed that key and ref_and_height will be in the same cacheline, if their size is fit.

It's just a matter of tradeoffs - I don't have a particularly convincing argument for the choice of ordering, but it seemed to produce the best results overall during testing. The main argument is: if we want to optimize insert/search/delete operations, then key and pointers should be closer in memory (that's why value is listed first in my ordering). Consider what happens during a search operation: as we traverse the skip list, we're basically only reading pointers to jump from one node to another and comparing keys to decide where to jump next. Having key and pointers close will improve cache locality.

How about just defining MAX_HEIGHT as (1 << HEIGHT_BITS)?

Sure - I'll do that!

Use of checked_add: Practically, is it necessary?

Not really... I just felt too nervous doing unchecked arithmetic.

Why size_in_u64s, not size_in_u32s for 32-bit architectures?

Eh, just didn't bother with that. :) All this funny allocation business is just a workaround until we get Heap.alloc. Hopefully soon.

base.rs#L145: I think the condition should be (!refs_and_height) >> HEIGHT_BITS == 0.

Hmmm, almost? :) I don't understand why the ! operator is used here. But refs_and_height >> HEIGHT_BITS == 0 would work just fine and be equivalent to refs_and_height & !HEIGHT_MASK == 0.

Use of epoch::unprotected(): I think it'd be much helpful if there is a line of comment or two explaining why they are safe to use.

Done.

I don't understand the comments: "TODO(stjepang): Embed a custom crossbeam_epoch::Collector here. If needs_drop::() or needs_drop::(), create a custom collector, otherwise use the default one. Then we can also remove the K: Send + 'static and V: Send + 'static bounds." Would you please explain it a little bit more?

Sure. I've changed the comment to:

// TODO(stjepang): Embed a custom `epoch::Collector` inside `SkipList<K, V>`. Instead of adding
// garbage to the default global collector, we should add it to a local collector tied to the
// particular skip list instance.
//
// Since global collector might destroy garbage arbitrarily late in the future, some skip list
// methods have `K: 'static` and `V: 'static` bounds. But a local collector embedded in the skip
// list would destroy all remaining garbage when the skip list is dropped, so in that case we'd be
// able to remove those bounds on types `K` and `V`.
//
// As a further future optimization, if `!mem::needs_drop::<K>() && !mem::needs_drop::<V>()`
// (neither key nor the value have destructors), there's no point in creating a new local
// collector, so we should simply use the global one.

fn random_height(): it reads from and writes to seed with the Relaxed ordering. Due to weak-memory effect, I think num will not be a good random number. Also, the implementation assumes that MAX_HEIGHT <= 32, right?

Yes. ConcurrentSkipListMap in OpenJDK 7 does the same thing (see [here]), although in version 8 it uses a thread-local random number generator. The imperfection in RNG quality probably doesn't matter, although I'm open to finding a better solution here. Just keep in mind that we really must have a fast custom RNG, otherwise its cost will show up in benchmarks.

base.rs#472: I think it is enough to: (*n).refs_and_height.store(2 << HEIGHT_BITS, Ordering::Relaxed);

That would set the height to 0. The proper store operation would be: (*n).refs_and_height.store((2 << HEIGHT_BITS) | (height - 1), Ordering::Relaxed)

ghost commented 6 years ago

Sorry for the delay, everyone. I've just added a section on alternative interfaces and an explanation that the current map/set API is only tentative. Is there anything still missing from the RFC?

Vtec234 commented 6 years ago

Consider: the search function might result with "the node was not found", to which you might say "great, that means we can finally add the node to the GC". But! - the search function might have lied to you since the Ord implementation could be incorrect. We cannot trust it! :)

I see now what you mean, thanks. In my implementation this was fine, since I just hashed everything, and Ord on usize can be trusted, but it is indeed possible that a really broken Ord fails to unlink a node.

ghost commented 6 years ago

@Amanieu and @jeehoonkang - do you still have any reservations with this RFC or can we merge? :)

jeehoonkang commented 6 years ago

Sorry, I've been busy these days. I'll look into the code once more tomorrow. Sorry!

Amanieu commented 6 years ago

I've been considering the API from the no_std (with alloc for memory allocation) perspective where you don't have a global Collector or any form of thread-local storage (or even thread IDs).

The only suitable API which fits in those constraints is one where the user managers collectors and handles themselves and passes a Guard as a parameter to every skiplist operation.

I understand that this type of API may not be very user-friendly, so I won't complain if you decide to stick with a simpler API. However I will probably end up using a fork of your skiplist which uses a Guard-based API.

ghost commented 6 years ago

@Amanieu I suggest we make crosbeam_skiplist::base a public module with an unopinionated no_std-compatible skip list so that anyone can simply wrap it in whatever interface they need (or just use directly). That probably means every method would take a &Guard.

Let's hash out the details in the new repo's issue tracker.