crossbeam-rs / rfcs

RFCs for changes to Crossbeam

Custom collector in Deque #23

Open ghost opened 6 years ago

ghost commented 6 years ago

In PR #21, @jeehoonkang and @Vtec234 pointed out that it should be possible to specify a custom Collector in the deque constructor (Deque::new).

This is not a high-priority issue at the moment, but let's kick off a discussion - how would we go about implementing such a feature?

ghost commented 6 years ago

Here's one idea off the top of my head.

Custom collectors remind me very much of custom allocators (Alloc trait), custom thread pools (ThreadPool in Rayon), custom panic handlers, etc. There is usually an implicit, global default that covers 99% of use cases, but sometimes you still want to specify a custom one.

We could introduce a GetHandle trait (similar to Alloc trait) with a default implementor (DefaultGetHandle is similar to Heap):

trait GetHandle {
    fn get_handle(&self) -> Handle;
}

struct DefaultGetHandle;

impl GetHandle for DefaultGetHandle {
    fn get_handle(&self) -> Handle {
        epoch::handle()
    }
}

Then we can implement Deque and Stealer like this:

struct Deque<T, GH: GetHandle = DefaultGetHandle> {
    // ...
    get_handle: GH,
}

impl<T, GH: GetHandle> Deque<T, GH> {
    fn with_collector(get_handle: GH) -> Deque<T, GH> {
        Deque {
            // ...
            get_handle,
        }
    }
}

impl<T, GH: GetHandle> Stealer<T, GH> {
    fn steal(&self) -> Steal<T> {
        let handle = self.deque.get_handle.get_handle();
        let guard = handle.pin();

        // ...
    }
}

Finally, let's create a custom collector and give it to a newly constructed Deque:

lazy_static! {
    static ref MY_COLLECTOR: Collector = Collector::new();
}

thread_local! {
    static MY_HANDLE: Handle = MY_COLLECTOR.handle();
}

struct MyGetHandle;

impl GetHandle for MyGetHandle {
    #[inline]
    fn get_handle(&self) -> Handle {
        MY_HANDLE.with(|h| h.clone())
    }
}

fn main() {
    let d: Deque<i32, _> = Deque::with_collector(MyGetHandle);
}

Vtec234 commented 6 years ago

Such a generic interface is definitely possible and could be a future extension, but what I meant for now is simply allowing a fn Deque::new(c: Collector). This would help if a user wants to, for example, have two deques be independent in terms of memory usage and recollection.

One complication is that if we make Handles non-Send, the Stealers cannot just hold them internally. One thing that would work is to make Collector implement Clone - it is just an Arc after all - and have the Stealers contain the Collector. That is a bit bizarre though, since they would then have to either use the Collector directly (not possible with the current API) or make a new Handle on each use, which would be terribly slow. Maybe this is in fact an argument for keeping Send on Handle (option 2 in #20).

The other solution is to somehow introduce thread-local per-deque storage, but that seems really complicated. This also relates to the question of whether using a custom collector absolutely requires manual TLS like in the example above, or whether it should be possible to simply do let c = Collector::new(); b = ConcurrentStructure::with_collector(c);.

ghost commented 6 years ago

@Vtec234

Guards are currently preventing Handles from being Send. I think we could still allow Handles to be Send, but with a small adjustment to the current API: we make the function Handle::pin unsafe.

The safety contract of the unsafe Handle::pin would be: you're allowed to call this method to obtain a Guard, but only if you can guarantee that the handle will not be sent to another thread during the lifetime of the guard.

Note that the global function pin would still be marked as safe because the thread-local HANDLE is private and never gets sent to another thread.
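
As a rough sketch (the doc-comment wording is mine, not settled API), the signature could look like this:

impl Handle {
    /// Pins the current thread.
    ///
    /// # Safety
    ///
    /// The caller must guarantee that this handle is not sent to another
    /// thread while the returned guard (or any other guard obtained from
    /// this handle) is alive.
    pub unsafe fn pin(&self) -> Guard {
        // ...
    }
}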

Another option would be to add a lifetime to guards: Guard<'a>. Then pinning would be defined as:

pub fn pin() -> Guard<'static> {
    // ...
}

impl Handle {
    pub fn pin<'a>(&'a self) -> Guard<'a> {
        // ...
    }
}

This way the type system would prevent sending a Handle to another thread if you're holding a guard to it.
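
As a hypothetical illustration, assuming the lifetimed API above and the Collector/Handle constructors from the earlier example:

let collector = Collector::new();
let handle = collector.handle();
let guard = handle.pin(); // `guard` borrows `handle` for its whole lifetime

// While `guard` is alive, moving `handle` to another thread is a compile
// error, because `handle` is still borrowed:
//
//     std::thread::spawn(move || drop(handle));

drop(guard); // once the guard is gone, `handle` can be moved again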

In any case, if we make Handles Send again, another complication would be that cloning a Handle wouldn't create multiple references to the same underlying handle (cloning would probably create brand-new handles). Then we'd have to rethink the signature of the global default_handle() function.

Vtec234 commented 6 years ago

@stjepang: I see. If the only thing preventing Handles from being Send is the existence of Guards, maybe we could have a SendHandle wrapper that contains a Handle inside it and is Send, but only allows usage through the .with(f: Fn(Guard)) API. Do you think that would work? I also like the lifetime idea.

With such a structure present (one that allows effectively sending a Handle), it could then be a member in Stealers and allow for runtime (not type-level) custom Collectors in Deque. One issue is that having it present within the Stealer creates some additional overhead.

ghost commented 6 years ago

> I see. If the only thing preventing Handles from being Send is the existence of Guards, maybe we could have a SendHandle wrapper that contains a Handle inside it and is Send, but only allows usage through the .with(f: Fn(Guard)) API. Do you think that would work?

Yes, but the with method would have to accept a Fn(&Guard), otherwise a guard might be moved outside the closure.
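
A minimal sketch of such a wrapper, assuming that an escaping Guard really is the only thing keeping Handle from being Send (all names here are hypothetical):

struct SendHandle {
    handle: Handle,
}

// Assumption: the closure-based API below guarantees that no `Guard`
// can outlive the call, which is exactly what made `Handle` non-`Send`.
unsafe impl Send for SendHandle {}

impl SendHandle {
    fn with<R, F: FnOnce(&Guard) -> R>(&self, f: F) -> R {
        let guard = self.handle.pin();
        f(&guard)
        // `guard` is dropped here, so it never escapes this call.
    }
}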

> One issue is that having it present within the Stealer creates some additional overhead.

That's okay - I don't think it's a big deal having it within the Stealer.

One thing to keep in mind when we talk about concurrent data structures is that they're much more rarely used than single-threaded data structures like Vec, String, or HashMap. It's completely normal to have millions of String instances in a program, but that surely won't happen with concurrent deques (there typically won't be more than a few instances). We shouldn't worry too much about the performance impact of things like constructors, destructors, or per-instance overhead.


Another thing I'm worried about is that giving a single Handle to each Stealer works very well for concurrent deques, but it doesn't for all the other data structures. What about concurrent queues or maps? How do we specify a custom collector for them?

My feeling is still that we should think of custom collectors not as something that is casually set up and torn down like in our unit tests, but instead as something that is typically set up at process startup and torn down at process termination. In that sense, collectors are a lot like allocators - and perhaps we should strive to come up with a similar interface to the Alloc trait?

cc @jeehoonkang

ghost commented 6 years ago

Here's yet another proposition: When pinning, let's access Local through the global COLLECTOR instead of the thread-local HANDLE.

When you do HANDLE.with(|handle| ...), accessing the thread local will call __tls_get_addr. See generated assembly in release mode at this playground link. Internally, __tls_get_addr identifies the current thread to obtain an address pointing at this thread's TLS, and finally offsets into the table to access the handle. For more details, check out this article. By my measurements, accessing a thread-local alone costs 2 ns.

When registering a new Local, let's assign this thread an index, which is the smallest index unused by all currently running threads. For example, the first thread to call pin would get index 0, the second would get index 1, and so on. When a Local is destroyed, we free the index by storing it into a global variable freed: Mutex<Vec<usize>>, so that another thread can reuse it.
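
A sketch of that bookkeeping (all names here are hypothetical; note that it reuses any freed index rather than strictly the smallest one):

use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

lazy_static! {
    static ref FREED: Mutex<Vec<usize>> = Mutex::new(Vec::new());
}

/// Called when a new `Local` is registered: reuse a freed index if one
/// is available, otherwise hand out a fresh one.
fn acquire_index() -> usize {
    if let Some(index) = FREED.lock().unwrap().pop() {
        index
    } else {
        NEXT_INDEX.fetch_add(1, Ordering::Relaxed)
    }
}

/// Called when a `Local` is destroyed: make its index available for reuse.
fn release_index(index: usize) {
    FREED.lock().unwrap().push(index);
}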

Now let's have a table inside Global, which maps indices to Locals.

struct Table {
    // Slots holding pointers to `Local`s, indexed by thread index.
    table: Box<[AtomicPtr<Local>; 1000]>,
    // The next block in the linked list, for indices >= 1000.
    next: AtomicPtr<Table>,
}

When registering a new Local, we also store into the table a pointer to the new Local at the current thread's index. Note that the table is a linked list (consisting of blocks of 1000 elements) so that we can store more than 1000 elements in it.
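
For instance, looking up a slot could walk the blocks like this (a sketch; the slot method name is hypothetical and allocation of missing blocks is glossed over):

impl Table {
    /// Returns the slot for `index`, following `next` pointers through
    /// the 1000-element blocks as needed.
    fn slot(&self, index: usize) -> &AtomicPtr<Local> {
        if index < 1000 {
            &self.table[index]
        } else {
            let next = self.next.load(Ordering::Acquire);
            assert!(!next.is_null(), "next block must already be allocated");
            unsafe { (*next).slot(index - 1000) }
        }
    }
}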

Now, to pin the current thread, instead of accessing the thread-local HANDLE, let's do the following. First, we access the thread-local INDEX to get the current thread's index (it is lazily initialized). Then, we access the table in the global COLLECTOR and obtain the Local that is stored at the current thread's index. Finally, we pin the obtained Local.
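
Putting it together, the pinning path could look roughly like this (a sketch with hypothetical accessors on the global COLLECTOR; acquire_index and slot come from the sketches above, and registering a brand-new Local on first use is omitted):

thread_local! {
    // Lazily initialized: the first pin on this thread acquires an index.
    static INDEX: usize = acquire_index();
}

pub fn pin() -> Guard {
    INDEX.with(|&index| {
        // Find this thread's `Local` in the global collector's table.
        let slot = COLLECTOR.global.table.slot(index);
        let local = slot.load(Ordering::Acquire);
        // Pin the obtained `Local`.
        unsafe { (*local).pin() }
    })
}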

This idea seems like it would be too slow, but it's not! I've actually managed to code an ugly prototype that matches the pinning performance of our current implementation (both are 12 ns).

The main benefit of this approach is that we don't need Handles anymore. The whole Collector interface can look like this:

struct Collector { ... }

impl Collector {
    fn new() -> Self;

    fn pin(&self) -> Guard;

    fn is_pinned(&self) -> bool;
}

impl Clone for Collector { ... }

To pin the current thread, you can just do collector.pin(). :) Using this method, specifying a custom collector for each data structure would be trivial.
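
Under that interface, giving a data structure its own collector could be as simple as the following sketch (with_collector is hypothetical here, not an existing constructor):

fn main() {
    // A collector shared by two deques, independent of the global one.
    let collector = Collector::new();

    let a = Deque::<i32>::with_collector(collector.clone());
    let b = Deque::<i32>::with_collector(collector.clone());

    // Internally, each operation would pin via `self.collector.pin()`.
    a.push(1);
    b.push(2);
}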

jeehoonkang commented 6 years ago

I very much like @stjepang's handle-less proposal! But I'm concerned about portability. Can we call __tls_get_addr in stable Rust?

ghost commented 6 years ago

@jeehoonkang

__tls_get_addr comes from glibc. Other platforms have different, but similar implementations of TLS.

We don't have to call __tls_get_addr directly. We only need a function that returns the current thread ID as a number of type usize. For that, we can just call thread_id::get() from the thread-id crate, which works on all Unixes, Windows, and Redox.
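
For reference, a minimal example of that crate's API:

extern crate thread_id;

fn main() {
    // An integer that is unique to the calling thread while it is alive.
    let id: usize = thread_id::get();
    println!("current thread id: {}", id);
}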

Sorry for the vague explanation. I haven't fully solidified the whole idea yet, but I'd like to implement a prototype in a few weeks so that we can experiment with it. But first I'd like to push forward with crossbeam-deque and crossbeam-channel, and then come back to custom collectors.