ghost opened this issue 6 years ago
Here's one idea off the top of my head.
Custom collectors remind me very much of custom allocators (the Alloc trait), custom thread pools (ThreadPool in Rayon), custom panic handlers, etc. There is usually an implicit global default that covers 99% of use cases, but sometimes you still want to specify a custom one.
We could introduce a GetHandle trait (similar to the Alloc trait) with a default implementor (DefaultGetHandle is similar to Heap):
trait GetHandle {
    fn get_handle(&self) -> Handle;
}

struct DefaultGetHandle;

impl GetHandle for DefaultGetHandle {
    fn get_handle(&self) -> Handle {
        epoch::handle()
    }
}
Then we can implement Deque and Stealer like this:
struct Deque<T, GH: GetHandle = DefaultGetHandle> {
    // ...
    get_handle: GH,
}

impl<T, GH: GetHandle> Deque<T, GH> {
    fn with_collector(get_handle: GH) -> Deque<T, GH> {
        Deque {
            // ...
            get_handle,
        }
    }
}

impl<T, GH: GetHandle> Stealer<T, GH> {
    fn steal(&self) -> Steal<T> {
        let handle = self.deque.get_handle.get_handle();
        let guard = handle.pin();
        // ...
    }
}
Finally, let's create a custom collector and give it to a newly constructed Deque:
lazy_static! {
    static ref MY_COLLECTOR: Collector = Collector::new();
}

thread_local! {
    static MY_HANDLE: Handle = MY_COLLECTOR.handle();
}

struct MyGetHandle;

impl GetHandle for MyGetHandle {
    #[inline]
    fn get_handle(&self) -> Handle {
        MY_HANDLE.with(|h| h.clone())
    }
}

fn main() {
    let d = Deque::with_collector(MyGetHandle);
}
Such a generic interface is definitely possible and could be a future extension, but what I meant for now is simply allowing a fn Deque::new(c: Collector). This would help if a user wants to, for example, have two deques be independent in terms of memory usage and reclamation.
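For concreteness, a minimal sketch of that simpler, non-generic constructor might look like the following; the fields are placeholders for illustration, not the actual crossbeam-deque internals, and only Collector::new() and Collector::handle() from the examples above are assumed:

// Sketch only: field layout is made up for illustration.
struct Deque<T> {
    buffer: Vec<T>,          // stand-in for the real lock-free buffer
    collector: Collector,    // owns the epoch state for this deque
    handle: Handle,          // worker-side handle into `collector`
}

impl<T> Deque<T> {
    fn new(collector: Collector) -> Deque<T> {
        let handle = collector.handle();
        Deque {
            buffer: Vec::new(),
            collector,
            handle,
        }
    }
}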
One complication is that if we make Handles non-Send, the Stealers cannot just hold them internally. One thing that would work is to make Collector cloneable - it is just an Arc after all - and have the Stealers contain the Collector. That is a bit bizarre though, since they would then have to use the Collector directly (not possible with the current API) or create a new Handle on each use - terribly slow. Maybe this is in fact an argument for keeping Send on Handle (option 2 in #20). The other solution is to somehow introduce thread-local per-deque storage, but that seems really complicated. This also relates to the question of whether using a custom collector absolutely requires manual TLS like in the example above, or whether it should be possible to simply do let c = Collector::new(); let b = ConcurrentStructure::with_collector(c);.
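To make that trade-off concrete, here is a rough sketch of the "Stealer holds a Collector" variant; the fields are placeholders, and paying for Collector::handle() on every steal is exactly the slow path described above:

// Sketch only: the Stealer clones the Collector (it is just an Arc)
// and creates a brand new Handle on every steal - the slow option.
struct Stealer<T> {
    deque: Arc<Deque<T>>,    // placeholder for the shared deque state
    collector: Collector,
}

impl<T> Stealer<T> {
    fn steal(&self) -> Steal<T> {
        let handle = self.collector.handle(); // new handle on each call
        let guard = handle.pin();
        // ... read the top element of the deque under `guard` ...
    }
}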
@Vtec234 Guards are currently preventing Handles from being Send. I think we could still allow Handles to be Send, but with a small adjustment to the current API: we make the function Handle::pin unsafe.
The unsafe contract of Handle::pin would be: you're allowed to call this method to obtain a Guard, but only if you can guarantee that the handle will not be sent to another thread during the lifetime of the guard.
Note that the global function pin would still be marked as safe because the thread-local HANDLE is private and never gets sent to another thread.
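A sketch of what that split might look like, with the bodies elided; the names follow the discussion above:

impl Handle {
    /// Safety: the caller must guarantee that this handle is not sent to
    /// another thread while the returned guard is alive.
    pub unsafe fn pin(&self) -> Guard {
        // ...
    }
}

// The free function stays safe: HANDLE is a private thread-local that is
// never sent across threads, so the contract above is upheld internally.
pub fn pin() -> Guard {
    HANDLE.with(|handle| unsafe { handle.pin() })
}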
Another option would be to add a lifetime to guards: Guard<'a>. Then pinning would be defined as:
pub fn pin() -> Guard<'static> {
    // ...
}

impl Handle {
    pub fn pin<'a>(&'a self) -> Guard<'a> {
        // ...
    }
}
This way the type system would prevent sending a Handle to another thread if you're holding a guard to it.
In any case, if we make Handles Send again, another complication would be that cloning a Handle wouldn't create multiple references to the same underlying handle (cloning would probably create brand new handles). Then we'd have to rethink the signature of the global default_handle() function.
@stjepang:
I see. If the only thing preventing Handles from being Send is the existence of Guards, maybe we could have a SendHandle wrapper that contains a Handle inside it and is Send, but only allows usage through a .with(f: Fn(Guard)) API. Do you think that would work? I also like the lifetime idea.
With such a structure present (one that allows effectively sending a Handle), it could then be a member of the Stealers and allow for runtime (not type-level) custom Collectors in Deque. One issue is that having it present within the Stealer creates some additional overhead.
> I see. If the only thing preventing Handles from being Send is the existence of Guards, maybe we could have a SendHandle wrapper that contains a Handle inside it and is Send, but only allows usage through the .with(f: Fn(Guard)) API. Do you think that would work?
Yes, but the with method would have to accept a Fn(&Guard), otherwise a guard might be moved outside the closure.
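A minimal sketch of such a wrapper, assuming the wrapped Handle is never exposed and the closure only ever sees a &Guard (the unsafe impl Send is precisely the claim being debated here, not an established fact):

struct SendHandle {
    handle: Handle,
}

// Claimed to be sound only because the Handle never escapes and no Guard
// can be moved out of `with` - this is the assumption under discussion.
unsafe impl Send for SendHandle {}

impl SendHandle {
    fn with<F, R>(&self, f: F) -> R
    where
        F: Fn(&Guard) -> R,
    {
        let guard = self.handle.pin();
        f(&guard)
    }
}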
> One issue is that having it present within the Stealer creates some additional overhead.
That's okay - I don't think it's a big deal having it within the Stealer.
One thing to keep in mind when we talk about concurrent data structures is that they're much more rarely used than single-threaded data structures like Vec, String, or HashMap. It's completely normal to have millions of String instances in a program, but that surely won't happen with concurrent deques (there typically won't be more than a few instances). We shouldn't worry too much about the performance impact of things like constructors, destructors, or per-instance overhead.
Another thing I'm worried about is that giving a single Handle to each Stealer works very well for concurrent deques, but it doesn't for all the other data structures. What about concurrent queues or maps? How do we specify a custom collector for them?
My feeling is still that we should think of custom collectors not as something that is casually set up and torn down like in our unit tests, but instead as something that is typically set up at process startup and torn down at process termination. In that sense, collectors are a lot like allocators - and perhaps we should strive to come up with an interface similar to the Alloc trait?
cc @jeehoonkang
Here's yet another proposition: when pinning, let's access Local through the global COLLECTOR instead of the thread-local HANDLE.
When you do HANDLE.with(|handle| ...), accessing the thread-local will call __tls_get_addr. See the generated assembly in release mode at this playground link. Internally, __tls_get_addr identifies the current thread to obtain an address pointing at this thread's TLS, and finally offsets into the table to access the handle. For more details, check out this article. By my measurements, accessing a thread-local alone costs 2 ns.
When registering a new Local, let's assign this thread an index, which is the smallest index unused by all currently running threads. For example, the first thread to call pin would get index 0, the second would get index 1, and so on. When a Local is destroyed, we free the index by storing it into a global variable freed: Mutex<Vec<usize>>, so that another thread can reuse it.
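A hedged sketch of that index allocator (NEXT_INDEX, acquire_index, and release_index are made-up names; also note that popping from a Vec reuses some freed index rather than strictly the smallest one, matching the freed: Mutex<Vec<usize>> described above):

use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};

lazy_static! {
    // Indices returned by threads whose `Local` was destroyed.
    static ref FREED: Mutex<Vec<usize>> = Mutex::new(Vec::new());
}

// Monotonic counter for handing out indices never used before.
static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

/// Reuse a freed index if one is available, otherwise mint a fresh one.
fn acquire_index() -> usize {
    if let Some(index) = FREED.lock().unwrap().pop() {
        index
    } else {
        NEXT_INDEX.fetch_add(1, Ordering::Relaxed)
    }
}

/// Return an index to the free list when a thread's `Local` is destroyed.
fn release_index(index: usize) {
    FREED.lock().unwrap().push(index);
}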
Now let's have a table inside Global, which maps indices to Locals.
struct Table {
    // One slot per thread index; slot i holds a pointer to the Local
    // registered by the thread that currently owns index i.
    table: Box<[AtomicPtr<Local>; 1000]>,
    // Pointer to the next block, so the table can grow past 1000 entries.
    next: AtomicPtr<Table>,
}
When registering a new Local, we also store into the table a pointer to the new Local at the current thread's index. Note that the table is a linked list (consisting of blocks of 1000 elements), so that we can store more than 1000 elements in it.
Now, to pin the current thread, instead of accessing the thread-local HANDLE, let's do the following. First, we access the thread-local INDEX to get the current thread's index (it is lazily initialized). Then, we access the table in the global COLLECTOR and obtain the Local stored at the current thread's index. Finally, we pin the obtained Local.
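Putting the pieces together, the pin path might look roughly like this; INDEX, ThreadIndex, and COLLECTOR.local_at are hypothetical names used only for illustration, building on the acquire_index/release_index sketch above:

// Hypothetical wrapper so the index goes back to the free list when the
// thread exits and its thread-locals are dropped.
struct ThreadIndex(usize);

impl Drop for ThreadIndex {
    fn drop(&mut self) {
        release_index(self.0);
    }
}

thread_local! {
    // Lazily initialized the first time this thread pins.
    static INDEX: ThreadIndex = ThreadIndex(acquire_index());
}

pub fn pin() -> Guard {
    INDEX.with(|index| {
        // Hypothetical lookup into the global collector's table of Locals;
        // registration fills the slot the first time this thread pins.
        let local = COLLECTOR.local_at(index.0);
        local.pin()
    })
}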
This idea seems like it would be too slow, but it's not! I've actually managed to code an ugly prototype that matches the pinning performance of our current implementation (both are 12 ns).
The main benefit of this approach is that we don't need Handles anymore. The whole Collector interface can look like this:
struct Collector { ... }

impl Collector {
    fn new() -> Self;
    fn pin(&self) -> Guard;
    fn is_pinned(&self) -> bool;
}

impl Clone for Collector { ... }
To pin the current thread, you can just do collector.pin(). :)
Using this method, specifying a custom collector for each data structure would be trivial.
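For instance, giving each data structure its own collector could then be as simple as the following (with_collector is a hypothetical constructor name echoing the earlier discussion):

// Two deques with independent collectors: their garbage is tracked and
// reclaimed separately from the global collector and from each other.
let a: Deque<u32> = Deque::with_collector(Collector::new());
let b: Deque<u32> = Deque::with_collector(Collector::new());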
I very much like @stjepang's handle-less proposal! But I'm concerned about portability. Can we call __tls_get_addr in stable Rust?
@jeehoonkang __tls_get_addr comes from glibc. Other platforms have different but similar implementations of TLS.
We don't have to call __tls_get_addr directly. We only need a function that returns the current thread ID as a number of type usize. To do that, we'd just have to call thread_id::get() from the thread-id crate, which works on all Unixes, Windows, and Redox.
Sorry for the vague explanation. I haven't fully solidified the whole idea, but I'd like to implement a prototype in a few weeks so that we can experiment with it. But first I'd like to push forward with crossbeam-deque and crossbeam-channel, and then come back to custom collectors.
In PR #21, @jeehoonkang and @Vtec234 pointed out that it should be possible to specify a custom Collector in the deque constructor (Deque::new). This is not a high-priority issue at the moment, but let's kick off a discussion - how would we go about implementing such a feature?