Open ghost opened 7 years ago
Actually, I was writing on the same topic :) Thank you for writing up!
@stjepang Thanks so much for this comment! It's great to know that this stuff is being actively developed.
@ezrosent will probably have is own comments in addition to mine, but here are mine:
The way I see it, crossbeam is part of a larger class of systems that use a common pattern: there's a global singleton, and then there are thread-local caches for performance. These caches are not strictly necessary for correctness, but only improve performance. These thread-local caches are also singletons. Allocators also fall into this class.
There are a couple of issues that this pattern introduces:
To solve this issue, I've been envisioning the following pattern:
Sync
, but they are Send
, and then implement Clone
.Arc
on this global structure.Other patterns can easily be implemented in terms of this one:
lazy_static
crate). Each thread's TLS holds a handle which is initialized by cloning the global handle. Using this approach requires that the global structure be able to be modified directly in case crate functions are called after a thread's TLS has been destructed. This is best accomplished by having xxx_global
methods on the handles that take an immutable receiver. To give a concrete example, for crossbeam's epoch GC, this might mean being able to append garbage to a global (rather than thread-local) free list if a function is called after a thread's TLS has been destructed.The way I'm envisioning it, the pattern should be so cookie-cutter that you might in theory be able to make a macro that takes one of these handle-based implementations and generates the corresponding global singleton implementation.
To be concrete, for epoch GC, this might look something like:
// global state for a GC
struct GCGlobal<A: Alloc> { ... }
pub struct GC<A: Alloc>{ ... }
impl GC<Heap> {
fn new() -> GC<Heap> { ... }
}
impl<A: Alloc> GC<A> {
fn with_alloc(a: A) -> GC<A> { ... }
fn new_owned<T>(&mut self, t: T) -> Owned<T> { ... }
fn new_owned_global<T>(&self, t: T) -> Owned<T> { ... }
// etc
}
impl<A: Alloc> Clone for GC<A> {
fn clone(&self) -> GC<A> { ... }
}
and then, for the global singleton:
lazy_static!{ static ref GLOBAL_HANDLE: GC<Heap> = GC::new(); }
thread_local!{ static TLS_HANDLE: GC<Heap> = GLOBAL_HANDLE.clone(); }
impl<T> Owned<T> {
fn new(t: T) -> Owned<T> {
match TLS_HANDLE.try_with(|handle| handle.new_owned()) {
Ok(o) => o,
Err(_) => GLOBAL_HANDLE.new_owned_global()
}
}
}
Also, a quick follow-up to that: With respect to your question about avoiding allocations for no-std, you don't actually need to avoid allocations, you just need to ensure that the user can provide an allocator. Even in elfmalloc, we have a simple "bootstrapping" allocator (bsalloc) that we use for internal allocations, and we'd be able to use it with crossbeam if needed.
Also also: I read the crossbeam RFC, and with the exception of dedicated GC threads, I think that everything I've described is orthogonal to the changes that you guys are already considering. So hopefully I'm not coming along and trying to tell you to do your jobs differently :)
Thanks for reaching out, y'all! I haven't checked in on new crossbeam features in a while and I think the work you're doing looks awesome.
I think joshlf@ gave a pretty accurate summary of our current problems. I do think that having a handle-based API for the epoch GC would help us have a cleaner story around thread destruction. It looks like a lot of the work is already done (though I don't see any other implementors of Realm
other than the default global instance in this branch?).
Other than that, it would also be very helpful to have crossbeam structures parametric on an allocator. It seems like being able to create a local instance of the GC system is a precondition for this (there's no good way to override a default type parameter when you're talking to a global singleton). My initial thought on this was that we would have to wait for standard collections like Vec
to gain support for a custom allocator. However, if it is just Owned
and two custom data-structures that we have to worry about, we could potentially migrate this without waiting for the bigger standard crates.
P.S. I'm very happy to see that you can defer functions now! A while ago I made a hash table that required this feature and I just hacked it into crossbeam to do it. Now I can factor it out into it's own crate and depend on a stable branch.
I realized that I'd made a mistake in my original comment: the global static instance should itself be a handle. I've updated the original comment, and I also wrote up a concrete example (that compiles!) of a simple logging library that buffers log lines. Check it out.
Thanks so much for your writing detailed responses. I took some time to digest all this and think about it a little bit.
The code @joshlf has outlined seems pretty reasonable to me. I think we could design realms in a very similar manner. What do you think, @jeehoonkang?
We do want to support the Alloc
API, although gating it under the nightly
feature might be painful. I hope it gets stabilized soon. :)
No problem!
We do want to support the
Alloc
API, although gating it under thenightly
feature might be painful. I hope it gets stabilized soon. :)
No worries - the handle-based design alone would be a huge improvement for us; even without the Alloc
API we'd get a lot of benefit from it.
That said, I will say that I don't see it getting stabilized soon - there are still a lot of details to work out and a lot of other work to be done (e.g., incorporating it into libcollections).
Thank you for the valuable comments! I felt we need use cases of Crossbeam we can target on for improving Crossbeam, and elfmalloc is an excellent one.
I implemented a part of the "handle" proposal in: https://github.com/jeehoonkang/crossbeam-epoch/tree/handle . It doesn't strictly follow the pattern described above, though. Here are my two cents from this experience.
It is possible that handle will degrade the performance by adding an extra indirection to the global data. In the current implementation of Crossbeam, the per-thread handle ("mutator" in the current master
) does not have a reference to the global data. It is directly accessed from lazy_static!
globals. On the other hand, in the handle proposal, a per-thread handle contains a reference to the global data. it will incur the cost of an indirection. That said, I agree that it may be also possible that the indirection actually improve the performance, as accessing a lazy_static!
global is more involved then just loading the data. Also, the code looks much cleaner now. handle.rs
no longer directly references the global data in global.rs
.
I almost abandoned the "realm" proposal, but I think the need for supporting both std and no-std environment can be a strong motivation for it. Note that the realm proposal is almost implemented in the above branch: for early adopters, just defining a global::Global
and making a handle out of it is sufficient. But I think there might be a safety bug in the current implementation. Furthermore, I don't think the current interface is the most easiest one to use for no-std environment. I would like to ask your opinion on the interface of Crossbeam for no-std scenario.
That's fantastic! Good to know that proposal is actually reasonable enough to be implemented :)
It looks like your global pin
implementation unconditionally calls with
. Do you have an idea for how you'll handle the TLS key already having been destroyed? That's been the tricky part for us, and it's what motivated the xxx_global
methods in my proposal. If I'm not mistaken, it looks like this is what your use of a reference in the handles (rather than an Arc
, as my proposal uses) is meant to solve?
@joshlf yes, I noticed the use of try_with
in your example instead of with
, but I couldn't think of a good implementation of Crossbeam with try_with
. We may create a new handle in pin()
when TLS is already destructed, but we need to store it into a thread-local storage for later calls to pin()
and is_pinned()
. This is because pin()
interacts with not only the global data, but also thread-local data that will affect later calls to Crossbeam's interface. For e.g.:
... // TLS destructed
pin(|scope| {
let is_pinned = is_pinned(); // expected to be `true`, but how..?
});
A possible solution that changes the interface a little bit is, instead of providing the global pin()
and is_pinned()
functions, provide get_handle(): &Handle
which returns the handle stored in TLS if not destructed, or creates new handle and returns it. With handle: &Handle
users can call handle.pin()
and handle.is_pinned()
. Provided that the handle is not deallocated while the thread still has the handle, it will be safe. Also I "guess" it could be an acceptable interface..
I'm not familiar with how pinning works, but I was under the impression that one of the main functions of epoch's TLS is to maintain thread-local garbage lists. I was thinking that if a TLS value doesn't exist, garbage could just be appended directly onto the global list. I'm not sure how that'd interact with pinning, though.
@joshlf Yes, I think that part (locally storing garbages and flushing them to the global queue) is already implementable with try_with
, but Crossbeam's interface provides more functions than garbage buffering, and I currently have a good idea how to implement them properly. I'd like to hear from other Crossbeam developers 😄
@joshlf
I'm not familiar with how pinning works, but I was under the impression that one of the main functions of epoch's TLS is to maintain thread-local garbage lists.
Crossbeam's TLS is just the current thread handle. Pinning accesses the handle, marks it as pinned, executes a closure, and finally marks it as upinned.
You can think of a handle as (I'm simplifying a bit):
struct Handle(Arc<Inner>);
struct Inner { list: GarbageList, pinned: Option<Epoch> }
Do you have an idea for how you'll handle the TLS key already having been destroyed? That's been the tricky part for us, and it's what motivated the
xxx_global
methods in my proposal.
I thought in elfmalloc you would never call epoch::pin(|| ...)
, but instead create handles yourself and then call my_handle.pin(|| ...)
. Is this correct?
Note that this way Crossbeam wouldn't touch TLS at all, so you wouldn't have to worry about Crossbeam's TLS getting destroyed first.
I thought in elfmalloc you would never call
epoch::pin(|| ...)
, but instead create handles yourself and then callmy_handle.pin(|| ...)
. Is this correct?
Hmmm I suppose this might work. To clarify, you're saying that in the case that our TLS data has been destroyed (and thus we don't have access to our thread-local epoch handle), we would just clone a new handle from the global instance temporarily in order to use it for whatever work we needed to do?
@joshlf Yeah, that's what we'd suggest doing. If you don't have the thread-local handle anymore, just create a one-off temporary handle so that you can move forward.
Alternatively, I believe we could also implement some kind of anonymous pinning, where you can pin yourself without a handle for cases like these. In fact, I did something like this in C++ long time ago. It might a bit tricky to implement, but it's definitely possible.
I'd be curious to see the performance of the two approaches, but intuitively, I feel like the temporary handle approach is probably fine because that approach would only be used in a slow path that's executed at most a couple of times per thread (often never executed).
@joshlf In order to make sure I understood your proposal, I'd like ask a question: is the following what you want to build in the end?
I think the way elfmalloc is bootstrapped is really neat :smile: Also, the two instances of Crossbeam interact with each other in a very interesting way: a deallocation in the global instance may invoke the methods of the custom instance of Crossbeam via elfalloc. I hope it's okay performance wise, but I'm not really confident with that.
Yep, that looks right!
(no-std) elfalloc is implemented. It depends on bsalloc, and uses a custom instance of Crossbeam that uses bsalloc.
To clarify, you mean something like GC<BSAlloc>
(assuming GC
is the name for an instance of the epoch collector)? If so, yes.
I think the way elfmalloc is bootstrapped is really neat 😄 Also, the two instances of Crossbeam interact with each other in a very interesting way: a deallocation in the global instance may invoke the methods of the custom instance of Crossbeam via elfalloc. I hope it's okay performance wise, but I'm not really confident with that.
Thanks :) I don't think the interaction should be that bad honestly. The snarky answer is "abstraction," but I think that's actually reasonable here - most code (including the std version of Crossbeam) treats the allocator as a black box. So long as that allocator is performant (in speed and memory usage), it shouldn't matter how it's implemented, even if in practice it's implemented with another copy of the logic implemented in Crossbeam. I don't see how it'd be any different than if we had implemented our own custom concurrent collection internally, or used some other performant allocation strategy.
Hi, @joshlf and @ezrosent! Congrats on the first release of elfmalloc! :) This is fantastic - I'm always excited to see new work being created in the domain of concurrency/parallelism using Rust.
I see you're using Crossbeam 0.2. Not sure if you're familiar with the current state of affairs, but we're rewriting the epoch GC and redesigning the API, and hoping to release it soon. I'd suggest looking into this RFCs repository if you're curious to see where we're heading at.
I saw a comment on Reddit saying that, to remove the dependency on std from elfmalloc, crossbeam's epoch-based memory management would need to support no-std. Well, Crossbeam's new epoch-based GC currently allocates in two places:
We'll have to think how could allocations in those situations be avoided.
Moreover, in this RFC @jeehoonkang proposes introducing realms, which is a way to create multiple garbage collectors and manually register threads to them. Currently, there is only one implicit global epoch GC and threads get lazily and automatically registered behind the scenes. This is similar to Rayon allowing you to create your own thread pool instead of using the global one. Anyways, perhaps this is something that might interest you as well, since you're writing such a low-level thing (an allocator) that needs very precise control?
I wonder if you have any ideas on how we can help you here. More specifically, my questions would be:
no_std
-compatible API that you'd use look like?