crossbeam-rs / rfcs

RFCs for changes to Crossbeam

Improved epoch-based GC #3

Closed ghost closed 7 years ago

ghost commented 7 years ago

This is a proposal for improvements on the epoch-based garbage collector.

Rendered

ghost commented 7 years ago

This RFC is still a work in progress.

The "Motivation" sections offers a comprehensive overview of different aspects of epoch-based garbage collection, and what problems come up with it. I wrote the text mostly by recollecting the previous discussions on GitHub, Rust forums, and IRC.

I'd like to get some eyeballs on all this and hear what you think. The text is already long and dense, but there might still be important topics I missed, so let me know!

The "Detailed design" section is empty, but I will write it up after we get a clearer picture on what kind of design we want.

Here are my thoughts on all this:

  1. The way epochs are tracked in Coco is pretty good. Pinning and garbage collection are very efficient, and we need something like that in Crossbeam.

  2. Incremental collection is a must. I like the idea of keeping collected garbage in the staging area until the thread gets unpinned.

  3. Regarding specialized GCs (local to data structures), I'd like to discuss the topic in detail, but probably leave hard decisions for a later RFC. I'm unsure whether we'll need them at all, for various reasons.

  4. I'd like support for destructors in the global epoch GC, and the DeferredDropSafe approach seems like a good solution.

@joshtriplett If you have time and interest in participating in this RFC, your expertise might be of great help!

arthurprs commented 7 years ago

Some initial thoughts:

+1 for "style 4" tracking, it's understandable and there's an working implementation. Any possible extreme scaling issues could be tackled later.

+1 for thread garbage buffers for the reasons already stated in the rfc.

-1 for delayed garbage collection, we really shouldn't try to tackle problems that we don't have (yet). I also feel that people requiring that level of precision need much more specific things for us to discuss at this point (knobs being a big one).

Regarding drops, I feel that trying to protect against bad interactions is a lost battle. As long as they're run, there's no way to make it completely safe. That being said, we should provide just the bare minimum safety net (Send + 'static); the collections and users of crossbeam know better and should figure out/document the rest.

To collect some garbage, a thread should pick up some garbage from its thread-local cache and, if it has time for more, also steal some from the global pool.

"Has time" requires explanation, or we should just remove any timing constraints for now.

schets commented 7 years ago

Some more thoughts:

+1 on style 4 tracking and thread-local garbage buffers for reasons stated in the rfc.

I give a +1 on basic delayed GC since it's pretty simple to implement and is quite useful. You can just pin without doing any local/global GC (or at least no allocations, deallocations, or extra atomics).

+1 on incremental GC as well. I have an implementation here that can easily be updated for any sane thread-local buffer scheme. At first we should probably just consider 'time' to be some limit on allocations and later think about advanced knobs and tuning. There needs to be a mechanism to enforce that threads can't indefinitely pile up garbage.

Preventing garbage from getting stuck locally could be addressed at first with a global queue structure like the one used in Coco. It's probably worth adding some checks so that threads don't pointlessly push garbage into the global queue.

On thread registration, I think there might be a way to do this with a vector of pointers instead of a linked list. That would be great for reducing the latency of the epoch check, although it would complicate thread registration.

arthurprs commented 7 years ago

A little while ago I was talking with @stjepang over IRC and I suggested that each thread track the amount of garbage it generates. By doing that, each thread can estimate a minimum amount it needs to collect and how urgent that is. As a side effect it prevents unbounded garbage pile-up and helps work distribution.

The tracking can be done in several ways: it can be as simple as the number of deallocations, or a bit better by considering the size of each deallocation; maybe by number of bags; maybe only counting what is pushed to the global pool, to amortize the cost; etc.

If thread A produced X units of garbage, it tries to collect something like min(Y * X, MIN) units, starting with the local and then the global garbage. Y needs to be > 1 so that threads help each other, proportionally to the amount of garbage they generate themselves.

Auto collection can be triggered every P pins or after X exceeds a limit.
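A rough sketch of how this bookkeeping might look; the constants and the collect_local/collect_global helpers are placeholders, and MIN from the formula above is read here as a per-pass cap on collection work:

// Illustrative constants and helpers only; not part of the proposal itself.
const Y: usize = 2;                  // collect more than we produce, so threads help each other
const CAP: usize = 1024;             // "MIN" above, read as a per-pass cap
const PINS_PER_COLLECT: usize = 128; // the "every P pins" trigger
const MAX_PENDING: usize = 4096;     // the "X exceeds a limit" trigger

struct LocalGc {
    produced: usize, // units of garbage generated since the last collection
    pins: usize,
}

impl LocalGc {
    fn on_defer(&mut self, units: usize) {
        self.produced += units;
        if self.produced > MAX_PENDING {
            self.collect();
        }
    }

    fn on_pin(&mut self) {
        self.pins += 1;
        if self.pins % PINS_PER_COLLECT == 0 {
            self.collect();
        }
    }

    fn collect(&mut self) {
        let mut budget = (Y * self.produced).min(CAP);
        budget -= collect_local(budget); // thread-local garbage first...
        collect_global(budget);          // ...then steal from the global pool
        self.produced = 0;
    }
}

// Placeholders: destroy up to `budget` units and return how many were actually destroyed.
fn collect_local(_budget: usize) -> usize { 0 }
fn collect_global(_budget: usize) -> usize { 0 }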

Edits: fixed broken english

ghost commented 7 years ago

@schets

I give a +1 on basic delayed GC since it's pretty simple to implement and is quite useful. You can just pin without doing any local/global GC (or at least no allocations, deallocations, or extra atomics).

Cool, glad you like this idea! You expressed interest (over IRC) in using Crossbeam for low/predictable latency stuff. Would something like epoch::delay_gc be good enough? Do you have any use cases where we should provide even stricter control over GC?

At first we should probably just consider 'time' to be some limit on allocations and later think about advanced knobs and tuning. There needs to be a mechanism to enforce that threads can't indefinitely pile up garbage.

I like your implementation - that's a good way of measuring garbage and time spent on collection. But I think we should have even more precision when dealing with garbage. Not all garbage is created equal, as explained in the "Garbage objects" section.

For example, if we're destroying a queue node or skiplist node, that's worth one unit. However, if we're destroying a B-tree node containing 100 keys, it's worth 100 units. Likewise, destroying a large array of length 10K backing a Chase-Lev deque would be worth 10K units.

These units are not perfect. For example, destroying a skiplist node containing an i32 is not the same as destroying one that contains a [u64; 100] or a String. Perhaps it'd be wise to measure garbage in terms of memory use or the cost of destructors, but that may be too difficult to estimate. I think measuring garbage in terms of "elements" (e.g. [u64; 100] has 100 elements) is generally a pretty good approximation.

The function that adds garbage might have the following signature:

// Array of `T`s stored at address `object`, containing `length` elements.
// This garbage is worth `units`.
unsafe fn defer_free<T>(&self, object: *const T, length: usize, units: usize, scope: &Scope);

This piece of garbage is pushed into the current thread-local bag. If the bag now has more than HUGE_BAG units, it is sealed and moved directly into the global queue and a new empty thread-local bag is created. Otherwise, if it has more than FULL_BAG units, it is sealed and a new empty thread-local bag is created. While the sum of costs of all sealed thread-local bags is more than MAX_SEALED, we push the oldest sealed bag into the global queue.
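A non-normative sketch of that flow (Bag, the thresholds, and push_to_global are placeholders, not a concrete API proposal):

use std::collections::VecDeque;
use std::mem;

const FULL_BAG: usize = 64;     // seal the current bag once it holds more units than this
const HUGE_BAG: usize = 1024;   // ...and ship it straight to the global queue past this
const MAX_SEALED: usize = 4096; // cap on units kept in locally sealed bags

struct Bag {
    units: usize,
    // objects and their destructors elided
}

struct LocalGarbage {
    current: Bag,
    sealed: VecDeque<Bag>, // oldest bag at the front
    sealed_units: usize,
}

impl LocalGarbage {
    fn defer(&mut self, units: usize /* , object, length, ... */) {
        self.current.units += units;

        if self.current.units > HUGE_BAG {
            // Huge bags bypass the local cache entirely.
            let bag = mem::replace(&mut self.current, Bag { units: 0 });
            push_to_global(bag);
        } else if self.current.units > FULL_BAG {
            let bag = mem::replace(&mut self.current, Bag { units: 0 });
            self.sealed_units += bag.units;
            self.sealed.push_back(bag);
        }

        // Keep the local cache bounded by overflowing the oldest sealed bags globally.
        while self.sealed_units > MAX_SEALED {
            let bag = self.sealed.pop_front().expect("sealed_units > 0 implies a sealed bag");
            self.sealed_units -= bag.units;
            push_to_global(bag);
        }
    }
}

// Placeholder for pushing a sealed bag onto the global garbage queue.
fn push_to_global(_bag: Bag) {}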

On thread registration, I think there might be a way to do this with a vector of pointers instead of a linked list. That would be great for reducing the latency of the epoch check, although it would complicate thread registration.

That may work, but it's tricky to do. The problem is that we must load the current vector, clone it, push a new entry, and then use CAS to install the new vector. But someone may destroy the current vector while we're cloning it!

With linked lists this is easier. We only need to get the address of the current head and use CAS to install a new node as the head. We don't read the contents of the nodes, so other threads can't wreak havoc by destroying them.
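A minimal sketch of that push-only registration (the ThreadEntry fields and the static HEAD are illustrative):

use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// Illustrative entry layout; the real one would also carry garbage bags, etc.
struct ThreadEntry {
    epoch: AtomicUsize,           // the epoch (and pinned flag) this thread announces
    next: AtomicPtr<ThreadEntry>, // next registered thread
}

static HEAD: AtomicPtr<ThreadEntry> = AtomicPtr::new(ptr::null_mut());

fn register() -> *mut ThreadEntry {
    // Entries are never freed while the list is live, so readers only ever
    // traverse valid memory; registration itself never reads other entries.
    let entry = Box::into_raw(Box::new(ThreadEntry {
        epoch: AtomicUsize::new(0),
        next: AtomicPtr::new(ptr::null_mut()),
    }));

    let mut head = HEAD.load(Ordering::Acquire);
    loop {
        unsafe { (*entry).next.store(head, Ordering::Relaxed) };
        // Only the head pointer is CASed; no other entry is dereferenced here.
        match HEAD.compare_exchange_weak(head, entry, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return entry,
            Err(current) => head = current,
        }
    }
}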

@arthurprs

-1 for delayed garbage collection, we really shouldn't try to tackle problems that we don't have (yet). I also feel that people requiring that level of precision need much more specific things for us to discuss at this point (knobs being a big one).

Fair enough. We certainly don't need fine control over GC right now, but I'd still like to at least think about these problems in order to see what kind of flexibility we need to build into the core GC design.

Regarding drops, I feel that trying to protect against bad interactions is a lost battle. As long as they're run, there's no way to make it completely safe.

What do you mean by "there's no way to make it completely safe"? If we restrict deferred drops to types T: Send + 'static, executing these drops will be completely safe. Perhaps you meant "there's no way to make it safe without overrestricting types"? If so, I agree.
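To make that bound concrete, a hypothetical deferred-drop signature restricted this way might be (the name and the omitted receiver are illustrative):

// Hypothetical shape only: with T: Send + 'static, the recorded destructor can
// safely run on any thread at any later time, which is exactly what the GC needs.
unsafe fn defer_drop<T: Send + 'static>(object: *mut T /* , scope: &Scope */) {
    // A real implementation would record `object` together with a destructor
    // (e.g. `std::ptr::drop_in_place::<T>`) in the current thread-local bag.
    let _ = object;
}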

If thread A produced X units of garbage, it tries to collect something like min(Y * X, MIN) units, starting with the local and then the global garbage.

Yep, that sounds great! If we are going to use units as described above (1 unit = 1 "element") for deciding when to seal bags and move them into the global queue, then we could simply count sealings to decide how much garbage to destroy. For example, if the thread has sealed 5 bags, then we can destroy Y * 5 bags of garbage in order to make up for it.

Auto collection can be triggered every P pins or after X exceeds a limit.

Fully agree - we want both triggers.

arthurprs commented 7 years ago

I like and agree with your ideas, but for a first cut it's too much on the complicated side. Could we stick with something simpler? Like basic auto-weighting based on the size of T? See how it works out and take it from there.

jeehoonkang commented 7 years ago

Thank you for the efforts for this PR. It's huge :) Here are my thoughts:

arthurprs commented 7 years ago

On oversubscription: can we have per-processor garbage rather than per-thread garbage? Then maybe we can overcome the oversubscription problem.

I thought about that before but my impression was that it would be a performance hit in most cases. It's more expensive to fetch the corresponding struct (there are some tricks we can use to avoid calling sched_getcpu often) and it may require extra synchronization (the preemption unit is still the thread). Once you take into account that crossbeam pin() is measured in low ns it starts to look really hard.

ghost commented 7 years ago

@jeehoonkang

Though I didn't quite understand the diff between style 1 and 4.

Style 1 is current Crossbeam (three distinct vectors of garbage), and style 4 is Coco (just one long queue of garbage).

But all atomic operations seems to be Relaxed, so I am not really sure the diff is due to the atomic operations. Rather, I guess that the special code for x86 in Coco shines in this place

The CAS trick definitely helps a lot, but is not the only contributor to better performance. Without it pinning takes 20 ns, which is still quite a bit faster than pinning in Crossbeam.

We don't even know yet what is the semantics of a program mixing C/C++ and x86 concurrency primitives. Maybe it is beyond our concern, but I just want to note the difficulty.

Valid point. That CAS is definitely a bit sketchy. :) We'd probably be safe if we used inline assembly to directly call lock xchg, though.

Motivation: I feel it is cumbersome to nest pin() inside delay_gc(). Rather, I would like to specify the locations where performing GC is worthwhile.

Hmm, I think we have two different kinds of actors:

  1. Crossbeam users. They might want to avoid GC in some sections of the code. They usually don't care about pinning.
  2. Crossbeam data structures. They don't care about avoiding GC. They call pin in order to protect nodes from being reclaimed.

Users of Crossbeam would simply call delay_gc and that's it. They wouldn't nest calls to pin - that's what data structure implementations do. I suspect some confusion arose because the text is a bit weak - I need to explain all this better.
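For illustration, here is a rough mock of how the two entry points might relate; the closure-based shapes of pin and delay_gc are assumptions, not the final API:

// Hypothetical shapes for the two entry points; the real signatures are up to the RFC.
mod epoch {
    pub struct Scope;
    pub fn pin<T>(f: impl FnOnce(&Scope) -> T) -> T { f(&Scope) }
    pub fn delay_gc<T>(f: impl FnOnce() -> T) -> T {
        // In the real GC this would suppress collection for the duration of `f`.
        f()
    }
}

// Data-structure code pins to protect its loads; it never calls delay_gc.
fn pop_something() -> Option<u64> {
    epoch::pin(|_scope| {
        // ... load nodes, CAS, defer_free old nodes via the scope ...
        None
    })
}

// User code wraps a latency-sensitive region; the nested pins inside still work,
// but no garbage is collected until the region ends.
fn latency_sensitive_section() {
    epoch::delay_gc(|| {
        let _ = pop_something();
    });
}

fn main() {
    latency_sensitive_section();
}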

Maybe we can send the GC priority to unlink() as an argument. What do you think?

Yes, absolutely. We need some measure of priority/urgency/size. I'll try to write a concrete proposal on that.

On oversubscription: can we have per-processor garbage rather than per-thread garbage? Then maybe we can overcome the oversubscription problem.

The problem mostly stems from the fact that a pinned thread might get preempted and prevent epoch advancement until it gets resumed. This may block garbage collection for a long time.

Having per-processor garbage wouldn't help much. What would help, though, is if we could tell the OS: "If possible, please don't preempt the thread while executing this critical section." AFAIK, RCU in the Linux kernel works that way - it is guaranteed that no critical section ever gets preempted.

As far as the user can ignore the implementation details, I think we can utilize everything we can have.

Certainly. I think my main gripe with signals is that they reserve a signal (e.g. USR1), so users of Crossbeam must be aware of that because now they can't use that signal for themselves. In that sense signals are intrusive - they may restrict code unrelated to Crossbeam in some way.

I agree with @schets that vector-based DS will improve the latency of epoch check.

Unless I misunderstood something, that vector would simply be a vector of pointers to thread entries, which is no better than a linked list of thread entries.

Perhaps you had in mind a long vector with adjacent entries in memory? Registration would then involve iterating over the vector and reserving a vacant entry. If so, yeah, that would probably make checking epochs faster.
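A sketch of that flat-array variant, with a fixed maximum number of threads and a per-slot occupancy flag (all names are illustrative):

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

const MAX_THREADS: usize = 256;

#[derive(Default)]
struct Slot {
    occupied: AtomicBool, // false = vacant, may be reserved with a CAS
    epoch: AtomicUsize,   // the epoch (and pinned flag) this thread currently announces
}

struct Registry {
    // One contiguous allocation: the epoch check is a linear scan over adjacent
    // memory with no pointer chasing between entries.
    slots: Vec<Slot>,
}

impl Registry {
    fn new() -> Registry {
        Registry { slots: (0..MAX_THREADS).map(|_| Slot::default()).collect() }
    }

    // Registration walks the array and reserves the first vacant slot.
    fn register(&self) -> Option<usize> {
        for (i, slot) in self.slots.iter().enumerate() {
            if slot
                .occupied
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return Some(i);
            }
        }
        None // all slots taken; a real implementation would grow or fall back
    }

    fn unregister(&self, i: usize) {
        self.slots[i].occupied.store(false, Ordering::Release);
    }
}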

This week I'll make another iteration over the whole text, try to clean it up, and make a more concrete proposal.

schets commented 7 years ago

I've got a high-level comment on this - this seems like another mega-rfc. Many of the items seem orthogonal to each other (pinning method, priority tracking, object holding methods)

@stjepang a vector of pointers is significantly better than a linked list in terms of iteration time! With a linked list, each iteration of the loop carries a data dependency on the last, while in a vector there is no data dependency. This means that any out-of-order processor can schedule multiple iterations of the loop in parallel and, as a result, have multiple cache misses in flight. Since this list isn't read frequently and almost every entry is written to by a different thread than the reader, the case where each element is in the shared cache or main memory is the normal use case.

I have another gripe with signals - they can come at inconvenient times! With crossbeam as-is, you only incur epoch+GC costs when you enter an epoch. A signal might hit at any time including latency-sensitive periods.

I'm not sure if forcing priority in default unlinked is a good idea. Users probably don't want to try and decide how to prioritize garbage amongst stuff from other unknown data structures. We could add a different unlinked_priority function, but I expect that in normal use that would almost always prioritize by size.

@stjepang @jeehoonkang is right about the atomics - relaxed atomics compile to standard loads/stores and the performance gain is from merging the addition and the fence on x86. There may be a minor performance gain from merging the load of the in_critical counter and the thread-local epoch counter but I don't think that's enough to show in benchmarks

@stjepang I think I misunderstood the difference between GC #1 and GC #4. They appear extremely similar, with some small implementation details in how garbage is internally tracked and pinning is announced to the global GC. They both collect garbage locally, announce the current epoch by recording the current global one, and pin by fencing. We should delegate small implementation details like that to another RFC, because I don't actually think it's clear which is better.

ghost commented 7 years ago

First, apologies for the late reply. I'll try to be more responsive in the future.

@schets is right - this is too much material to digest in a single RFC. I've simplified the text and would like us to start with a reasonably good and simple design. Something that just supports simple drops, incremental collection, limited thread-local garbage, has faster pinning, and can urgently reclaim large garbage. What is proposed here is the first step towards better garbage collection. I'm open to future alterations of the design, API, and implementation details.

@schets Regarding the GC variant we should use - I see the GC variant #4 as basically a generalization of variant #1 - instead of having three vectors where two of them are blocked we have an infinite number of vectors. I'm leaning towards #4, but you're right that it's not a simple decision - let's just try both and see what works best. shrug. :) Implementation details are up in the air.

Regarding deferred drops - let's keep it simple for now. There will be a time in the future to explore this direction in more detail.

For urgent reclamation (e.g. the deque must be able to quickly reclaim old buffers), let's just use a flush-based approach (just like in Coco) for now. Ideally I'd prefer to have something like @schets' incremental branch in Crossbeam.

Vtec234 commented 7 years ago

Very nice! Great, all-around understandable RFC with a clear direction forward. I agree with the layout for the new design. I'm not quite sure I got the section on "Destructors and local GCs" though.

Those that allow borrowing elements. Examples are sets and maps. Find operation simply returns a reference to the found element. Removing an element involves finding it and marking it as removed.

This is mentioned, but the following section only outlines solutions to the problem of elements containing non-static references, rather than the problem of returning references from the structure:

The epoch GC will destroy garbage at some time in the future, but it doesn't guarantee when exactly. Since T might hold non-static references, we mustn't destroy Ts after those references expire. There are three solutions to the problem:

To expand on returning references from a structure: since the reference is now in user code, it might live as long as the structure itself. For this reason, we cannot destroy any element before the structure's destructor runs - and that is very bad, since a lot of garbage will accumulate until then. Combined with the problem of non-static references in elements, we would have to destroy everything the structure ever touched inside the structure's destructor. For this reason, I think we really shouldn't allow this. A crossbeam issue outlines two solutions: either return some kind of EpochGuard, which is also pretty bad since it may pin for a long time if the user is careless, or provide a with(key: &Key, fn: F) where F: FnOnce(Option<&Val>). The closure can then operate on the reference, but hopefully (and we document this) returns relatively quickly so that it doesn't pin for too long.
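For concreteness, a rough shape of that closure-based access (names and bounds are illustrative, not a settled API):

// A map that never hands out references directly; lookups go through a closure
// that runs while the thread is pinned, so the element cannot be freed under it.
struct ConcurrentMap<K, V> {
    // ... lock-free innards elided ...
    _marker: std::marker::PhantomData<(K, V)>,
}

impl<K, V> ConcurrentMap<K, V> {
    fn with<F, R>(&self, _key: &K, f: F) -> R
    where
        F: FnOnce(Option<&V>) -> R,
    {
        // Sketch: pin the epoch, find the node for the key, pass a borrow of its
        // value into `f`, and unpin once `f` returns. The borrow cannot escape
        // the closure, so unlinked nodes may be reclaimed as usual afterwards.
        f(None)
    }
}

fn example(map: &ConcurrentMap<String, u64>) -> Option<u64> {
    map.with(&"answer".to_string(), |val| val.copied())
}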

If we restrict ourselves to just allowing some non-static references inside elements, then we have to make sure everything is destroyed before the structure, but there isn't the problem of not being able to destroy anything until the structure itself is destroyed.

Another thing that we discussed on IRC some time ago is that we should make it clear that even with non-static references in deferred drops, not all kinds of references are allowed - specifically, only references that outlive the structure (or references to the structure itself) are valid. If, for example, we were to reference a method's local stack, that would crash horribly after the method finishes, since we'd be left with a structure that still references some local variable from that method. This is important, so I feel like it should be mentioned.
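A tiny example of the valid case being described, assuming elements borrow from an owner that outlives the structure:

// The element borrows `owner`, which outlives both the element and (in a real
// program) the concurrent structure holding it - the only kind of non-static
// reference that is sound to keep around until deferred destruction runs.
struct Element<'a> {
    name: &'a str,
}

fn main() {
    let owner = String::from("configuration that outlives the structure");
    {
        let elem = Element { name: &owner };
        println!("{}", elem.name);
        // Deferring the destruction of `elem` past this scope would be fine,
        // as long as it still happens before `owner` is dropped. Borrowing a
        // shorter-lived local instead would leave deferred destructors pointing
        // at freed stack memory.
    }
}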

ghost commented 7 years ago

@Vtec234 Good suggestion - I've added a paragraph or two that expand on the topic.

ghost commented 7 years ago

Has anyone else had the time to review the recent changes to the text?

I'd really like us to push this forward, and soon start implementing the basic features that are missing in the current GC. Then we can improve it over time and try all the competing implementation details (like how to track epochs, how to store garbage in memory, how to represent a list of registered threads, etc.).

jeehoonkang commented 7 years ago

Great RFC. My only comment is: what are the semantics when defer* is called multiple times on the same Ptr<T>? If it is not safe to do so, where should we put unsafe?

ghost commented 7 years ago

@jeehoonkang

If I got that right, your question is basically "where are the boundaries of unsafe code"? It's really difficult to say - I believe we still don't have a fully conclusive answer to that question.

We've always had this rule of thumb: "if you have an unsafe block anywhere in your code, then the whole surrounding module is already polluted". Gankro talks about that in his talk (around 10:00).

Niko also published a great deal of blog posts on this topic (e.g. this one).

jeehoonkang commented 7 years ago

@stjepang FYI, on the semantics of unsafe, RustBelt people made their draft paper available online (very) recently: https://people.mpi-sws.org/~dreyer/papers/rustbelt/paper.pdf I am reading it now.

schets commented 7 years ago

This looks great, and we can (and should) hash out the more minor differences in more specialized RFCs and PRs.