Noticeable overhead in crossbeam-epoch under many Locals

ryoqun commented 2 years ago

Hi, nice to meet you and thanks for maintaining this crate. :)

I found some kind of performance issue spanning cross-crate code interactions, ultimately resulting in many cpu cycles being wasted at crossbeam-epoch. And I'm wondering where the proper fix could be placed across those related crates.

In short, there are circumstances where crossbeam-epoch's epoch bookkeeping code exposes significant overhead. And our production code was hit at it.

Dirtiest hack would be reducing frequency of gc collection (= global().collect(&guard);) by increasing PINNINGS_BETWEEN_COLLECT like this:

$ git diff
diff --git a/crossbeam-epoch/src/internal.rs b/crossbeam-epoch/src/internal.rs
index de208b1..0394c03 100644
--- a/crossbeam-epoch/src/internal.rs
+++ b/crossbeam-epoch/src/internal.rs
@@ -387,7 +387,7 @@ fn local_size() {
 impl Local {
     /// Number of pinnings after which a participant will execute some deferred functions from the
     /// global queue.
-    const PINNINGS_BETWEEN_COLLECT: usize = 128;
+    const PINNINGS_BETWEEN_COLLECT: usize = 128 * 128;

     /// Registers a new `Local` in the provided `Global`.
     pub(crate) fn register(collector: &Collector) -> LocalHandle {

However, I'm not fully sure this is the right fix (esp, regarding its ramifications due to reduced garbage collections).

Let me explain the rather complicated context a bit.

Firstly, crossbeam-epoch is used by crossbeam-deque (duh), which in turn by rayon (task scheduler library) as task queues , then by solana-validator (which experinced the performance issue; https://github.com/solana-labs/solana/issues/22603).

so far, so good. it's just normal use case of rayon for multi cores by an application code.

The twist is that solana-validator is holding many rayon thread pools managed by its internal subsystems. So, it's well exceeding system's core count by great factor like 2000 threads on a 64 core machine.

(We know this is a bit silly setup. But every subsystem isn't 100% cpu persistently. Rather they're mostly idling. On the other hand, we want to maximize processing throughput/minimize latency at the time of peak load. Also, casual top -H instropection and granular kernel thread priority tuning is handy. Lastly, sharing a single or several thread pool would introduce unneeded synchronization cost across subsystems and implementaion complexities in solana-validator code. All in all, each independent thread pools makes sense to us at least for now.)

So, that whopping 2000 (rayon) threads means that all of them are registering as Locals to the singleton crossbeam-epoch's Global. Then, global().collect() suddenly become very slow because it's doing linear scan over the Locals (= O(n))...

(Then, this affects all indepedent rayon pools inside a process. That's because of the use of the singleton Global. This seemingly-unrelated subsytem's performance degradtion was hard to debug, by the way...)

Regarding the extent of the overhead, I observed certain rayon-heavy subsystem is ~5% faster with the above 1-line hack alone in walltime. And, I also saw 100x reduction of overhead by Linux's perf.

Possible solutions:

Accept this dirty hack as is?
... or polish it so that the gc frequency could be adaptive to the number of Locals?
Adjust rayon to use separate Globals for each thread pools as a kind of scope?
Introduce background gc collection thread like jemalloc?
Reduce thread pool in solana-validator?

ryoqun commented 2 years ago

oh, also, i have a small reproducible program and monkey-patched crossbeam-epoch and rayon. please let me know if it's needed to drive this issue forward.

ryoqun commented 2 years ago

Possible solutions: ...

Yet another solution would probably be starting to accept a new type parameter (with default type to be compatible) over Global/Local. And that type parameter will be responsible to provide singleton storage for all those registered Locals.

So that, each thread pool is isolated from each other in terms of crossbeam-epoch instances. so, the overhead won't happen to begin with. Also, maybe we can configure PINNINGS_BETWEEN_COLLECT via the new trait.

maybe, this impl strategy is similar std::HashMap's S = RandomState type param or Vec's A = Allocator type param.

ryoqun commented 2 years ago

@taiki-e hey, I know your status show busy. But I waited for a week. Could you take a look at this issue, if possible? Or, could you refer this to other maintainers? thanks :)

taiki-e commented 2 years ago

If we handle this on our end, it seems increasing PINNINGS_BETWEEN_COLLECT or to make it configurable is okay. Which should be chosen depends on what the adverse effects of increasing it would be.

(By the way, I'm not sure why you are using a thread pool specifically intended for expensive CPU-bound computations like rayon to handle a large number of tasks that are mostly idling.)

taiki-e commented 2 years ago

Adjust rayon to use separate Globals for each thread pools as a kind of scope?

cc @cuviper: Any thoughts on this idea (or this issue)?

cuviper commented 2 years ago

With the caveat that I only have a rough intuition of how crossbeam-epoch actually works... :sweat_smile:

Adjust rayon to use separate Globals for each thread pools as a kind of scope?

This sounds like it could be fine for stuff that's really internal to the pool, like the thread-specific deques, but is it a problem to delay collection for stuff that straddles that boundary? I'm thinking of the pool's injector which may be pushed from outside the pool and popped within, or anything that the user brings themself that crosses the pool. If that pool goes idle while the rest of the process is still working heavily, there could be memory in that pool Global that's not reclaimed. Maybe that's an issue for the internal deques too, I don't know.

Yet another solution would probably be starting to accept a new type parameter (with default type to be compatible) over Global/Local. And that type parameter will be responsible to provide singleton storage for all those registered Locals.

I think this could make sense for the thread deques, if we create them with a pool Global and have a bounded number of attached Local just for each thread in the pool. No other thread needs to touch those deques.

oh, also, i have a small reproducible program and monkey-patched crossbeam-epoch and rayon.

Which solution is that, just the increased constant?

... or polish it so that the gc frequency could be adaptive to the number of Locals?

That also sounds like a reasonable idea to me.

ryoqun commented 2 years ago

really appreciate for spending on this issue. :)

If we handle this on our end, it seems increasing PINNINGS_BETWEEN_COLLECT or to make it configurable is okay.

thx for confirmation. i'm thinking to submit a pr to make it configurable.

Which should be chosen depends on what the adverse effects of increasing it would be.

as far as i tested increasing PINNINGS_BETWEEN_COLLECT even with the 128x factor. i saw no noticeable difference, just saw the nicely greatly-reduced overhead. Originally, i expected large mem consumption. Maybe, solana-validator's rayon task queue's element byte size isn't significant in general.

it looks like PINNINGS_BETWEEN_COLLECT = 128 is chosen somewhat arbitrarily as far as i looked git/github, right? Is there some past discussions, can i refer to? Also, is there a bench specially regarding this constant?

(By the way, I'm not sure why you are using a thread pool specifically intended for expensive CPU-bound computations like rayon to handle a large number of tasks that are mostly idling.)

Well, recent rayon can park nicely with no outstanding tasks. (very much kudo to @cuviper, ref: https://github.com/rayon-rs/rayon/blob/master/RELEASES.md#release-rayon-140--rayon-core-180-2020-08-24 ) So, we want to maximize burst task throughput utilizing all cores and rayon is very handy for that with various ecosystem .par_iter() support across crates. :)

ryoqun commented 2 years ago

Yet another solution would probably be starting to accept a new type parameter (with default type to be compatible) over Global/Local. And that type parameter will be responsible to provide singleton storage for all those registered Locals.

I think this could make sense for the thread deques, if we create them with a pool Global and have a bounded number of attached Local just for each thread in the pool. No other thread needs to touch those deques.

thx for confirmation again! I'll create pr to rayon as well. stay tuned. :)

oh, also, i have a small reproducible program and monkey-patched crossbeam-epoch and rayon.

Which solution is that, just the increased constant?

yeah, the increased constant. here's rayon's one: https://github.com/ryoqun/rayon/commit/48efe667f1a81167a1794369c7eee40b85192d8e (note that it's just directing to upstream crossbeam revs, because of being transitive; i.e. no actual code changes in rayon per se)

... or polish it so that the gc frequency could be adaptive to the number of Locals?

That also sounds like a reasonable idea to me.

now, i think a process-wide singleton really-global Global won't be optimal no matter what (even with being adaptive), which is shared by all the isolated thread pool's threads otherwise.

As you already pointed out at the preceding paragraph, I reached to the same conclusion that as long as threads are closely grouped to a indivisual grouping like a thread pool, there should be no concern with multiple Globals for each thread pool. in that way, i think mem locality will improve and crossbeam-epoch's general book-keeping (and potentially deferred collection) won't leak into completely unrelated thread pool's thread, which was hard to debug.

@taiki-e @cuviper So, I'll move this forward with the configuration approach. Really thanks for the advice.

Maybe, I'll introduce EpochGroup trait and EpochGroup::global() or something at crossbeam-epoch and pass the type parameter way up to the application code via crossbeam-deque and rayon.

cuviper commented 2 years ago

Maybe, solana-validator's rayon task queue's element byte size isn't significant in general.

Rayon's queues only deal with opaque JobRefs, which are just two pointers each.

ryoqun commented 2 years ago

Maybe, I'll introduce EpochGroup trait and EpochGroup::global() or something at crossbeam-epoch and pass the type parameter way up to the application code via crossbeam-deque and rayon.

I made this work locally. However, I'm pivoting to the new feature flag approch due to...:

The whole change set is kind of large due to the need of passing new type parameter across board rayon->crossbeam-{deque, epoch}, which is tough for both me and both of reviewers (you).
Also, application code must be adjusted to use the newly-introduced apis with the type parameter, which is again tedious.
all in all, i think solana's problem is quite rare and niche, considering no one complained and seeing no outsider reaction to this issue. so not worth for the right-but-large solution, imo.

So, I want to take the lazy man's approach #862 along with rayon's corresponding pr (https://github.com/rayon-rs/rayon/pull/958)

@taiki-e, @cuviper: could you take a quick glance to decide this kind of ad-hoc feature addition would be ever merged into your crates? Certainly, I want to avoid vendoring... Or, do you have some stamina to review my original approach prs?

ryoqun commented 2 years ago

fyi, the original approch's newer api would look like this:

use crossbeam_deque::{CustomCollector, Collector, LocalHandle};

static MY_COLLECTOR: Lazy<Collector> = Lazy::new(Collector::new);

// note that, maybe i need to provide macro_rule! for the following boilerplate code.
thread_local! {
    static MY_HANDLE: LocalHandle = MY_COLLECTOR.register();
}

struct MyCustomCollector;

impl CustomCollector for MyCustomCollector {
    #[inline]
    fn collector() -> &'static Collector {
        &MY_COLLECTOR
    }

    #[inline]
    fn handle() -> &'static std::thread::LocalKey<LocalHandle> {
        &MY_HANDLE
    }
}

fn foo() {
    let pool = rayon::ThreadPoolBuilder::<_, MyCustomCollector>::new()
        ...;
    pool.install_for_iters(|type_marker| { // this is new! type_marker is parameterized with CustomCollector
        vec.
           .into_par_iter()
           .install_type(type_marker) // this is the new! we need to propagate static collector type to actual iters.
           .map(...)
        ...
    }
}

adamreichold commented 2 years ago

So, I want to take the lazy man's approach https://github.com/crossbeam-rs/crossbeam/pull/862 along with rayon's corresponding pr (https://github.com/rayon-rs/rayon/pull/958)

As an admittedly uninvolved bystander but user of Rayon, I would like to add that I think that at least this kind of solution might be better put into a private fork of crossbeam-epoch patched into solana-validator via Cargo's [patch] section. This is due to both the non-generality of the solution and the very specific conditions under which it is to be applied.

ryoqun commented 2 years ago

@adamreichold thanks for chiming in. :) loves collective wisdom. your idea sounds doable. so, i created https://github.com/solana-labs/solana/pull/26555

cuviper commented 2 years ago

I was not imagining any public API changes in rayon, not even feature flags. My thought was that we would just internally create a collector in rayon's Registry and create the appropriate deques with that. But that does still mean that crossbeam would have to expose that mechanism in their APIs.

crossbeam-rs / crossbeam

Noticeable overhead in crossbeam-epoch under many Locals #852