ryoqun opened this issue 2 years ago
oh, also, I have a small reproducible program and monkey-patched `crossbeam-epoch` and `rayon`. Please let me know if it's needed to drive this issue forward.
> Possible solutions: ...
Yet another solution would probably be to start accepting a new type parameter (with a default type, to stay compatible) over `Global`/`Local`. That type parameter would be responsible for providing the singleton storage for all the registered `Local`s, so that each thread pool is isolated from the others in terms of `crossbeam-epoch` instances and the overhead won't happen to begin with. Also, maybe we can configure `PINNINGS_BETWEEN_COLLECT` via the new trait.
Maybe this impl strategy is similar to `std::collections::HashMap`'s `S = RandomState` type param or `Vec`'s `A = Allocator` type param.
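To make the analogy concrete, here is a minimal sketch of what such a default type parameter could look like. Everything here (`EpochDomain`, `DefaultDomain`, `PoolDomain`, `Deque`) is hypothetical and not part of crossbeam's actual API; it only assumes `once_cell` for the static collector:

```rust
use crossbeam_epoch::{Collector, Guard, LocalHandle};
use once_cell::sync::Lazy;

/// Hypothetical trait (not crossbeam's API): picks which `Collector` ("Global")
/// a data structure registers its `Local`s with.
trait EpochDomain {
    fn collector() -> &'static Collector;
    /// Pin the current thread against this domain's collector.
    fn pin() -> Guard;
}

/// Hypothetical default that keeps today's behavior: the process-wide singleton.
struct DefaultDomain;

impl EpochDomain for DefaultDomain {
    fn collector() -> &'static Collector {
        // default_collector() returns the process-wide Collector used by crossbeam_epoch::pin().
        crossbeam_epoch::default_collector()
    }
    fn pin() -> Guard {
        crossbeam_epoch::pin()
    }
}

/// A domain owned by a single thread pool, so its epoch bookkeeping never sees
/// `Local`s from unrelated pools.
struct PoolDomain;

static POOL_COLLECTOR: Lazy<Collector> = Lazy::new(Collector::new);

impl EpochDomain for PoolDomain {
    fn collector() -> &'static Collector {
        &POOL_COLLECTOR
    }
    fn pin() -> Guard {
        thread_local! {
            // One LocalHandle per pool thread, registered against the pool's collector.
            static HANDLE: LocalHandle = POOL_COLLECTOR.register();
        }
        HANDLE.with(|h| h.pin())
    }
}

/// A deque-like type could then default the parameter, much like
/// `HashMap<K, V, S = RandomState>` or the unstable `Vec<T, A>`.
struct Deque<T, D: EpochDomain = DefaultDomain> {
    _marker: std::marker::PhantomData<(T, D)>,
}

fn main() {
    let _g1 = DefaultDomain::pin();
    let _g2 = PoolDomain::pin();
}
```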
@taiki-e hey, I know your status shows busy, but I've waited for a week. Could you take a look at this issue, if possible? Or could you refer it to other maintainers? Thanks :)
If we handle this on our end, it seems increasing `PINNINGS_BETWEEN_COLLECT` or making it configurable is okay. Which should be chosen depends on what the adverse effects of increasing it would be.
(By the way, I'm not sure why you are using a thread pool specifically intended for expensive CPU-bound computations like rayon to handle a large number of tasks that are mostly idling.)
> - Adjust rayon to use separate `Global`s for each thread pool as a kind of scope?
cc @cuviper: Any thoughts on this idea (or this issue)?
With the caveat that I only have a rough intuition of how `crossbeam-epoch` actually works... :sweat_smile:
> Adjust rayon to use separate `Global`s for each thread pool as a kind of scope?
This sounds like it could be fine for stuff that's really internal to the pool, like the thread-specific deques, but is it a problem to delay collection for stuff that straddles that boundary? I'm thinking of the pool's injector, which may be pushed from outside the pool and popped within, or anything that the user brings themselves that crosses the pool. If that pool goes idle while the rest of the process is still working heavily, there could be memory in that pool's `Global` that's not reclaimed. Maybe that's an issue for the internal deques too, I don't know.
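For illustration, a minimal `crossbeam-deque` sketch (not rayon's actual internals) of why the injector straddles the pool boundary: pushes come from an outside thread while a pool thread pops via its local worker:

```rust
use crossbeam_deque::{Injector, Steal, Worker};
use std::sync::Arc;

fn main() {
    // Shared injector: anything pushed here from outside the pool becomes
    // epoch-managed data that only pool threads will later pop and reclaim.
    let injector: Arc<Injector<u64>> = Arc::new(Injector::new());

    // An "outside" thread pushes work across the pool boundary.
    let producer = {
        let injector = Arc::clone(&injector);
        std::thread::spawn(move || injector.push(42))
    };
    producer.join().unwrap();

    // A "pool" thread drains the injector into its thread-local deque.
    let worker: Worker<u64> = Worker::new_fifo();
    loop {
        match injector.steal_batch_and_pop(&worker) {
            Steal::Success(job) => {
                println!("popped job {job}");
                break;
            }
            Steal::Empty => break,
            Steal::Retry => continue,
        }
    }
}
```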
> Yet another solution would probably be to start accepting a new type parameter (with a default type, to stay compatible) over `Global`/`Local`. That type parameter would be responsible for providing the singleton storage for all the registered `Local`s.
I think this could make sense for the thread deques, if we create them with a pool `Global` and have a bounded number of attached `Local`s, just for each thread in the pool. No other thread needs to touch those deques.
> oh, also, I have a small reproducible program and monkey-patched `crossbeam-epoch` and `rayon`.
Which solution is that, just the increased constant?
> ... or polish it so that the gc frequency could be adaptive to the number of `Local`s?
That also sounds like a reasonable idea to me.
Really appreciate you spending time on this issue. :)
> If we handle this on our end, it seems increasing `PINNINGS_BETWEEN_COLLECT` or making it configurable is okay.
Thanks for the confirmation. I'm thinking of submitting a PR to make it configurable.
> Which should be chosen depends on what the adverse effects of increasing it would be.
As far as I tested, increasing `PINNINGS_BETWEEN_COLLECT` even by a 128x factor showed no noticeable difference; I just saw the greatly reduced overhead. Originally, I expected large memory consumption. Maybe `solana-validator`'s rayon task queue's element byte size isn't significant in general.
It looks like `PINNINGS_BETWEEN_COLLECT = 128` was chosen somewhat arbitrarily, as far as I can tell from git/GitHub, right? Are there any past discussions I can refer to? Also, is there a bench specifically covering this constant?
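For what it's worth, here is a back-of-the-envelope sketch (my own arithmetic based on the O(n) scan and thread count described in this issue, not an existing crossbeam bench) of why bumping the constant helps roughly linearly:

```rust
// Amortized number of registered Locals walked per pin(): a collection that
// scans all `threads` Locals is triggered once every `pinnings_between_collect`
// pins on a given thread.
fn locals_scanned_per_pin(threads: usize, pinnings_between_collect: usize) -> f64 {
    threads as f64 / pinnings_between_collect as f64
}

fn main() {
    // ~2000 mostly-idle threads from the report: default constant vs. a 128x bump.
    println!("{}", locals_scanned_per_pin(2000, 128)); // prints 15.625
    println!("{}", locals_scanned_per_pin(2000, 128 * 128)); // prints 0.1220703125
}
```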
> (By the way, I'm not sure why you are using a thread pool specifically intended for expensive CPU-bound computations like rayon to handle a large number of tasks that are mostly idling.)
Well, recent rayon can park nicely when there are no outstanding tasks (very much kudos to @cuviper; ref: https://github.com/rayon-rs/rayon/blob/master/RELEASES.md#release-rayon-140--rayon-core-180-2020-08-24). So we want to maximize burst task throughput utilizing all cores, and rayon is very handy for that with the various ecosystem `.par_iter()` support across crates. :)
> > Yet another solution would probably be to start accepting a new type parameter (with a default type, to stay compatible) over `Global`/`Local`. That type parameter would be responsible for providing the singleton storage for all the registered `Local`s.
>
> I think this could make sense for the thread deques, if we create them with a pool `Global` and have a bounded number of attached `Local`s, just for each thread in the pool. No other thread needs to touch those deques.
Thanks for the confirmation again! I'll create a PR to rayon as well. Stay tuned. :)
> > oh, also, I have a small reproducible program and monkey-patched `crossbeam-epoch` and `rayon`.
>
> Which solution is that, just the increased constant?
Yeah, the increased constant. Here's rayon's one: https://github.com/ryoqun/rayon/commit/48efe667f1a81167a1794369c7eee40b85192d8e (note that it just points to the patched crossbeam revs, because the dependency is transitive; i.e., there are no actual code changes in rayon per se).
> > ... or polish it so that the gc frequency could be adaptive to the number of `Local`s?
>
> That also sounds like a reasonable idea to me.
Now I think a process-wide, really-global singleton `Global`, shared by all the otherwise-isolated thread pools' threads, won't be optimal no matter what (even with adaptiveness). As you already pointed out in the preceding paragraph, I reached the same conclusion: as long as threads are closely grouped into an individual grouping like a thread pool, there should be no concern with having a separate `Global` per thread pool. That way, I think memory locality will improve, and `crossbeam-epoch`'s general bookkeeping (and potentially deferred collection) won't leak into a completely unrelated thread pool's threads, which was hard to debug.
@taiki-e @cuviper So, I'll move this forward with the configuration approach. Thanks a lot for the advice.
Maybe I'll introduce an `EpochGroup` trait and `EpochGroup::global()` or something in `crossbeam-epoch`, and pass the type parameter all the way up to the application code via `crossbeam-deque` and `rayon`.
> Maybe `solana-validator`'s rayon task queue's element byte size isn't significant in general.
Rayon's queues only deal with opaque `JobRef`s, which are just two pointers each.
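For a sense of scale, a paraphrased sketch of what "two pointers" amounts to here (this is not rayon-core's exact `JobRef` definition):

```rust
// Roughly the shape being described: a type-erased job reference is just a data
// pointer plus an execute-function pointer, so queue elements stay tiny no
// matter how large the closure behind them is.
struct JobRefSketch {
    pointer: *const (),
    execute_fn: unsafe fn(*const ()),
}

fn main() {
    // Two pointers -> 16 bytes on a 64-bit target.
    println!("{}", std::mem::size_of::<JobRefSketch>());
}
```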
> Maybe I'll introduce an `EpochGroup` trait and `EpochGroup::global()` or something in `crossbeam-epoch`, and pass the type parameter all the way up to the application code via `crossbeam-deque` and `rayon`.
I made this work locally. However, I'm pivoting to the new feature-flag approach due to...:

So, I want to take the lazy man's approach #862 along with rayon's corresponding PR (https://github.com/rayon-rs/rayon/pull/958).

@taiki-e, @cuviper: could you take a quick glance and decide whether this kind of ad-hoc feature addition would ever be merged into your crates? Certainly, I want to avoid vendoring... Or, do you have the stamina to review the PRs for my original approach?
FYI, the original approach's newer API would look like this:
```rust
use crossbeam_deque::{Collector, CustomCollector, LocalHandle};
use once_cell::sync::Lazy;

static MY_COLLECTOR: Lazy<Collector> = Lazy::new(Collector::new);

// note that maybe i need to provide a macro_rules! for the following boilerplate code.
thread_local! {
    static MY_HANDLE: LocalHandle = MY_COLLECTOR.register();
}

struct MyCustomCollector;

impl CustomCollector for MyCustomCollector {
    #[inline]
    fn collector() -> &'static Collector {
        &MY_COLLECTOR
    }

    #[inline]
    fn handle() -> &'static std::thread::LocalKey<LocalHandle> {
        &MY_HANDLE
    }
}

fn foo() {
    let pool = rayon::ThreadPoolBuilder::<_, MyCustomCollector>::new()
        ...;
    pool.install_for_iters(|type_marker| { // this is new! type_marker is parameterized with CustomCollector
        vec
            .into_par_iter()
            .install_type(type_marker) // this is new! we need to propagate the static collector type to the actual iters.
            .map(...)
            ...
    });
}
```
> So, I want to take the lazy man's approach https://github.com/crossbeam-rs/crossbeam/pull/862 along with rayon's corresponding PR (https://github.com/rayon-rs/rayon/pull/958)
As an admittedly uninvolved bystander but user of Rayon, I would like to add that I think at least this kind of solution might be better put into a private fork of `crossbeam-epoch` patched into `solana-validator` via Cargo's `[patch]` section. This is due to both the non-generality of the solution and the very specific conditions under which it is to be applied.
@adamreichold thanks for chiming in. :) Love the collective wisdom. Your idea sounds doable, so I created https://github.com/solana-labs/solana/pull/26555
I was not imagining any public API changes in rayon, not even feature flags. My thought was that we would just internally create a collector in rayon's `Registry` and create the appropriate deques with that. But that does still mean that `crossbeam` would have to expose that mechanism in their APIs.
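Roughly the shape being described, as a purely hypothetical sketch: `crossbeam-deque` doesn't currently accept an external `Collector`, so the constructor named in the comment below is made up.

```rust
use crossbeam_epoch::{Collector, LocalHandle};

// Hypothetical: a pool-scoped collector living inside rayon's Registry.
struct Registry {
    collector: Collector,
    // ... injector, sleep machinery, per-thread stealers, etc.
}

impl Registry {
    fn new(num_threads: usize) -> Self {
        let registry = Registry {
            collector: Collector::new(),
        };
        for _ in 0..num_threads {
            // Each worker thread would keep a LocalHandle registered against the
            // pool-scoped collector, and its deque would be created against the
            // same collector (e.g. via a hypothetical Worker::new_fifo_in(&collector),
            // which crossbeam-deque would need to expose).
            let _handle: LocalHandle = registry.collector.register();
        }
        registry
    }
}

fn main() {
    let _registry = Registry::new(8);
}
```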
Hi, nice to meet you and thanks for maintaining this crate. :)
I found some kind of performance issue spanning cross-crate code interactions, ultimately resulting in many CPU cycles being wasted in `crossbeam-epoch`. And I'm wondering where the proper fix could be placed across those related crates.

In short, there are circumstances where `crossbeam-epoch`'s epoch bookkeeping code exposes significant overhead, and our production code was hit by it.

The dirtiest hack would be reducing the frequency of gc collection (= `global().collect(&guard);`) by increasing `PINNINGS_BETWEEN_COLLECT`, like the sketch below. However, I'm not fully sure this is the right fix (esp. regarding its ramifications due to reduced garbage collections).
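The original snippet isn't preserved in this extract; this is a minimal sketch of the kind of one-line change being described, where the `128 * 128` value is only an assumption based on the "128x factor" mentioned in a later comment:

```rust
// Inside crossbeam-epoch's internals (sketch only, not the actual patch):
// collect() is triggered once every PINNINGS_BETWEEN_COLLECT pins, so a larger
// value means proportionally fewer O(number-of-Locals) scans.
const PINNINGS_BETWEEN_COLLECT: usize = 128 * 128; // the default is 128
```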
Let me explain the rather complicated context a bit.
Firstly, `crossbeam-epoch` is used by `crossbeam-deque` (duh), which is in turn used by `rayon` (a task scheduler library) as its task queues, and then by `solana-validator` (which experienced the performance issue; https://github.com/solana-labs/solana/issues/22603).

So far, so good. It's just a normal use case of rayon for multiple cores by application code.
The twist is that `solana-validator` is holding many `rayon` thread pools managed by its internal subsystems. So, it's well exceeding the system's core count by a great factor, like 2000 threads on a 64-core machine.

(We know this is a bit of a silly setup. But no subsystem is at 100% cpu persistently; rather, they're mostly idling. On the other hand, we want to maximize processing throughput and minimize latency at times of peak load. Also, casual `top -H` introspection and granular kernel thread priority tuning is handy. Lastly, sharing a single thread pool (or several) would introduce unneeded synchronization cost across subsystems and implementation complexities in `solana-validator` code. All in all, independent thread pools make sense to us, at least for now.)

So, those whopping 2000 (rayon) threads mean that all of them are registered as `Local`s with the singleton `crossbeam-epoch` `Global`. Then, `global().collect()` suddenly becomes very slow because it's doing a linear scan over the `Local`s (= O(n))...

(This then affects all independent rayon pools inside the process, because of the use of the singleton `Global`. This seemingly-unrelated subsystem's performance degradation was hard to debug, by the way...)

Regarding the extent of the overhead, I observed that a certain rayon-heavy subsystem is ~5% faster in walltime with the above 1-line hack alone. And I also saw a 100x reduction of the overhead in Linux's `perf`.
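A minimal standalone sketch of that setup (this is not the author's actual reproducer, just an illustration of ~2000 mostly-idle registered threads plus one busy pinner):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let mut threads = Vec::new();

    // Each spawned thread pins once so its Local gets registered with the
    // process-wide Global, then mostly idles (like the validator subsystems).
    for _ in 0..2000 {
        let stop = Arc::clone(&stop);
        threads.push(std::thread::spawn(move || {
            drop(crossbeam_epoch::pin());
            while !stop.load(Ordering::Relaxed) {
                std::thread::sleep(Duration::from_millis(50));
            }
        }));
    }
    std::thread::sleep(Duration::from_secs(1)); // let them all register

    // The "busy" thread: every 128th pin triggers a collect() that walks the
    // ~2000 registered Locals.
    let start = Instant::now();
    for _ in 0..1_000_000 {
        drop(crossbeam_epoch::pin());
    }
    println!("1M pins took {:?}", start.elapsed());

    stop.store(true, Ordering::Relaxed);
    for t in threads {
        t.join().unwrap();
    }
}
```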
Possible solutions:

- Increase `PINNINGS_BETWEEN_COLLECT`, or polish it so that the gc frequency could be adaptive to the number of `Local`s?
- Adjust rayon to use separate `Global`s for each thread pool as a kind of scope?
- ... or handle it somewhere in `solana-validator`?