crossbeam-rs / rfcs

RFCs for changes to Crossbeam
Apache License 2.0
146 stars 13 forks source link

Replace scopes with guards #20

Closed ghost closed 6 years ago

ghost commented 6 years ago

Rendered

Implementation

Closes #19.

ghost commented 6 years ago

@joshlf Could you please take a look at the Accessing the default handle section, since it is related to elfmalloc? You've been dealing with thread-locals and tuning performance a lot, so your opinion would be very helpful here.

@jeehoonkang is advocating for the 1st option, while I want the 3rd. But in the end, I don't feel particularly strongly about it.

However, there is one thing I do strongly care about: let's avoid unsafe APIs whenever possible. In that spirit, the 2nd option would be better than the 1st one.

ghost commented 6 years ago

@jeehoonkang

I appreciate your concerns about the performance of the 3rd option. However, I'd like put some things into perspective first. Take a look at the following benchmarks:

#[bench]
fn pin_empty(b: &mut Bencher) {
    b.iter(|| epoch::pin());
}

#[bench]
fn default_handle_pin(b: &mut Bencher) {
    b.iter(|| epoch::default_handle().pin());
}

#[bench]
fn local_handle_pin(b: &mut Bencher) {
    let handle = epoch::default_handle();
    b.iter(|| handle.pin());
}

#[bench]
fn thread_local_handle_pin(b: &mut Bencher) {
    thread_local! {
        static MY: Handle = epoch::default_handle();
    }
    b.iter(|| MY.with(|handle| handle.pin()));
}

Results:

test default_handle_pin      ... bench:          13 ns/iter (+/- 1)
test local_handle_pin        ... bench:           9 ns/iter (+/- 1)
test pin_empty               ... bench:          12 ns/iter (+/- 1)
test thread_local_handle_pin ... bench:           9 ns/iter (+/- 0)

Looks like pinning itself costs 9 nanoseconds. Accessing the thread-local (that's LOCAL.with(...) inside epoch::pin()) adds 3 more nanoseconds. Finally, incrementing/decrementing the reference count to access the default handle costs 1 more nanosecond.

However, if you store the handle in an another thread-local (in this example, MY) and access it with MY.with(...), you don't pay the cost of reference counting. This is also faster than epoch::pin(), perhaps due to better inlining opportunities (?).

The point is, that 1 nanosecond paid on reference counting really doesn't matter at all. If you need maximum performance, it's easy to have it by creating a custom thread-local anyway.

jeehoonkang commented 6 years ago

@stjepang thanks for the kind explanation. Now I prefer the 3rd option. Let's go for it :)

For safety, I incorrectly assumed that both 1st and 3rd options invoke undefined behavior. However, the 3rd option seems to raise a specific panic: HANDLE is already destroyed. It's arguably less unsafe, I think.