fitzgen / bumpalo

A fast bump allocation arena for Rust
https://docs.rs/bumpalo
Apache License 2.0

Provides implementation of `Vec::extend_from_slice` optimized for `T: Copy` #236

Closed zslayton closed 7 months ago

zslayton commented 7 months ago

Addresses #235.

This PR is meant to be a jumping-off point for discussion. It adds a new inherent method to `Vec`:

impl<'bump, T: 'bump + Copy> Vec<'bump, T> {
    /// ...
    pub fn extend_from_slice_copy(&mut self, other: &[T]) {
        // ...
    }
}

It moves the logic added in #229 from `String` to `Vec` and then re-implements `push_str` by calling the new `extend_from_slice_copy`.

I've duplicated the `push_str` benchmarks, but for `Vec<'_, u8>`, and the performance improves as expected: (benchmark results screenshot)

zslayton commented 7 months ago

@overlookmotel expressed a concern that unconditionally using `copy_nonoverlapping` might result in small performance regressions for small slices/strings.

I extended the benchmarks in this PR to test a variety of sizes between 4 bytes and 16KB. The results (in this microbenchmark, running on my machine) indicated that `copy_nonoverlapping` is a win for slices of any size.

(benchmark results screenshots)

overlookmotel commented 7 months ago

That's interesting! I wonder why we saw a slow-down in OXC (though an extremely small one) with the other change.

The other explanations I can think of are:

  1. `criterion::black_box` is not doing its job properly in your benchmarks, which would allow the compiler to convert the `copy_nonoverlapping` calls to inlined copies.
  2. `copy_nonoverlapping` performs better with slice lengths that are multiples of 8 / 16 etc. (seems plausible). Maybe try a "weird" length like 5 or 7 and see how it does with that?

Sorry, not trying to pick holes in your work, just trying to understand.

overlookmotel commented 7 months ago

> Regarding short strings and possible perf regressions: nothing I've read seems serious enough to warrant reverting that optimization. I'm happy to receive follow up PRs that further improve things here, if there is anything further to improve.

Yes, absolutely; the performance "regression" we saw was tiny, and for general-purpose use, being 80x faster for long strings outweighs being 0.1% slower for very short strings. I just wanted to put it on the record that there was some downside to the PR I made, rather than leave it unsaid, in case it affects anyone else.

zslayton commented 7 months ago

> `copy_nonoverlapping` performs better with slice lengths that are multiples of 8 / 16 etc. (seems plausible). Maybe try a "weird" length like 5 or 7 and see how it does with that?

This seemed plausible to me too! I added a few non-power-of-2 cases to the benchmark to check it out, but it looks like the improvement holds for those as well.

(benchmark results screenshots)

overlookmotel commented 7 months ago

@zslayton Thanks for tolerating my nitty comments, and for benching shorter byte ranges. I remain curious about why we saw the slight slow-down we did on the similar PR for strings, but probably I should stop worrying!

zslayton commented 7 months ago

> @zslayton Thanks for tolerating my nitty comments, and for benching shorter byte ranges.

Sure thing! I appreciate having another set of eyes on this; I'd rather spend some time double-checking things than accidentally cause a regression for all of these folks 🙃😬.

> I remain curious about why we saw the slight slow-down we did on the similar PR for strings, but probably I should stop worrying!

I'm curious too!