Closed zslayton closed 7 months ago
@overlookmotel expressed a concern that unconditionally using copy_nonoverlapping
might result in small performance regressions in the case of a small slices/strings.
I extended the benchmarks in this PR to test a variety of sizes between 4 bytes and 16KB. The results--in this microbenchmark, running on my machine--indicated that copy_nonoverlapping
is a win for slices of any size.
That's interesting! I wonder why we saw a slow-down in OXC (though an extremely small one) with the other change.
The other explanations I can think of are:
criterion::black_box
is not doing its job properly in your benchmarks, which would allow the compiler to convert the copy_nonoverlapping
calls to inlined copies.Sorry, not trying to pick holes in your work, just trying to understand.
Regarding short strings and possible perf regressions: nothing I've read seems serious enough to warrant reverting that optimization. I'm happy to receive follow up PRs that further improve things here, if there is anything further to improve.
Yes absolutely, the performance "regression" we saw was absolutely tiny, and for general purpose use 80x faster for long strings outweighs 0.1% slower for very short strings. I just wanted to put it on the record that there was some downside to the PR I made, rather than leave it unsaid, in case it affects anyone else.
It performs better with slice lengths which are multiples of 8 / 16 etc (seems plausible). Maybe try a "weird" length like 5 or 7 and see how it does with that?
This seemed plausible to me too! I added a few non-power-of-2 cases to the benchmark to check it out, but it looks like the improvement holds for those as well.
@zslayton Thanks for tolerating my nitty comments, and for benching shorter byte ranges. I remain curious about why we saw the slight slow-down we did on similar PR for strings, but probably I should stop worrying!
@zslayton Thanks for tolerating my nitty comments, and for benching shorter byte ranges.
Sure thing! I appreciate having another set of eyes on this, I'd rather spend some time double-checking things than accidentally cause a regression for all of these folks 🙃😬.
I remain curious about why we saw the slight slow-down we did on similar PR for strings, but probably I should stop worrying!
I'm curious too!
Addresses #235.
This PR is meant to be a jumping-off point for discussion. It adds a new inherent impl method to
Vec
:It moves the logic added in #229 from
String
toVec
and then re-implementspush_str
by calling the newextend_from_slice_copy
.I've duplicated the
push_str
benchmarks but forVec<'_, u8>
and the performance improves as expected: