Improve performance of `extend_from_slice` where `T: Copy`

fitzgen / bumpalo

A fast bump allocation arena for Rust

https://docs.rs/bumpalo

Apache License 2.0

1.34k stars 109 forks source link

Improve performance of `extend_from_slice` where `T: Copy` #235

Closed zslayton closed 4 months ago

zslayton commented 4 months ago

When using a Vec<'_, u8>, I was surprised to see that my_vec.extend_from_slice("SUCCESS".as_bytes()); produced assembly that looks like this on x86_64:

(mac OS 14.3 Sonoma, Intel processor, rustc v1.76.0)

Notice that it's performing a reserve, error check, and push for each byte in the input slice. I had expected LLVM to reduce this to something like a single reserve, error check, and memcpy for the entire slice.

I'd like to contribute a change analogous to the one recently merged in #229 but for &[u8] instead of &str. Before I get started, I wanted to check whether 1) you'd be interested in merging such an optimization and 2) what the API should look like (since we don't have specialization to customize extend_from_slice for Vec<'_, u8>).

overlookmotel commented 4 months ago

@zslayton Side question: What did you use to produce the visualization of assembly branch structure above? It's really nice!

Please note, we actually saw a very slight performance degredation from #229 in OXC https://github.com/oxc-project/oxc/pull/2417. I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to copy_nonoverlapping. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long.

zslayton commented 4 months ago

What did you use to produce the visualization of assembly branch structure above?

I used Cutter, an open source reverse engineering tool. It's very helpful!

Please note, we actually saw a very slight performance degredation from https://github.com/fitzgen/bumpalo/pull/229 in OXC https://github.com/oxc-project/oxc/pull/2417. I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to copy_nonoverlapping. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long.

That's good food for thought, thank you for raising it--I'll defer to @fitzgen as to whether it needs to be addressed. My use case is a mix of short strings and byte arrays that are anywhere from empty to several kilobytes in typical scenarios. The change in #236 was a very large performance increase in all of my existing benchmarks, but I didn't closely examine the performance of small slices.

zslayton commented 4 months ago

@overlookmotel I ended up extending the benchmark to test a variety of input sizes. You can see the results here.

zslayton commented 4 months ago

This was fixed by #236.