Closed zslayton closed 4 months ago
@zslayton Side question: What did you use to produce the visualization of assembly branch structure above? It's really nice!
Please note, we actually saw a very slight performance degredation from #229 in OXC https://github.com/oxc-project/oxc/pull/2417. I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to copy_nonoverlapping
. I don't know if this should be addressed or not, as it's probably an unusual case (std
does not address it). I should have made a benchmark for short strings as well as long.
What did you use to produce the visualization of assembly branch structure above?
I used Cutter, an open source reverse engineering tool. It's very helpful!
Please note, we actually saw a very slight performance degredation from https://github.com/fitzgen/bumpalo/pull/229 in OXC https://github.com/oxc-project/oxc/pull/2417. I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to copy_nonoverlapping. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long.
That's good food for thought, thank you for raising it--I'll defer to @fitzgen as to whether it needs to be addressed. My use case is a mix of short strings and byte arrays that are anywhere from empty to several kilobytes in typical scenarios. The change in #236 was a very large performance increase in all of my existing benchmarks, but I didn't closely examine the performance of small slices.
@overlookmotel I ended up extending the benchmark to test a variety of input sizes. You can see the results here.
This was fixed by #236.
When using a
Vec<'_, u8>
, I was surprised to see thatmy_vec.extend_from_slice("SUCCESS".as_bytes());
produced assembly that looks like this onx86_64
:(mac OS 14.3 Sonoma, Intel processor,
rustc
v1.76.0)Notice that it's performing a reserve, error check, and push for each byte in the input slice. I had expected LLVM to reduce this to something like a single reserve, error check, and
memcpy
for the entire slice.I'd like to contribute a change analogous to the one recently merged in #229 but for
&[u8]
instead of&str
. Before I get started, I wanted to check whether 1) you'd be interested in merging such an optimization and 2) what the API should look like (since we don't have specialization to customizeextend_from_slice
forVec<'_, u8>
).