Closed simonask closed 2 years ago
The relevant benchmark results:
chars_english_1000/count
time: [153.49 ns 153.60 ns 153.73 ns]
chars_english_1000_std/count
time: [42.519 ns 42.535 ns 42.552 ns]
chars_japanese_1000/count
time: [153.78 ns 154.33 ns 155.03 ns]
chars_japanese_1000_std/count
time: [42.539 ns 42.555 ns 42.572 ns]
Huh, that's odd. On x86_64 Linux (disabling the SIMD impl for comparison), the usize-based str_indices code is about twice as fast as the standard library implementation in your benchmarks.
Unfortunately, I don't have an M1 to test on. But if you'd like to take a crack at optimizing this, I'd happily accept a PR. Since this seems M1/ARM-specific, creating a new ARM64-specific ByteChunk impl would probably make sense; it could later be migrated to a SIMD implementation once the ARM64 intrinsics are stabilized.
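For reference, a chunked count that needs no SIMD intrinsics at all can be done with SWAR on plain usize words. This is a minimal sketch of the general technique (it is not str_indices' actual ByteChunk implementation): a char count equals total bytes minus UTF-8 continuation bytes (bytes matching `0b10xxxxxx`), and continuation bytes can be detected eight at a time in one word.

```rust
/// Count `char`s in `s` by subtracting UTF-8 continuation bytes from the
/// byte length, processing one `usize` word at a time (SWAR, no SIMD).
/// Illustrative sketch only, not str_indices' real implementation.
fn count_chars_swar(s: &str) -> usize {
    // 0x0101...01: one set bit per byte lane.
    const ONES: usize = usize::MAX / 0xFF;
    let bytes = s.as_bytes();
    let mut continuations = 0usize;

    let mut chunks = bytes.chunks_exact(core::mem::size_of::<usize>());
    for chunk in chunks.by_ref() {
        let word = usize::from_ne_bytes(chunk.try_into().unwrap());
        // A continuation byte has bit 7 set and bit 6 clear.
        let high = word & (ONES * 0x80); // bit 7 of each byte
        let second = (word << 1) & (ONES * 0x80); // bit 6, moved to bit 7
        let cont = high & !second; // 0x80 in each continuation-byte lane
        // Horizontal byte sum: lanes are 0 or 1, so the sum fits in a byte.
        continuations += ((cont >> 7).wrapping_mul(ONES)) >> (usize::BITS - 8);
    }
    for &b in chunks.remainder() {
        continuations += ((b & 0xC0) == 0x80) as usize;
    }
    bytes.len() - continuations
}
```

The shift in `(word << 1)` can leak a bit across byte lanes, but the `0x8080...80` mask discards it, so the per-byte test stays exact.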
After digging a bit more, I found the explanation: I was running the benchmark on nightly, which appears to contain a fast path for chars().count() (core::str::count::do_count_chars()) that was added relatively recently. Diving into the disassembly, it does indeed get compiled to SIMD instructions on ARM64, which explains the difference in performance.
Running the benchmark on stable, str_indices is indeed much faster, by about 25%.
It may be worth looking into the approach taken in nightly, which doesn't seem to involve SIMD intrinsics. Right now it relies on the unstable as_chunks feature; I'm not sure if that makes a difference.
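For illustration, the same chunking idea can be approximated on stable Rust with chunks_exact standing in for the unstable as_chunks. This is a hedged sketch (not the actual do_count_chars code): the point is that the inner loop runs over a fixed-size chunk, which the compiler can fully unroll and auto-vectorize.

```rust
/// Count `char`s by counting non-continuation bytes in fixed-size chunks.
/// The constant-length inner loop is what makes auto-vectorization easy.
/// Sketch of the general approach only, not nightly's implementation.
fn count_chars_chunked(s: &str) -> usize {
    const CHUNK: usize = 16;
    let bytes = s.as_bytes();
    let mut count = 0usize;

    let mut chunks = bytes.chunks_exact(CHUNK);
    for chunk in chunks.by_ref() {
        // Per-chunk accumulator: at most CHUNK (16), so a u8 suffices.
        let mut sub = 0u8;
        for &b in chunk {
            // Leading byte of a char: anything that is not 0b10xxxxxx.
            sub += ((b & 0xC0) != 0x80) as u8;
        }
        count += sub as usize;
    }
    // Scalar tail for the last len % CHUNK bytes.
    for &b in chunks.remainder() {
        count += ((b & 0xC0) != 0x80) as usize;
    }
    count
}
```

With as_chunks on nightly, the chunk would be a `&[u8; 16]` rather than a length-16 slice, which gives the optimizer the same fixed-trip-count guarantee without the chunks_exact bounds bookkeeping.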
Ah, interesting. Yeah, doing something along the lines of their chunking approach could make sense. It looks like the intention is to allow the compiler to unroll parts of the loop more reliably...? Which I guess then lets the compiler vectorize it more easily.
My suspicion is that would benefit (or at least not hurt) the x86_64 code path as well, so as long as we can do it in a way that doesn't hurt code clarity too much, I'm all for it.
@simonask I pushed some changes to master that, although not quite the same as the nightly implementation, may nevertheless make things easier for the compiler to vectorize. So if you get the chance, feel free to give it another benchmark run and see if anything has improved.
(I honestly have no idea if it will actually help, but it's worth a shot. And the changes make the code cleaner/more comprehensible IMO anyway, so it was worthwhile regardless.)
Awesome!
Results for english_1000/count on stable:
chars_english_1000/count
time: [36.766 ns 36.870 ns 36.975 ns]
chars_english_1000/count_std
time: [219.20 ns 219.72 ns 220.24 ns]
And on nightly:
chars_english_1000/count
time: [35.357 ns 35.396 ns 35.453 ns]
chars_english_1000/count_std
time: [42.195 ns 42.261 ns 42.353 ns]
So that's a 6× speedup over std on stable, and a 20% speedup over nightly. Very nice! :-)
From a cursory look at the disassembly, it looks like the compiler is indeed able to vectorize much more now.
Awesome! I'll close this as fixed, then. Thanks for the report and the help testing!
I did some basic benchmarks comparing str_indices to the iterator-based equivalents in std (see #7), and chars::count() seems to take about twice as long on average.
This is on an Apple M1 Pro CPU, for which there is no SIMD optimization implemented. Unfortunately, the ARM64 equivalents of the SSE2 integer vector types are not available in stable Rust yet, but it's possible that using [u64; 2] as the chunk type instead of usize could hint the optimizer to use vector instructions?
Anyway - thanks for the work on this library, I've found it very useful!
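The [u64; 2] idea from the report could be sketched roughly like this (a hypothetical illustration, not tied to str_indices' actual ByteChunk trait): each 16-byte chunk is split into two u64 lanes and the continuation-byte test runs lane-wise, in the hope that the optimizer maps the pair to a single 128-bit vector operation.

```rust
/// Count `char`s using `[u64; 2]`-shaped chunks: two 8-byte SWAR lanes
/// per 16-byte block, which an optimizer may lower to 128-bit vector ops.
/// Hypothetical sketch of the idea from the issue report.
fn count_chars_u64x2(s: &str) -> usize {
    const ONES: u64 = u64::MAX / 0xFF; // 0x0101010101010101
    let bytes = s.as_bytes();
    let mut continuations = 0u64;

    let mut chunks = bytes.chunks_exact(16);
    for chunk in chunks.by_ref() {
        let lanes: [u64; 2] = [
            u64::from_ne_bytes(chunk[..8].try_into().unwrap()),
            u64::from_ne_bytes(chunk[8..].try_into().unwrap()),
        ];
        for w in lanes {
            // 0x80 in each byte lane that holds a UTF-8 continuation byte.
            let cont = (w & (ONES * 0x80)) & !((w << 1) & (ONES * 0x80));
            // Horizontal byte sum via multiply; result lands in the top byte.
            continuations += ((cont >> 7).wrapping_mul(ONES)) >> 56;
        }
    }
    for &b in chunks.remainder() {
        continuations += ((b & 0xC0) == 0x80) as u64;
    }
    bytes.len() - continuations as usize
}
```

Whether this actually beats the plain usize version depends on how well the backend fuses the two lanes; that's exactly the kind of thing the M1 benchmarks above would need to answer.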