cessen / str_indices

Count and convert between various ways of indexing utf8 string slices.
Apache License 2.0

chars::count() seems to be slower than std (Apple M1 Pro) #8

Closed simonask closed 2 years ago

simonask commented 2 years ago

I did some basic benchmarks comparing str_indices to the iterator-based equivalents in std (see #7), and chars::count() seems to take about twice as long on average.

This is on an Apple M1 Pro CPU, for which there is no SIMD optimization implemented. Unfortunately, the ARM64 equivalents of the SSE2 integer vector types are not available in stable Rust yet, but it's possible that using [u64; 2] as the chunk type instead of usize could hint the optimizer to use vector instructions?
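For context on what's being counted: in UTF-8, every char begins with exactly one non-continuation byte, so counting chars reduces to counting bytes that don't match the continuation pattern 0b10xxxxxx. A minimal bytewise sketch of that idea (not str_indices' actual implementation):

```rust
// Minimal sketch of the operation under discussion (not str_indices'
// actual code): a UTF-8 char count equals the number of bytes that are
// NOT continuation bytes (0b10xxxxxx).
fn count_chars_bytewise(s: &str) -> usize {
    s.bytes().filter(|&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    assert_eq!(count_chars_bytewise("hello"), "hello".chars().count());
    assert_eq!(count_chars_bytewise("日本語"), 3); // three 3-byte chars
    println!("ok");
}
```

The chunked implementations below are all fancier ways of evaluating this same per-byte predicate over whole machine words at once.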

Anyway - thanks for the work on this library, I've found it very useful!

simonask commented 2 years ago

The relevant benchmark results:

chars_english_1000/count         time:   [153.49 ns 153.60 ns 153.73 ns]
chars_english_1000_std/count     time:   [42.519 ns 42.535 ns 42.552 ns]

chars_japanese_1000/count        time:   [153.78 ns 154.33 ns 155.03 ns]
chars_japanese_1000_std/count    time:   [42.539 ns 42.555 ns 42.572 ns]

cessen commented 2 years ago

Huh, that's odd. On x86_64 Linux (disabling the SIMD impl for comparison) the usize str_indices code is about twice as fast as the standard library implementations in your benchmarks.

Unfortunately, I don't have an M1 to test on. But if you'd like to take a crack at optimizing this, I'd happily accept a PR. Since this seems M1/ARM-specific, probably creating a new ARM64-specific ByteChunk impl would make sense, which can later be migrated to a SIMD implementation once the ARM64 intrinsics are stabilized.
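As a thought experiment, a simplified chunk abstraction along those lines might look like the following. This is a hypothetical sketch, not str_indices' actual ByteChunk trait; the [u64; 2] impl just composes two word-sized SWAR counts per iteration, in the hope that a wider per-iteration workload nudges the backend toward vector instructions:

```rust
// Hypothetical, simplified chunk trait (NOT str_indices' real ByteChunk):
// one impl per chunk width, counting UTF-8 continuation bytes per chunk.
trait Chunk: Copy {
    const SIZE: usize;
    fn load(bytes: &[u8]) -> Self;
    fn count_continuations(self) -> usize;
}

impl Chunk for u64 {
    const SIZE: usize = 8;
    fn load(bytes: &[u8]) -> Self {
        u64::from_ne_bytes(bytes[..8].try_into().unwrap())
    }
    fn count_continuations(self) -> usize {
        const ONES: u64 = u64::MAX / 0xFF; // 0x0101_0101_0101_0101
        // Per byte: 1 if bit 7 is set and bit 6 is clear (0b10xxxxxx).
        let cont = (self >> 7) & (!self >> 6) & ONES;
        // Horizontal sum of the per-byte flags via multiply.
        (cont.wrapping_mul(ONES) >> 56) as usize
    }
}

// A wider chunk: two word-sized counts fused per iteration.
impl Chunk for [u64; 2] {
    const SIZE: usize = 16;
    fn load(bytes: &[u8]) -> Self {
        [u64::load(&bytes[..8]), u64::load(&bytes[8..16])]
    }
    fn count_continuations(self) -> usize {
        self[0].count_continuations() + self[1].count_continuations()
    }
}

fn count_chars<C: Chunk>(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(C::SIZE);
    let mut cont = 0;
    for c in &mut chunks {
        cont += C::load(c).count_continuations();
    }
    for &b in chunks.remainder() {
        cont += ((b & 0xC0) == 0x80) as usize;
    }
    bytes.len() - cont
}

fn main() {
    assert_eq!(count_chars::<u64>("chunk-based counting"), 20);
    assert_eq!(count_chars::<[u64; 2]>("日本語テキスト"), 7);
    println!("ok");
}
```

Whether the compiler actually vectorizes the [u64; 2] path would need checking in the disassembly; nothing here forces it to.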

simonask commented 2 years ago

After digging a bit more, I found the explanation:

I was running the benchmark on nightly, which appears to contain a fast path for chars().count() (core::str::count::do_count_chars()) that was added relatively recently. Diving into the disassembly, it does indeed get compiled to SIMD instructions on ARM64, which explains the performance difference.

Running the benchmark on stable, str_indices is indeed faster, by about 25%.

It may be worth looking into the approach taken in nightly, which doesn't seem to involve SIMD intrinsics. Right now it relies on the unstable as_chunks feature; I'm not sure whether that makes a difference.
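The stable-Rust analogue of that style can be approximated with chunks_exact, which yields fixed-size blocks much like the unstable as_chunks does. A hedged sketch of the general technique, not the nightly code (the block size of 32 is a guess, not nightly's actual choice):

```rust
// Sketch of the fixed-size-block style of nightly's
// core::str::count::do_count_chars, approximated on stable with
// chunks_exact instead of the unstable as_chunks. The known trip count
// inside each block is what makes unrolling/vectorization reliable.
fn count_chars_blocked(s: &str) -> usize {
    const BLOCK: usize = 32; // assumed block size, not nightly's value
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(BLOCK);
    let mut count = 0usize;
    for chunk in &mut chunks {
        let block: &[u8; BLOCK] = chunk.try_into().unwrap();
        // Per-lane predicate over a fixed-size array: count bytes that
        // are not UTF-8 continuation bytes (0b10xxxxxx).
        let mut local = 0usize;
        for &b in block {
            local += ((b & 0xC0) != 0x80) as usize;
        }
        count += local;
    }
    for &b in chunks.remainder() {
        count += ((b & 0xC0) != 0x80) as usize;
    }
    count
}

fn main() {
    let long = "párágráph ".repeat(8); // 104 bytes, 80 chars
    assert_eq!(count_chars_blocked(&long), long.chars().count());
    println!("ok");
}
```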

cessen commented 2 years ago

Ah, interesting. Yeah, doing something along the lines of their chunking approach could make sense. It looks like the intention is to let the compiler unroll parts of the loop more reliably, which I guess then lets it vectorize more easily?

My suspicion is that would benefit (or at least not hurt) the x86_64 code path as well, so as long as we can do it in a way that doesn't hurt code clarity too much, I'm all for it.

cessen commented 2 years ago

@simonask I pushed some changes to master that, although not quite the same as the nightly implementation, may nevertheless make things easier for the compiler to vectorize. So if you get the chance, feel free to give it another benchmark run and see if anything has improved.

(I honestly have no idea if it will actually help, but it's worth a shot. And the changes make the code cleaner/more comprehensible IMO anyway, so it was worthwhile regardless.)

simonask commented 2 years ago

Awesome!

Results for english_1000/count on stable:

chars_english_1000/count         time:   [36.766 ns 36.870 ns 36.975 ns]
chars_english_1000/count_std     time:   [219.20 ns 219.72 ns 220.24 ns]

And on nightly:

chars_english_1000/count         time:   [35.357 ns 35.396 ns 35.453 ns]
chars_english_1000/count_std     time:   [42.195 ns 42.261 ns 42.353 ns]

So that's a 6x speedup against stable, and a 20% speedup against nightly. Very nice! :-)

From a cursory look at the disassembly, it looks like the compiler is indeed able to vectorize much more now.

cessen commented 2 years ago

Awesome! I'll close this as fixed, then. Thanks for the report and the help testing!