cessen / str_indices

Count and convert between various ways of indexing utf8 string slices.
Apache License 2.0

Improve performance when used in rope libraries #9

Closed · josephg closed this 2 years ago

josephg commented 2 years ago

Hey! I'm using the str_utils code in another project and I was reading the SSE code to understand it better.

In ByteChunk for sse2::__m128i the code currently does this:

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        const ONES: u64 = std::u64::MAX / 0xFF;
        let tmp = unsafe { std::mem::transmute::<Self, (u64, u64)>(*self) };
        let a = tmp.0.wrapping_mul(ONES) >> (7 * 8);
        let b = tmp.1.wrapping_mul(ONES) >> (7 * 8);
        (a + b) as usize
    }

.. Which is a neat trick, but it means the "vertical" loop has to flush the accumulator every 31 iterations so you don't overflow. I'm no expert at this stuff, but some reading recommended using PSADBW(x, 0) ("compute sum of absolute differences") to do the horizontal sum instead, which lets the accumulator run much longer between flushes.
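
For my own notes, here's the scalar version of that trick, and (as far as I can tell) where the 31-iteration limit comes from:

    // Scalar demo of the multiply trick (my understanding, not code from
    // str_utils): multiplying by 0x0101010101010101 puts the sum of all eight
    // input bytes in the top byte of the product, provided every partial
    // byte-sum stays <= 255 (otherwise carries corrupt it). In the vertical
    // loop each byte lane grows by at most 1 per iteration, so eight lanes
    // can sum to 8 * k after k iterations, and k = 31 (8 * 31 = 248) is the
    // last safe iteration count.
    fn sum_bytes_u64(x: u64) -> u64 {
        const ONES: u64 = u64::MAX / 0xFF; // 0x0101010101010101
        x.wrapping_mul(ONES) >> (7 * 8)
    }

    fn main() {
        assert_eq!(sum_bytes_u64(0x0102030405060708), 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8);
    }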

So changing the code to this:

    #[inline(always)]
    fn max_acc() -> usize {
        // Each byte lane is now summed exactly, so the accumulator can run
        // until a lane saturates at 255, rather than flushing every 31
        // iterations.
        255
    }

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        unsafe {
            let zero = sse2::_mm_setzero_si128();
            // PSADBW against zero sums the bytes of each 8-byte half into
            // the low 16 bits of the corresponding 64-bit lane.
            let diff = sse2::_mm_sad_epu8(*self, zero);
            let (low, high) = std::mem::transmute::<Self, (u64, u64)>(diff);
            (low + high) as usize
        }
    }

This yields a (modest) performance improvement on my Ryzen 5800:

ropey:master $ taskset 0x1 nice -10 RUSTFLAGS=-C target-cpu=native cargo criterion -- --measurement-time=10 index_convert
   Compiling ropey v1.3.2 (/home/seph/3rdparty/ropey)
    Finished bench [optimized] target(s) in 37.01s
index_convert/byte_to_char                                                                             
                        time:   [41.762 ns 41.799 ns 41.837 ns]
                        change: [-1.0722% -0.8577% -0.6697%] (p = 0.00 < 0.05)
                        Change within noise threshold.
index_convert/byte_to_line                                                                            
                        time:   [103.24 ns 103.25 ns 103.27 ns]
                        change: [+1.1863% +1.2842% +1.3631%] (p = 0.00 < 0.05)
                        Performance has regressed.
index_convert/char_to_byte                                                                             
                        time:   [87.674 ns 87.701 ns 87.730 ns]
                        change: [-1.6249% -1.5190% -1.4211%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/char_to_line                                                                            
                        time:   [153.53 ns 153.55 ns 153.57 ns]
                        change: [-1.4996% -1.3924% -1.2970%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/line_to_byte                                                                            
                        time:   [143.57 ns 143.65 ns 143.77 ns]
                        change: [-7.6773% -7.5422% -7.3956%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/line_to_char                                                                            
                        time:   [143.31 ns 143.34 ns 143.39 ns]
                        change: [-7.9232% -7.8228% -7.7185%] (p = 0.00 < 0.05)
                        Performance has improved.

Is this code change correct?

archseer commented 2 years ago

Unrelated but have you seen https://github.com/josephg/jumprope-rs/issues/1 (splitting str_utils into a new crate)?

cessen commented 2 years ago

Ah! Thanks for the tip! I'm certainly no expert in SIMD optimization either, but the change looks correct to me.

If you want to submit a PR for it, I'll happily merge it. Otherwise I'll make the change myself when I get around to it.

(Also, I second @archseer's poke to please look at the linked issue on jumprope. :-) )

josephg commented 2 years ago

Hmmm, I've just been playing around - and I'm gonna put some notes here for reference later.

Here's how simdutf counts code points:

    // Continuation bytes (0b10xxxxxx) are exactly the signed int8 values
    // less than -64, which is what input.lt(-65 + 1) tests for.
    simdutf_really_inline size_t count_code_points(const char* in, size_t size) {
        size_t pos = 0;
        size_t count = 0;
        for(;pos + 64 <= size; pos += 64) {
          simd8x64<int8_t> input(reinterpret_cast<const int8_t *>(in + pos));
          uint64_t utf8_continuation_mask = input.lt(-65 + 1);
          count += 64 - count_ones(utf8_continuation_mask);
        }
        // The remaining tail (< 64 bytes) is handled with scalar code.
        return count + scalar::utf8::count_code_points(in + pos, size - pos);
    }

I'm confused by a few things - but mostly, why aren't they pulling off the first unaligned section of bytes?

I also tried swapping in a naive scalar count, to see what the compiler and the benchmarks make of it:

    #[inline(always)]
    fn count_chars_naive(text: &[u8]) -> usize {
        let mut inv_count = 0;

        // Count UTF-8 continuation bytes; chars = total bytes minus those.
        for byte in text.iter() {
            inv_count += ((byte & 0xC0) == 0x80) as usize;
        }
        text.len() - inv_count
    }

Then performance is essentially unchanged, at 42.2ns. But maybe we shouldn't be too surprised at that - it looks like LLVM is generating some pretty fancy code itself to try and optimize this. Here's the code on godbolt.

So if we want to make this code any faster, it might also be worth pulling in some different benchmark data - like maybe the stuff simdutf uses, which is the text of the "Mars" Wikipedia page in a bunch of languages.

Alternatively, if LLVM is going to go to so much work optimizing this code anyway, I might be able to just use the naive implementation in jumprope and leave vectorization to the compiler.

cessen commented 2 years ago

It's using some AVX instructions which are still only available in unsafe Rust

SSE is also unsafe--basically all the SIMD intrinsics are. I actually implemented an AVX version in str_utils as well, but decided to leave it out for basically two reasons:

  1. Using AVX instructions causes the CPU clock to throttle on most (all?) processors. It's worth it if you're really crunching a lot of numbers for a sustained period of time (e.g. in a 3D renderer, video encoding, etc.) because the increased parallelism beats out the lost clock speed. But it's less clear whether AVX is a win for something like Ropey, which is intended to be part of a larger application (a text editor) that may want the clock speed for other things. It still might be a win, but it didn't seem worth it to me to pursue, especially since more significant performance gains can probably be had in other ways anyway. (And your work on jumprope is good evidence of that, ha ha.)
  2. By sticking to SSE2 and lower, Ropey doesn't have to do any fancy conditional CPU feature detection stuff on x86_64 architectures (the kind of runtime dispatch sketched below). I like that simplicity, as well as avoiding the conditional branches.
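
For reference, the kind of runtime dispatch I mean looks roughly like this (an illustrative sketch with hypothetical function names, not actual Ropey code):

    // Hypothetical sketch of runtime CPU feature detection; the two counting
    // functions are scalar stand-ins, not real str_indices APIs.
    fn count_chars_avx2(text: &str) -> usize {
        text.chars().count() // placeholder for a real AVX2 implementation
    }

    fn count_chars_sse2(text: &str) -> usize {
        text.chars().count() // placeholder for a real SSE2 implementation
    }

    #[cfg(target_arch = "x86_64")]
    fn count_chars(text: &str) -> usize {
        // std caches the CPUID result, but this is still a branch per call.
        if is_x86_feature_detected!("avx2") {
            count_chars_avx2(text)
        } else {
            count_chars_sse2(text)
        }
    }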

I'm confused by a few things - but mostly, why aren't they pulling off the first unaligned section of bytes?

Hmm. My guess is that they're taking advantage of unaligned reads into the SIMD registers (which I think is a thing?). We could potentially do that as well, but it would require reading from raw pointers, I think...? Which would be more unsafe code. Ideally I'd like to minimize unsafe code unless it really pulls its weight performance-wise.
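
Something like this is what I have in mind (a sketch, assuming unaligned loads are indeed what simdutf relies on):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__m128i, _mm_loadu_si128};

    // Unaligned 16-byte load straight into an SSE register. Unlike
    // _mm_load_si128, _mm_loadu_si128 has no alignment requirement, but it
    // does mean reading through a raw pointer in unsafe code.
    #[cfg(target_arch = "x86_64")]
    unsafe fn load_chunk_unaligned(bytes: &[u8]) -> __m128i {
        debug_assert!(bytes.len() >= 16);
        _mm_loadu_si128(bytes.as_ptr() as *const __m128i)
    }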

So if we want to make this code any faster, it might also be worth pulling in some different benchmark data - like maybe the stuff simdutf uses, which is the text of the "Mars" Wikipedia page in a bunch of languages.

Yeah, after splitting str_utils into a separate crate, setting up a bunch more benchmarks is a great idea. I did something similar in a PR to the unicode-segmentation crate.

josephg commented 2 years ago

It's using some AVX instructions which are still only available in unsafe Rust

Sorry; it's using some AVX instructions which are only available in unstable/nightly Rust.

And yeah I understand. I think the other reason it doesn't necessarily make sense to use 256/512 bit registers is that the strings I'm scanning are almost never that long anyway. In jumprope the sweet spot for performance seems to be around 300-400 characters per leaf node. Given the average leaf will be half that length (200) then split in 2 (100) by the gap buffer, you can see how it wouldn't necessarily get much benefit from scanning 64 characters per iteration of the loop.

Pulling it all out into a separate library sounds great.

If you go down that road, I'm very curious what adding gap buffers to ropey's leaf nodes would do to performance.

cessen commented 2 years ago

Sorry; it's using some AVX instructions which are only available in unstable/nightly Rust.

Ah, got it. I think the 256-bit intrinsics are stable, though. Or at least, a lot of them. But yeah, the 512-bit intrinsics are certainly still unstable.

you can see how it wouldn't necessarily get much benefit from scanning 64 characters per iteration of the loop.

Yeah, agreed. This is one of those situations where the "fastest possible implementation" in the general case doesn't necessarily turn out to be the best for a more specific use case. It's part of the reason I didn't just pull in e.g. the bytecount crate for Ropey (that, and it wouldn't have been able to handle all the unicode line endings anyway).

If you go down that road, I'm very curious what adding gap buffers to ropey's leaf nodes would do to performance.

Someone made an issue (#45) about that a while ago. And I was reminded of that when looking at your Rust implementation of jumprope. Based on the performance you've gotten out of it, I may revisit the idea. But I do think the current implementation of Ropey has already spent its complexity budget (so to speak). So exploring that will probably need to wait until a hypothetical Ropey 2.0 in the future.

cessen commented 2 years ago

str_utils has now been split out into a separate crate: https://crates.io/crates/str_indices

I only did very minimal clean up, so it's still pretty messy as a stand-alone crate. But should be good enough for now.

cessen commented 2 years ago

I've now cleaned up the code and API in the new crate, and added basic documentation.

@josephg @archseer I've also invited both of you as owners of the crate, for better bus factor.

cessen commented 2 years ago

Closing in favor of https://github.com/cessen/str_indices/issues/2

josephg commented 2 years ago

Thanks for splitting that code out! Much appreciated :heart:

cessen commented 2 years ago

You're welcome! Incidentally, in your commit message when switching jumprope-rs over you said:

Swapped from inlined str tools to str_indices. Small perf improvement for some reason

Just to clear up the mystery: it's because in addition to splitting it out, I also ended up doing some additional perf work on it. So str_indices actually is faster than what was in Ropey before.

josephg commented 2 years ago

Ah awesome! It's another ~4% or so perf boost. I was worried the opposite would happen, since the compiler has a harder time optimizing across crate boundaries. Nice work :D

cessen commented 2 years ago

@josephg In jumprope-rs's fast_str_tools.rs source file you're using some simpler alternatives to the str_indices functions for some of your functions, with comments saying that they're faster. Have you checked if that's still the case?

I tested your functions out in the str_indices benchmarks, and they were consistently outperformed there (with the exception of a couple cases where the compiler had obviously managed to "optimize" them down to constants). But that doesn't necessarily translate to usage in a more complex code base, so I'm curious.

josephg commented 2 years ago

I'm just running my benchmarks again now and yeah, I'm seeing a few % better performance using the newer version of str_indices::utf16::count_surrogates.

But your str_indices::chars::count() method loses me 2-10% performance across the board compared to the compiler's version. No idea why. The compiler's version of this method is mega fancy though, which probably helps. Or take a look at count_utf16_surrogates_in_bytes.

I also don't have a lot of confidence in my benchmarks at the moment because I don't have any real-world example data sets which aren't English. So all my real testing data is full of ascii characters and not much else. And I'm skipping a lot of calls to these methods when you're just using ASCII. Maybe it would be worth adding another test where I replace all the characters in my dataset with japanese characters and see what happens.

Another interesting thing I noticed: the naive versions of those methods were consistently faster in my tests compared to the non-SIMD versions of the same functions you have in str_indices. And by a lot - they were like 2-3x faster. And all else being equal, if we can lean on autovectorization, we should, since it should output better wasm code. (Though I haven't (yet) tried to run criterion benchmarks on wasm output.)

I'm curious - try running your benchmarks again with the SIMD versions disabled and see how performance like that compares to my naive implementations of those methods. If you see the same thing I do (which is, naive code outperforming the <usize> variants) then it might be worth using naive versions of those methods in str_indices instead when x86 SIMD isn't available.

archseer commented 2 years ago

chars::count: This recently came up in https://github.com/cessen/str_indices/issues/8, nightly has an optimized implementation now: https://github.com/rust-lang/rust/pull/90414

Based on the results though, the new version of str_indices is supposedly faster (on M1 at least)

josephg commented 2 years ago

Huh... I'll try rerunning the same experiment on my M1 laptop. I've been testing everything else on a Ryzen 5800 and I bet the results are super different. It certainly is with wasm - LLVM will happily vectorize the naive version of chars::count in wasm.

I think the difference is string length. A lot of my calls to chars::count are when I'm measuring inserted text. The traces I'm using usually insert very short strings (one of my test cases only has single character insertions). I think the naive code is faster in that case. When I buffer / aggregate changes before calling insert(), the str_indices' simd versions are better.

josephg commented 2 years ago

... And if it's relevant, I compile everything with RUSTFLAGS="-C target-cpu=native"

cessen commented 2 years ago

Okay, so after some more investigation, it looks like this is indeed the issue:

The traces I'm using usually insert very short strings (one of my test cases only has single character insertions).

For something like a text editor, single characters are a super common case. So I've now added that to str_indices's benchmarks as well. Here are the results (with the "jumprope" variant being the simple version you had in jumprope-rs, and the 0001 etc. being the length of the test string in bytes):

chars_english_0001/count:           [2.9808 ns 2.9914 ns 3.0008 ns]
chars_english_0001/count_jumprope:  [1.3983 ns 1.4038 ns 1.4091 ns]

chars_english_0010/count:           [7.6672 ns 7.6965 ns 7.7249 ns]
chars_english_0010/count_jumprope:  [7.8138 ns 7.8402 ns 7.8694 ns]

chars_english_0100/count:           [10.280 ns 10.321 ns 10.361 ns]
chars_english_0100/count_jumprope:  [18.417 ns 18.477 ns 18.544 ns]

chars_english_1000/count:           [30.181 ns 30.305 ns 30.433 ns]
chars_english_1000/count_jumprope:  [152.51 ns 152.57 ns 152.62 ns]

chars_japanese_0003/count:          [4.2629 ns 4.2799 ns 4.2968 ns]
chars_japanese_0003/count_jumprope: [3.6291 ns 3.6449 ns 3.6598 ns]

chars_japanese_0100/count:          [11.651 ns 11.699 ns 11.743 ns]
chars_japanese_0100/count_jumprope: [19.827 ns 19.903 ns 19.982 ns]

chars_japanese_1000/count:          [30.775 ns 30.890 ns 30.999 ns]
chars_japanese_1000/count_jumprope: [153.82 ns 154.27 ns 154.79 ns]

For single characters (en 0001 and jp 0003) the simpler code path wins out. But as the strings get longer, the simd variant overtakes it in performance.

I noticed that you're now special-casing single-byte strings in jumprope-rs, so I presume you figured this out as well. Following your lead, and special-casing single-byte strings, as well as going a bit further and also special-casing strings shorter than SIMD chunks, the new results are:

chars_english_0001/count:           [307.45 ps 308.60 ps 309.79 ps]
chars_english_0010/count:           [6.2815 ns 6.3065 ns 6.3310 ns]
chars_english_0100/count:           [9.8208 ns 9.8456 ns 9.8660 ns]
chars_english_1000/count:           [30.818 ns 30.951 ns 31.138 ns]
chars_japanese_0003/count:          [2.9778 ns 2.9914 ns 3.0055 ns]
chars_japanese_0100/count:          [11.102 ns 11.145 ns 11.185 ns]
chars_japanese_1000/count:          [30.956 ns 31.094 ns 31.224 ns]

So that's a pretty obvious win!
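
The rough shape of the special-casing looks like this (an illustrative sketch, not the exact code I pushed; count_simd is a stand-in for the real SIMD path):

    #[inline]
    pub fn count(text: &str) -> usize {
        let bytes = text.as_bytes();
        if bytes.len() < 16 {
            // Shorter than one SSE chunk: the plain scalar loop wins here.
            // Char count = bytes that aren't UTF-8 continuation bytes.
            bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
        } else {
            count_simd(bytes)
        }
    }

    // Scalar stand-in for the real chunked SIMD path.
    fn count_simd(bytes: &[u8]) -> usize {
        bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }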

I suspect similar optimizations would benefit the other functions as well, so I'll work on that and then push these changes. I'll post here again when that's done, so you can test.

cessen commented 2 years ago

@josephg I've pushed the aforementioned optimizations to str_indices master. If you wouldn't mind testing your benchmarks with that to see if it narrows or eliminates the performance gap with your current in-tree implementation, that would be great. Thanks!

josephg commented 2 years ago

Awesome - will do! Yeah, adding that simple case for 0 & 1 byte strings and then using the SIMD version for 2+ bytes improved benchmark performance across the board. Well, at least on my M1. I need an easier way to test across both my computers.

I’ll have a play with the new str_indices code on Monday. Thanks for the tweak!

cessen commented 2 years ago

Great! Looking forward to hearing back.

Also, thanks for helping out with this, and for the idea of specializing for short strings.

josephg commented 2 years ago

I don't know how much performance here really matters in practice, but I find it super fun making performance numbers go up. It's fun to have someone else to bounce these ideas off :D

josephg commented 2 years ago

So my old code was this:

// str_indices = "0.3.1"

#[inline]
pub(crate) fn count_chars_in_bytes(text: &[u8]) -> usize {
    if text.len() <= 1 { text.len() }
    else { unsafe { str_indices::chars::count(std::str::from_utf8_unchecked(text)) } }
}

New code:

    // str_indices = { git = "https://github.com/cessen/str_indices" }

    #[inline]
    pub(crate) fn count_chars_in_bytes(text: &[u8]) -> usize {
        unsafe { str_indices::chars::count(std::str::from_utf8_unchecked(text)) }
    }

And performance went down by ~5% in the normal tests:

[benchmark results screenshot]

... And it looks like I get that performance regression just moving from 0.3.1 to master regardless of my code.

If you want, you can run my benchmarks yourself. Just check out jumprope and run:

    RUSTFLAGS='-C target-cpu=native' cargo criterion --no-run && sleep 10 && RUSTFLAGS='-C target-cpu=native' taskset 0x1 nice -10 cargo criterion -- --measurement-time=10

(Take out the taskset call on macOS.)

cessen commented 2 years ago

Transferred the issue to str_indices since it's no longer directly relevant to Ropey.

And re-opening to track progress on this.

cessen commented 2 years ago

@josephg I did benchmarking on my system using the command you gave, but I got essentially the opposite result that you did: using the str_indices master-branch functions directly gives a notable performance boost.

Here's the current jumprope-rs master:

testdata/direct/automerge-paper                                                                             
                        time:   [9.8263 ms 9.8480 ms 9.8699 ms]
                        thrpt:  [26.320 Melem/s 26.379 Melem/s 26.437 Melem/s]
testdata/buffered/automerge-paper                                                                            
                        time:   [929.02 us 929.56 us 930.19 us]
                        thrpt:  [279.28 Melem/s 279.46 Melem/s 279.63 Melem/s]

Benchmarking testdata/direct/rustcode: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.7s, enable flat sampling, or reduce sample count to 60.
testdata/direct/rustcode                                                                             
                        time:   [2.1165 ms 2.1183 ms 2.1200 ms]
                        thrpt:  [462.20 Melem/s 462.58 Melem/s 462.97 Melem/s]
testdata/buffered/rustcode                                                                            
                        time:   [1.1869 ms 1.1875 ms 1.1880 ms]
                        thrpt:  [824.78 Melem/s 825.16 Melem/s 825.56 Melem/s]

testdata/direct/sveltecomponent                                                                            
                        time:   [822.92 us 823.99 us 825.54 us]
                        thrpt:  [205.34 Melem/s 205.73 Melem/s 205.99 Melem/s]
testdata/buffered/sveltecomponent                                                                            
                        time:   [481.49 us 481.85 us 482.41 us]
                        thrpt:  [351.40 Melem/s 351.80 Melem/s 352.07 Melem/s]

testdata/direct/seph-blog1                                                                             
                        time:   [5.8280 ms 5.8386 ms 5.8495 ms]
                        thrpt:  [62.948 Melem/s 63.066 Melem/s 63.181 Melem/s]
testdata/buffered/seph-blog1                                                                            
                        time:   [1.4044 ms 1.4051 ms 1.4058 ms]
                        thrpt:  [261.93 Melem/s 262.07 Melem/s 262.19 Melem/s]

And here's using str_indices directly (my diff):

testdata/direct/automerge-paper                                                                             
                        time:   [8.8974 ms 8.9312 ms 8.9675 ms]
                        thrpt:  [28.969 Melem/s 29.087 Melem/s 29.197 Melem/s]
                 change:
                        time:   [-9.7231% -9.3096% -8.8700%] (p = 0.00 < 0.05)
                        thrpt:  [+9.7334% +10.265% +10.770%]
                        Performance has improved.
testdata/buffered/automerge-paper                                                                            
                        time:   [891.94 us 892.97 us 894.05 us]
                        thrpt:  [290.56 Melem/s 290.91 Melem/s 291.25 Melem/s]
                 change:
                        time:   [-3.8015% -3.6849% -3.5660%] (p = 0.00 < 0.05)
                        thrpt:  [+3.6978% +3.8259% +3.9518%]
                        Performance has improved.

testdata/direct/rustcode                                                                            
                        time:   [1.9402 ms 1.9472 ms 1.9544 ms]
                        thrpt:  [501.37 Melem/s 503.23 Melem/s 505.02 Melem/s]
                 change:
                        time:   [-8.0181% -7.6477% -7.2564%] (p = 0.00 < 0.05)
                        thrpt:  [+7.8242% +8.2810% +8.7171%]
                        Performance has improved.
testdata/buffered/rustcode                                                                            
                        time:   [1.1383 ms 1.1388 ms 1.1393 ms]
                        thrpt:  [860.07 Melem/s 860.44 Melem/s 860.82 Melem/s]
                 change:
                        time:   [-4.1144% -4.0377% -3.9584%] (p = 0.00 < 0.05)
                        thrpt:  [+4.1216% +4.2076% +4.2909%]
                        Performance has improved.

testdata/direct/sveltecomponent                                                                            
                        time:   [746.23 us 746.79 us 747.37 us]
                        thrpt:  [226.82 Melem/s 227.00 Melem/s 227.16 Melem/s]
                 change:
                        time:   [-9.4873% -9.3365% -9.2025%] (p = 0.00 < 0.05)
                        thrpt:  [+10.135% +10.298% +10.482%]
                        Performance has improved.
testdata/buffered/sveltecomponent                                                                            
                        time:   [460.76 us 460.92 us 461.09 us]
                        thrpt:  [367.65 Melem/s 367.78 Melem/s 367.91 Melem/s]
                 change:
                        time:   [-4.4930% -4.3499% -4.1747%] (p = 0.00 < 0.05)
                        thrpt:  [+4.3566% +4.5477% +4.7044%]
                        Performance has improved.

testdata/direct/seph-blog1                                                                             
                        time:   [5.2779 ms 5.2894 ms 5.3010 ms]
                        thrpt:  [69.462 Melem/s 69.615 Melem/s 69.766 Melem/s]
                 change:
                        time:   [-9.6640% -9.4070% -9.1410%] (p = 0.00 < 0.05)
                        thrpt:  [+10.061% +10.384% +10.698%]
                        Performance has improved.
testdata/buffered/seph-blog1                                                                            
                        time:   [1.3486 ms 1.3502 ms 1.3520 ms]
                        thrpt:  [272.35 Melem/s 272.71 Melem/s 273.03 Melem/s]
                 change:
                        time:   [-3.9445% -3.8368% -3.7281%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8725% +3.9899% +4.1065%]
                        Performance has improved.

This is on an AMD Threadripper 3960X system on Linux (NixOS), using the exact same benchmarking command you gave.

This is a bit unexpected since I basically just copy-pasted your optimization into str_indices, so I assumed at best it would just break even with your code. But perhaps by using the functions directly, rather than wrapping them like you were doing, the compiler is able to discover some additional optimizations. Dunno.

I'm curious how my diff performs on your system.

cessen commented 2 years ago

Hmm. So the mystery deepens. I reran the benchmarks on my system doing nothing except using str_indices master (no changes to jumprope-rs's code), and I got essentially identical performance improvements.

josephg commented 2 years ago

I think you need to be a bit careful swapping between master and 0.3.1 in cargo, because the Cargo.lock file can lock in / cache the version from git and keep using it even after you switch back to the crates.io version. I had to shake it a bit to make it use the crates.io version again.

It could also be one of these situations where subtle binary ordering misaligns jumps or something; and there’s no difference between the versions except magic compiler nonsense. And the optimizer can’t run some of its passes across the crate boundary because each crate is compiled separately.

cessen commented 2 years ago

I think you need to be a bit careful swapping between master and 0.3.1 in cargo, because the Cargo.lock file can lock in / cache the version from git and keep using it even after you switch back to the crates.io version.

I don't think that's what's going on here. I get the same perf difference even if I do cargo clean and rm Cargo.lock between benchmark runs.

It could also be one of these situations where subtle binary ordering misaligns jumps or something.

That's possible. I'm not sure how to investigate that, though.

And the optimizer can’t run some of its passes across the crate boundary because each crate is compiled separately.

I don't think that's a factor here. All the public functions in str_indices are #[inline], which lets rustc inline across crate boundaries, even in standard non-LTO builds. And all of the private functions are #[inline(always)], so they should always get inlined into the public functions (unless something really funky is going on). And on top of that, your release build for jumprope-rs has LTO enabled.
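
Concretely, the pattern looks like this (simplified, with a scalar stand-in body):

    // Public entry points are #[inline] so rustc can inline them across
    // crates even without LTO; private helpers are #[inline(always)].
    #[inline]
    pub fn count(text: &str) -> usize {
        count_inner(text.as_bytes())
    }

    #[inline(always)]
    fn count_inner(bytes: &[u8]) -> usize {
        // Stand-in body; the real function dispatches to the SIMD code.
        bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }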

I also want to emphasize that in my runs there doesn't seem to be any perf difference between jumprope-rs master + str_indices master vs jumprope-rs with the diff I linked to:

testdata/direct/automerge-paper                                                                             
                        time:   [8.9489 ms 8.9736 ms 8.9987 ms]
                        thrpt:  [28.868 Melem/s 28.949 Melem/s 29.029 Melem/s]
                 change:
                        time:   [-0.4514% -0.1052% +0.2688%] (p = 0.58 > 0.05)
                        thrpt:  [-0.2680% +0.1054% +0.4535%]
                        No change in performance detected.
testdata/buffered/automerge-paper                                                                            
                        time:   [888.86 us 889.21 us 889.58 us]
                        thrpt:  [292.02 Melem/s 292.14 Melem/s 292.26 Melem/s]
                 change:
                        time:   [-0.7300% -0.6488% -0.5676%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5708% +0.6530% +0.7354%]
                        Change within noise threshold.

testdata/direct/rustcode                                                                            
                        time:   [1.9248 ms 1.9265 ms 1.9282 ms]
                        thrpt:  [508.18 Melem/s 508.63 Melem/s 509.07 Melem/s]
                 change:
                        time:   [-0.0168% +0.1911% +0.3882%] (p = 0.07 > 0.05)
                        thrpt:  [-0.3867% -0.1907% +0.0168%]
                        No change in performance detected.
testdata/buffered/rustcode                                                                            
                        time:   [1.1388 ms 1.1393 ms 1.1399 ms]
                        thrpt:  [859.63 Melem/s 860.03 Melem/s 860.40 Melem/s]
                 change:
                        time:   [+0.1183% +0.2181% +0.3155%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3145% -0.2176% -0.1182%]
                        Change within noise threshold.

testdata/direct/sveltecomponent                                                                            
                        time:   [744.97 us 745.77 us 746.66 us]
                        thrpt:  [227.03 Melem/s 227.31 Melem/s 227.55 Melem/s]
                 change:
                        time:   [-0.0365% +0.1529% +0.3374%] (p = 0.13 > 0.05)
                        thrpt:  [-0.3363% -0.1526% +0.0366%]
                        No change in performance detected.
testdata/buffered/sveltecomponent                                                                            
                        time:   [459.70 us 459.91 us 460.13 us]
                        thrpt:  [368.41 Melem/s 368.59 Melem/s 368.75 Melem/s]
                 change:
                        time:   [-0.6885% -0.5794% -0.4833%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4856% +0.5828% +0.6932%]
                        Change within noise threshold.

testdata/direct/seph-blog1                                                                             
                        time:   [5.2613 ms 5.2710 ms 5.2807 ms]
                        thrpt:  [69.728 Melem/s 69.857 Melem/s 69.986 Melem/s]
                 change:
                        time:   [-0.8092% -0.4691% -0.1331%] (p = 0.01 < 0.05)
                        thrpt:  [+0.1333% +0.4713% +0.8158%]
                        Change within noise threshold.
testdata/buffered/seph-blog1                                                                            
                        time:   [1.3469 ms 1.3474 ms 1.3480 ms]
                        thrpt:  [273.16 Melem/s 273.28 Melem/s 273.39 Melem/s]
                 change:
                        time:   [-0.1211% -0.0113% +0.0969%] (p = 0.84 > 0.05)
                        thrpt:  [-0.0968% +0.0113% +0.1213%]
                        No change in performance detected.

Which is basically what I would expect with successful inlining, since I ported the same optimization you did to str_indices.

So the situation from my perspective is that both of us are running into unexpected things:

  1. In your run, it seems odd that you're getting such a significant perf hit when swapping out your inline optimization with str_indices master (which now has the same optimization).
  2. In my run, it seems odd that I'm getting such a significant perf improvement with str_indices master without doing anything else, since the only change was more-or-less porting over your optimization (which was already in your code).

So I'm not really sure what to make of all of this yet. If it was just your case in isolation, I'd say that rustc isn't inlining the str_indices functions. And that may still be what's going on, although it's then not totally clear why that's not an issue in my runs.

Out of curiosity, what architecture and OS was your benchmark run on?

cessen commented 2 years ago

Another data point:

@archseer pointed out that we should probably also be confirming rust versions, in addition to architecture. Turns out I was still on 1.58. So I updated to 1.59 and ran the benchmarks again. Here's the comparative benchmark between jumprope-rs master and jumprope-rs with my diff, using rust 1.59:

testdata/direct/automerge-paper                                                                             
                        time:   [9.9392 ms 9.9667 ms 9.9949 ms]
                        thrpt:  [25.991 Melem/s 26.065 Melem/s 26.137 Melem/s]
                 change:
                        time:   [-4.2883% -3.9255% -3.5531%] (p = 0.00 < 0.05)
                        thrpt:  [+3.6840% +4.0859% +4.4804%]
                        Performance has improved.
testdata/buffered/automerge-paper                                                                            
                        time:   [869.45 us 870.23 us 871.02 us]
                        thrpt:  [298.25 Melem/s 298.52 Melem/s 298.78 Melem/s]
                 change:
                        time:   [-5.1343% -4.8745% -4.4811%] (p = 0.00 < 0.05)
                        thrpt:  [+4.6913% +5.1243% +5.4122%]
                        Performance has improved.

Benchmarking testdata/direct/rustcode: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase target time to 10.6s, enable flat sampling, or reduce sample count to 60.
testdata/direct/rustcode                                                                             
                        time:   [2.0995 ms 2.1034 ms 2.1080 ms]
                        thrpt:  [464.82 Melem/s 465.85 Melem/s 466.70 Melem/s]
                 change:
                        time:   [+2.0286% +2.2489% +2.4635%] (p = 0.00 < 0.05)
                        thrpt:  [-2.4043% -2.1994% -1.9883%]
                        Performance has regressed.
testdata/buffered/rustcode                                                                            
                        time:   [1.1142 ms 1.1158 ms 1.1174 ms]
                        thrpt:  [876.95 Melem/s 878.15 Melem/s 879.44 Melem/s]
                 change:
                        time:   [-3.9507% -3.8029% -3.6628%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8020% +3.9533% +4.1132%]
                        Performance has improved.

testdata/direct/sveltecomponent                                                                            
                        time:   [820.95 us 822.71 us 824.55 us]
                        thrpt:  [205.59 Melem/s 206.05 Melem/s 206.49 Melem/s]
                 change:
                        time:   [+1.7673% +2.0638% +2.4178%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3607% -2.0220% -1.7366%]
                        Performance has regressed.
testdata/buffered/sveltecomponent                                                                            
                        time:   [447.29 us 447.96 us 448.61 us]
                        thrpt:  [377.88 Melem/s 378.42 Melem/s 378.99 Melem/s]
                 change:
                        time:   [-5.3759% -5.2243% -5.0393%] (p = 0.00 < 0.05)
                        thrpt:  [+5.3068% +5.5122% +5.6814%]
                        Performance has improved.

testdata/direct/seph-blog1                                                                             
                        time:   [5.7911 ms 5.8102 ms 5.8307 ms]
                        thrpt:  [63.151 Melem/s 63.375 Melem/s 63.583 Melem/s]
                 change:
                        time:   [-1.7391% -1.3685% -0.9661%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9756% +1.3875% +1.7699%]
                        Change within noise threshold.
testdata/buffered/seph-blog1                                                                            
                        time:   [1.3144 ms 1.3164 ms 1.3181 ms]
                        thrpt:  [279.35 Melem/s 279.72 Melem/s 280.13 Melem/s]
                 change:
                        time:   [-3.8583% -3.6996% -3.5186%] (p = 0.00 < 0.05)
                        thrpt:  [+3.6469% +3.8417% +4.0132%]
                        Performance has improved.

Still not the same as yours, @josephg, but it's nevertheless notably different from 1.58, and it does cause some regressions between jumprope-rs master and my diff. It's also interesting to compare it to the timings of the 1.58 benchmarks, because it looks like 1.59 is notably worse at optimizing in this case.

cessen commented 2 years ago

Also benched on nightly, and it's worse than 1.58 across the board.

cessen commented 2 years ago

I've been playing around with this more. When I mark str_indices's functions as #[inline(always)], forcing everything to get inlined, the performance improves to be more-or-less the same as my earlier Rust 1.58 results, even on nightly.

So my best guess at this point (emphasis on guess) is that the size of str_indices::chars::count() is teetering on the edge of what the compiler considers inlineable, so whether it gets inlined or not varies between compiler versions.

If that's the case, then what's going on is that when the early-out optimization is in jumprope-rs's wrapper function, that tiny wrapper function gets inlined and so does the early-out optimization. But when that same optimization is directly in str_indices::chars::count(), then the optimization getting inlined depends on whether the compiler decides to inline that whole function or not.

With that theory in mind, I tried splitting the count_inner() function into two parts:

  1. The part with the early-out optimization.
  2. The part with the heavy-weight SIMD code path, marked as #[inline(never)] to prevent it from bloating the calling function (sketched below).
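
In sketch form (hypothetical simplified code, not the exact contents of the branch):

    #[inline(always)]
    fn count_inner(bytes: &[u8]) -> usize {
        // Part 1: the short-string early-out, small enough that callers
        // reliably get it inlined.
        if bytes.len() < 16 {
            return bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count();
        }
        count_inner_long(bytes)
    }

    // Part 2: the heavy SIMD path, forced out-of-line so it can't bloat
    // the calling function. (Scalar stand-in body.)
    #[inline(never)]
    fn count_inner_long(bytes: &[u8]) -> usize {
        bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }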

@josephg I've put that experiment in the encourage_small_inline branch. On my machine, it results in jumprope-rs benchmarking results on nightly that are comparable to forcing everything to be inlined. So if you could test that branch out on your machine as well, I'd be very curious about the results.

However, that branch also results in notably worse benchmarking results for longer (100+ byte) strings in str_indices's own benchmark suite. So I don't think this is actually a good way forward.

I'm beginning to think that it may be best to just leave this optimization out of str_indices itself. Then client code can implement it when it's important, as you currently are in jumprope-rs. Of course, I like the idea of str_indices itself being optimal here, but that may not be possible at the moment in a way that actually gives maximum benefit to the client.

josephg commented 2 years ago

Oh interesting! I didn't expect that, and probably wouldn't have tried it. That makes a lot of sense to me. Thanks for investigating!

And yeah, that makes sense. I'm not actually convinced inlining the char count methods is always appropriate. They impact performance, but so do a lot of other things in the codebase! In jumprope I mark if a leaf only has ASCII characters anyway, and skip calling these methods entirely if that's the case. And in wasm projects, code size does matter a lot.

Anyway, I'm happy to leave this in the "fuzzy compiler choices" territory. A 4% performance difference isn't crazy (especially compared to adding that wrapper around the library for buffered writes). And jumprope currently doesn't work properly under Miri, which isn't a huge problem but it's annoying. I think fixing that should take priority.

Oh and if you want a library like this for text editors, it might also make sense to add an index by lines like you have in ropey. Lots of more important things to play around with!

cessen commented 2 years ago

Okay, sounds good!

I decided to make a compromise and remove the single-byte-string optimization, but not the more general short-string optimization. It still gives a nice 2x performance boost for very short strings, and is something I think I can generalize more easily to the other functions in str_indices as well, so that it becomes more of a general pattern in the code base.

In any case, please don't hesitate to file issues here if you run into any other performance hiccups, or optimizations you think might make sense to incorporate into str_indices. :-)

Edit: oh! And of course, thanks so much for your help on this!

Oh and if you want a library like this for text editors, it might also make sense to add an index by lines like you have in ropey.

Do you mean jumprope or str_indices? (str_indices already has line indexing functions, so I assume you mean jumprope?)

cessen commented 2 years ago

Also, just published v0.3.2 with the aforementioned changes.

josephg commented 2 years ago

Awesome sounds good!

And yeah I mean in jumprope.