josephg closed this issue 2 years ago.
Unrelated but have you seen https://github.com/josephg/jumprope-rs/issues/1 (splitting str_utils into a new crate)?
Ah! Thanks for the tip! I'm certainly no expert in SIMD optimization either, but the change looks correct to me.
If you want to submit a PR for it, I'll happily merge it. Otherwise I'll make the change myself when I get around to it.
(Also, I second @archseer's poke to please look at the linked issue on jumprope. :-) )
Hmmm, I've just been playing around - and I'm gonna put some notes here for reference later.
This is simdutf's `count_code_points` (their equivalent of `count_chars_in_bytes`):

```cpp
simdutf_really_inline size_t count_code_points(const char* in, size_t size) {
  size_t pos = 0;
  size_t count = 0;
  for (; pos + 64 <= size; pos += 64) {
    simd8x64<int8_t> input(reinterpret_cast<const int8_t *>(in + pos));
    uint64_t utf8_continuation_mask = input.lt(-65 + 1);
    count += 64 - count_ones(utf8_continuation_mask);
  }
  return count + scalar::utf8::count_code_points(in + pos, size - pos);
}
```
I'm confused by a few things - but mostly, why aren't they pulling off the first unaligned section of bytes?
I've been experimenting with `count_chars_in_bytes` to see what can be done. Tweaking the inner loop seems to make almost no difference to the criterion benchmark result, which is weird! I suspect that with this data set / benchmark data we're running into memory bandwidth bottlenecks or something like that. The current code runs the benchmark in 41.3 ns. If I swap the complex SIMD implementation with this:

```rust
#[inline(always)]
fn count_chars_naive(text: &[u8]) -> usize {
    let mut inv_count = 0;
    for byte in text.iter() {
        inv_count += ((byte & 0xC0) == 0x80) as usize;
    }
    text.len() - inv_count
}
```

then performance is essentially unchanged, at 42.2 ns. But maybe we shouldn't be too surprised at that: it looks like LLVM is generating some pretty fancy code itself to try and optimize this. Here's the code on godbolt.
So if we want to make this code any faster, it might also be worth pulling in some different benchmark data - like, maybe the stuff simdutf uses which is the contents of the "mars" wikipedia page text in a bunch of languages.
Alternately, if LLVM is going to go to so much work optimizing this code anyway, I might be able to just use the naive implementation in jumprope and leave vectorization to the compiler.
> It's using some AVX instructions which are still only available in unsafe rust
SSE is also unsafe; basically all the SIMD intrinsics are. I actually implemented an AVX version in `str_utils` as well, but decided to leave it out for basically two reasons:
> I'm confused by a few things - but mostly, why aren't they pulling off the first unaligned section of bytes?
Hmm. My guess is that they're taking advantage of unaligned reads into the SIMD registers (which I think is a thing?). We could potentially do that as well, but it would require reading from raw pointers, I think...? Which would be more unsafe code. Ideally I'd like to minimize unsafe code unless it really pulls its weight performance-wise.
> So if we want to make this code any faster, it might also be worth pulling in some different benchmark data - like, maybe the stuff simdutf uses which is the contents of the "mars" wikipedia page text in a bunch of languages.
Yeah, after splitting `str_utils` into a separate crate, setting up a bunch more benchmarks is a great idea. I did something similar in a PR to the `unicode-segmentation` crate.
> It's using some AVX instructions which are still only available in unsafe rust
Sorry; it's using some AVX instructions which are only available in unstable / nightly Rust.
And yeah I understand. I think the other reason it doesn't necessarily make sense to use 256/512 bit registers is that the strings I'm scanning are almost never that long anyway. In jumprope the sweet spot for performance seems to be around 300-400 characters per leaf node. Given the average leaf will be half that length (200) then split in 2 (100) by the gap buffer, you can see how it wouldn't necessarily get much benefit from scanning 64 characters per iteration of the loop.
Pulling it all out into a separate library sounds great.
If you go down that road, I'm very curious what adding gap buffers to ropey's leaf nodes would do to performance.
> Sorry; it's using some AVX instructions which are only available in unstable / nightly Rust.
Ah, got it. I think the 256-bit intrinsics are stable, though. Or at least, a lot of them. But yeah, the 512-bit intrinsics are certainly still unstable.
> you can see how it wouldn't necessarily get much benefit from scanning 64 characters per iteration of the loop.
Yeah, agreed. This is one of those situations where the "fastest possible implementation" in the general case doesn't necessarily turn out to be the best for a more specific use case. It's part of the reason I didn't just pull in e.g. the `bytecount` crate for Ropey (that, and it wouldn't have been able to handle all the unicode line endings anyway).
> If you go down that road, I'm very curious what gap buffers to ropey's leaf nodes would do to performance.
Someone made an issue (#45) about that a while ago. And I was reminded of that when looking at your Rust implementation of jumprope. Based on the performance you've gotten out of it, I may revisit the idea. But I do think the current implementation of Ropey has already spent its complexity budget (so to speak). So exploring that will probably need to wait until a hypothetical Ropey 2.0 in the future.
`str_utils` has now been split out into a separate crate: https://crates.io/crates/str_indices
I only did very minimal clean up, so it's still pretty messy as a stand-alone crate. But should be good enough for now.
I've now cleaned up the code and API in the new crate, and added basic documentation.
@josephg @archseer I've also invited both of you as owners of the crate, for better bus factor.
Closing in favor of https://github.com/cessen/str_indices/issues/2
Thanks for splitting that code out! Much appreciated :heart:
You're welcome! Incidentally, in your commit message when switching jumprope-rs over you said:
> Swapped from inlined str tools to str_indices. Small perf improvement for some reason
Just to clear up the mystery: it's because in addition to splitting it out, I also ended up doing some additional perf work on it. So `str_indices` actually is faster than what was in Ropey before.
Ah awesome! It's another ~4% or so perf boost. I was worried the opposite would happen, since the compiler has a harder time optimizing across crate boundaries. Nice work :D
@josephg
In jumprope-rs's `fast_str_tools.rs` source file you're using some simpler alternatives to the `str_indices` functions for some of your functions, with comments saying that they're faster. Have you checked whether that's still the case?
I tested your functions out in the `str_indices` benchmarks, and they were consistently outperformed there (with the exception of a couple of cases where the compiler had obviously managed to "optimize" them down to constants). But that doesn't necessarily translate to usage in a more complex code base, so I'm curious.
I'm just running my benchmarks again now and yeah, I'm seeing a few percent better performance using the newer version of `str_indices::utf16::count_surrogates`.
But your `str_indices::chars::count()` method loses me 2-10% performance across the board compared to the compiler's version. No idea why. The compiler's version of this method is mega fancy though, which probably helps. Or take a look at `count_utf16_surrogates_in_bytes`.
I also don't have a lot of confidence in my benchmarks at the moment because I don't have any real-world example data sets which aren't English. So all my real testing data is full of ASCII characters and not much else. And I'm skipping a lot of calls to these methods when you're just using ASCII. Maybe it would be worth adding another test where I replace all the characters in my dataset with Japanese characters and see what happens.
Another interesting thing I noticed: The naive versions of those methods were consistently faster in my tests compared to the non-SIMD versions of the same functions you have in str_indices. And by a lot - they were like 2-3x faster. And all else being equal, if we can lean on autovectorization, we should since it should output better wasm code. (Though I haven't (yet) tried to run criterion benchmarks on wasm output.)
I'm curious - try running your benchmarks again with the SIMD versions disabled and see how performance like that compares to my naive implementations of those methods. If you see the same thing I do (which is, naive code outperforming the `usize` variants), then it might be worth using naive versions of those methods in str_indices instead, when x86 SIMD isn't available.
`chars::count`: This recently came up in https://github.com/cessen/str_indices/issues/8; nightly has an optimized implementation now: https://github.com/rust-lang/rust/pull/90414
Based on the results though, the new version of `str_indices` is supposedly faster (on M1 at least).
Huh... I'll try rerunning the same experiment on my M1 laptop. I've been testing everything else on a Ryzen 5800 and I bet the results are super different. It certainly is with wasm - LLVM will happily vectorize the naive version of `chars::count` in wasm.
I think the difference is string length. A lot of my calls to `chars::count` are when I'm measuring inserted text. The traces I'm using usually insert very short strings (one of my test cases only has single-character insertions). I think the naive code is faster in that case. When I buffer / aggregate changes before calling `insert()`, str_indices' SIMD versions are better.
... And if it's relevant, I compile everything with `RUSTFLAGS="-C target-cpu=native"`.
Okay, so after some more investigation, it looks like this is indeed the issue:
> The traces I'm using usually insert very short strings (one of my test cases only has single character insertions).
For something like a text editor, single characters are a super common case. So I've now added that to `str_indices`' benchmarks as well. Here are the results (with the "jumprope" variant being the simple version you had in jumprope-rs, and 0001 etc. being the length of the test string in bytes):
```text
chars_english_0001/count:           [2.9808 ns 2.9914 ns 3.0008 ns]
chars_english_0001/count_jumprope:  [1.3983 ns 1.4038 ns 1.4091 ns]
chars_english_0010/count:           [7.6672 ns 7.6965 ns 7.7249 ns]
chars_english_0010/count_jumprope:  [7.8138 ns 7.8402 ns 7.8694 ns]
chars_english_0100/count:           [10.280 ns 10.321 ns 10.361 ns]
chars_english_0100/count_jumprope:  [18.417 ns 18.477 ns 18.544 ns]
chars_english_1000/count:           [30.181 ns 30.305 ns 30.433 ns]
chars_english_1000/count_jumprope:  [152.51 ns 152.57 ns 152.62 ns]
chars_japanese_0003/count:          [4.2629 ns 4.2799 ns 4.2968 ns]
chars_japanese_0003/count_jumprope: [3.6291 ns 3.6449 ns 3.6598 ns]
chars_japanese_0100/count:          [11.651 ns 11.699 ns 11.743 ns]
chars_japanese_0100/count_jumprope: [19.827 ns 19.903 ns 19.982 ns]
chars_japanese_1000/count:          [30.775 ns 30.890 ns 30.999 ns]
chars_japanese_1000/count_jumprope: [153.82 ns 154.27 ns 154.79 ns]
```
For single characters (en 0001 and jp 0003) the simpler code path wins out. But as the strings get longer, the simd variant overtakes it in performance.
I noticed that you're now special-casing single-byte strings in jumprope-rs, so I presume you figured this out as well. Following your lead, and special-casing single-byte strings, as well as going a bit further and also special-casing strings shorter than simd chunks, the new results are:
```text
chars_english_0001/count:  [307.45 ps 308.60 ps 309.79 ps]
chars_english_0010/count:  [6.2815 ns 6.3065 ns 6.3310 ns]
chars_english_0100/count:  [9.8208 ns 9.8456 ns 9.8660 ns]
chars_english_1000/count:  [30.818 ns 30.951 ns 31.138 ns]
chars_japanese_0003/count: [2.9778 ns 2.9914 ns 3.0055 ns]
chars_japanese_0100/count: [11.102 ns 11.145 ns 11.185 ns]
chars_japanese_1000/count: [30.956 ns 31.094 ns 31.224 ns]
```
So that's a pretty obvious win!
I suspect similar optimizations would benefit the other functions as well, so I'll work on that and then push these changes. I'll post here again when that's done, so you can test.
@josephg I've pushed the aforementioned optimizations to str_indices master. If you wouldn't mind testing your benchmarks with that to see if it narrows or eliminates the performance gap with your current in-tree implementation, that would be great. Thanks!
Awesome - will do! Yeah, adding that special case for 0- and 1-byte strings and then using the SIMD version for 2+ bytes improved benchmark performance across the board. Well, at least on my M1. I need an easier way to test across both my computers.
I’ll have a play with the new str_indices code on Monday. Thanks for the tweak!
Great! Looking forward to hearing back.
Also, thanks for helping out with this, and for the idea of specializing for short strings.
I don't know how much performance here really matters in practice, but I find it super fun making performance numbers go up. It's fun to have someone else to bounce these ideas off :D
So my old code was this:
```rust
// str_indices = "0.3.1"
#[inline]
pub(crate) fn count_chars_in_bytes(text: &[u8]) -> usize {
    if text.len() <= 1 {
        text.len()
    } else {
        unsafe { str_indices::chars::count(std::str::from_utf8_unchecked(text)) }
    }
}
```
New code:
```rust
// str_indices = { git = "https://github.com/cessen/str_indices" }
#[inline]
pub(crate) fn count_chars_in_bytes(text: &[u8]) -> usize {
    unsafe { str_indices::chars::count(std::str::from_utf8_unchecked(text)) }
}
```
And performance went down by ~5% in the normal tests:
... And it looks like I get that performance regression just moving from 0.3.1 to master regardless of my code.
If you want, you can run my benchmarks yourself. Just check out jumprope and run:
```shell
RUSTFLAGS='-C target-cpu=native' cargo criterion --no-run && sleep 10 && \
  RUSTFLAGS='-C target-cpu=native' taskset 0x1 nice -10 cargo criterion -- --measurement-time=10
```

(Take out the `taskset` call on macOS.)
Transferred the issue to `str_indices` since it's no longer directly relevant to Ropey.
And re-opening to track progress on this.
@josephg I did benchmarking on my system using the command you gave, but I got essentially the opposite result that you did: using the `str_indices` master-branch functions directly gives a notable performance boost.
Here's the current jumprope-rs master:
```text
testdata/direct/automerge-paper
        time:   [9.8263 ms 9.8480 ms 9.8699 ms]
        thrpt:  [26.320 Melem/s 26.379 Melem/s 26.437 Melem/s]
testdata/buffered/automerge-paper
        time:   [929.02 us 929.56 us 930.19 us]
        thrpt:  [279.28 Melem/s 279.46 Melem/s 279.63 Melem/s]
Benchmarking testdata/direct/rustcode: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase
target time to 10.7s, enable flat sampling, or reduce sample count to 60.
testdata/direct/rustcode
        time:   [2.1165 ms 2.1183 ms 2.1200 ms]
        thrpt:  [462.20 Melem/s 462.58 Melem/s 462.97 Melem/s]
testdata/buffered/rustcode
        time:   [1.1869 ms 1.1875 ms 1.1880 ms]
        thrpt:  [824.78 Melem/s 825.16 Melem/s 825.56 Melem/s]
testdata/direct/sveltecomponent
        time:   [822.92 us 823.99 us 825.54 us]
        thrpt:  [205.34 Melem/s 205.73 Melem/s 205.99 Melem/s]
testdata/buffered/sveltecomponent
        time:   [481.49 us 481.85 us 482.41 us]
        thrpt:  [351.40 Melem/s 351.80 Melem/s 352.07 Melem/s]
testdata/direct/seph-blog1
        time:   [5.8280 ms 5.8386 ms 5.8495 ms]
        thrpt:  [62.948 Melem/s 63.066 Melem/s 63.181 Melem/s]
testdata/buffered/seph-blog1
        time:   [1.4044 ms 1.4051 ms 1.4058 ms]
        thrpt:  [261.93 Melem/s 262.07 Melem/s 262.19 Melem/s]
```
And here's using `str_indices` directly (my diff):
```text
testdata/direct/automerge-paper
        time:   [8.8974 ms 8.9312 ms 8.9675 ms]
        thrpt:  [28.969 Melem/s 29.087 Melem/s 29.197 Melem/s]
change: time:   [-9.7231% -9.3096% -8.8700%] (p = 0.00 < 0.05)
        thrpt:  [+9.7334% +10.265% +10.770%]
        Performance has improved.
testdata/buffered/automerge-paper
        time:   [891.94 us 892.97 us 894.05 us]
        thrpt:  [290.56 Melem/s 290.91 Melem/s 291.25 Melem/s]
change: time:   [-3.8015% -3.6849% -3.5660%] (p = 0.00 < 0.05)
        thrpt:  [+3.6978% +3.8259% +3.9518%]
        Performance has improved.
testdata/direct/rustcode
        time:   [1.9402 ms 1.9472 ms 1.9544 ms]
        thrpt:  [501.37 Melem/s 503.23 Melem/s 505.02 Melem/s]
change: time:   [-8.0181% -7.6477% -7.2564%] (p = 0.00 < 0.05)
        thrpt:  [+7.8242% +8.2810% +8.7171%]
        Performance has improved.
testdata/buffered/rustcode
        time:   [1.1383 ms 1.1388 ms 1.1393 ms]
        thrpt:  [860.07 Melem/s 860.44 Melem/s 860.82 Melem/s]
change: time:   [-4.1144% -4.0377% -3.9584%] (p = 0.00 < 0.05)
        thrpt:  [+4.1216% +4.2076% +4.2909%]
        Performance has improved.
testdata/direct/sveltecomponent
        time:   [746.23 us 746.79 us 747.37 us]
        thrpt:  [226.82 Melem/s 227.00 Melem/s 227.16 Melem/s]
change: time:   [-9.4873% -9.3365% -9.2025%] (p = 0.00 < 0.05)
        thrpt:  [+10.135% +10.298% +10.482%]
        Performance has improved.
testdata/buffered/sveltecomponent
        time:   [460.76 us 460.92 us 461.09 us]
        thrpt:  [367.65 Melem/s 367.78 Melem/s 367.91 Melem/s]
change: time:   [-4.4930% -4.3499% -4.1747%] (p = 0.00 < 0.05)
        thrpt:  [+4.3566% +4.5477% +4.7044%]
        Performance has improved.
testdata/direct/seph-blog1
        time:   [5.2779 ms 5.2894 ms 5.3010 ms]
        thrpt:  [69.462 Melem/s 69.615 Melem/s 69.766 Melem/s]
change: time:   [-9.6640% -9.4070% -9.1410%] (p = 0.00 < 0.05)
        thrpt:  [+10.061% +10.384% +10.698%]
        Performance has improved.
testdata/buffered/seph-blog1
        time:   [1.3486 ms 1.3502 ms 1.3520 ms]
        thrpt:  [272.35 Melem/s 272.71 Melem/s 273.03 Melem/s]
change: time:   [-3.9445% -3.8368% -3.7281%] (p = 0.00 < 0.05)
        thrpt:  [+3.8725% +3.9899% +4.1065%]
        Performance has improved.
```
This on an AMD Threadripper 3960X system on Linux (NixOS), using the exact same benchmarking command you gave.
This is a bit unexpected since I basically just copy-pasted your optimization into str_indices, so I assumed at best it would just break even with your code. But perhaps by using the functions directly, rather than wrapping them like you were doing, the compiler is able to discover some additional optimizations. Dunno.
I'm curious how my diff performs on your system.
Hmm. So the mystery deepens. I reran the benchmarks on my system doing nothing except using str_indices master (no changes to jumprope-rs's code), and I got essentially identical performance improvements.
I think you need to be a bit careful swapping between master and 0.3.1 in Cargo, because the Cargo.lock file can lock in / cache the version from git and keep using it even after you switch back to the crates.io version. I had to shake it a bit to make it use the crates.io version again.
It could also be one of these situations where subtle binary ordering misaligns jumps or something; and there’s no difference between the versions except magic compiler nonsense. And the optimizer can’t run some of its passes across the crate boundary because each crate is compiled separately.
> I think you need to be a bit careful swapping between master and 0.3.1 in cargo because the Cargo.lock file let it lock in / cache the version from git and use that even after you change back to the version in cargo.
I don't think that's what's going on here. I get the same perf difference even if I do `cargo clean` and `rm Cargo.lock` between benchmark runs.
> It could also be one of these situations where subtle binary ordering misaligns jumps or something.
That's possible. I'm not sure how to investigate that, though.
> And the optimizer can’t run some of its passes across the crate boundary because each crate is compiled separately.
I don't think that's a factor here. All the public functions in str_indices are `#[inline]`, which lets rustc inline across crate boundaries, even in standard non-LTO builds. And all of the private functions are `#[inline(always)]`, so they should always get inlined into the public functions (unless something really funky is going on). And on top of that, your release build for jumprope-rs has LTO enabled.
I also want to emphasize that in my runs there doesn't seem to be any perf difference between jumprope-rs master + str_indices master vs jumprope-rs with the diff I linked to:
```text
testdata/direct/automerge-paper
        time:   [8.9489 ms 8.9736 ms 8.9987 ms]
        thrpt:  [28.868 Melem/s 28.949 Melem/s 29.029 Melem/s]
change: time:   [-0.4514% -0.1052% +0.2688%] (p = 0.58 > 0.05)
        thrpt:  [-0.2680% +0.1054% +0.4535%]
        No change in performance detected.
testdata/buffered/automerge-paper
        time:   [888.86 us 889.21 us 889.58 us]
        thrpt:  [292.02 Melem/s 292.14 Melem/s 292.26 Melem/s]
change: time:   [-0.7300% -0.6488% -0.5676%] (p = 0.00 < 0.05)
        thrpt:  [+0.5708% +0.6530% +0.7354%]
        Change within noise threshold.
testdata/direct/rustcode
        time:   [1.9248 ms 1.9265 ms 1.9282 ms]
        thrpt:  [508.18 Melem/s 508.63 Melem/s 509.07 Melem/s]
change: time:   [-0.0168% +0.1911% +0.3882%] (p = 0.07 > 0.05)
        thrpt:  [-0.3867% -0.1907% +0.0168%]
        No change in performance detected.
testdata/buffered/rustcode
        time:   [1.1388 ms 1.1393 ms 1.1399 ms]
        thrpt:  [859.63 Melem/s 860.03 Melem/s 860.40 Melem/s]
change: time:   [+0.1183% +0.2181% +0.3155%] (p = 0.00 < 0.05)
        thrpt:  [-0.3145% -0.2176% -0.1182%]
        Change within noise threshold.
testdata/direct/sveltecomponent
        time:   [744.97 us 745.77 us 746.66 us]
        thrpt:  [227.03 Melem/s 227.31 Melem/s 227.55 Melem/s]
change: time:   [-0.0365% +0.1529% +0.3374%] (p = 0.13 > 0.05)
        thrpt:  [-0.3363% -0.1526% +0.0366%]
        No change in performance detected.
testdata/buffered/sveltecomponent
        time:   [459.70 us 459.91 us 460.13 us]
        thrpt:  [368.41 Melem/s 368.59 Melem/s 368.75 Melem/s]
change: time:   [-0.6885% -0.5794% -0.4833%] (p = 0.00 < 0.05)
        thrpt:  [+0.4856% +0.5828% +0.6932%]
        Change within noise threshold.
testdata/direct/seph-blog1
        time:   [5.2613 ms 5.2710 ms 5.2807 ms]
        thrpt:  [69.728 Melem/s 69.857 Melem/s 69.986 Melem/s]
change: time:   [-0.8092% -0.4691% -0.1331%] (p = 0.01 < 0.05)
        thrpt:  [+0.1333% +0.4713% +0.8158%]
        Change within noise threshold.
testdata/buffered/seph-blog1
        time:   [1.3469 ms 1.3474 ms 1.3480 ms]
        thrpt:  [273.16 Melem/s 273.28 Melem/s 273.39 Melem/s]
change: time:   [-0.1211% -0.0113% +0.0969%] (p = 0.84 > 0.05)
        thrpt:  [-0.0968% +0.0113% +0.1213%]
        No change in performance detected.
```
Which is basically what I would expect with successful inlining, since I ported the same optimization you did to str_indices.
So the situation from my perspective is that both of us are running into unexpected things:
So I'm not really sure what to make of all of this yet. If it were just your case in isolation, I'd say that rustc isn't inlining the str_indices functions. And that may still be what's going on, although it's then not totally clear why that's not an issue in my runs.
Out of curiosity, what architecture and OS was your benchmark run on?
Another data point:
@archseer pointed out that we should probably also be confirming rust versions, in addition to architecture. Turns out I was still on 1.58. So I updated to 1.59 and ran the benchmarks again. Here's the comparative benchmark between jumprope-rs master and jumprope-rs with my diff, using rust 1.59:
```text
testdata/direct/automerge-paper
        time:   [9.9392 ms 9.9667 ms 9.9949 ms]
        thrpt:  [25.991 Melem/s 26.065 Melem/s 26.137 Melem/s]
change: time:   [-4.2883% -3.9255% -3.5531%] (p = 0.00 < 0.05)
        thrpt:  [+3.6840% +4.0859% +4.4804%]
        Performance has improved.
testdata/buffered/automerge-paper
        time:   [869.45 us 870.23 us 871.02 us]
        thrpt:  [298.25 Melem/s 298.52 Melem/s 298.78 Melem/s]
change: time:   [-5.1343% -4.8745% -4.4811%] (p = 0.00 < 0.05)
        thrpt:  [+4.6913% +5.1243% +5.4122%]
        Performance has improved.
Benchmarking testdata/direct/rustcode: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 10.0s. You may wish to increase
target time to 10.6s, enable flat sampling, or reduce sample count to 60.
testdata/direct/rustcode
        time:   [2.0995 ms 2.1034 ms 2.1080 ms]
        thrpt:  [464.82 Melem/s 465.85 Melem/s 466.70 Melem/s]
change: time:   [+2.0286% +2.2489% +2.4635%] (p = 0.00 < 0.05)
        thrpt:  [-2.4043% -2.1994% -1.9883%]
        Performance has regressed.
testdata/buffered/rustcode
        time:   [1.1142 ms 1.1158 ms 1.1174 ms]
        thrpt:  [876.95 Melem/s 878.15 Melem/s 879.44 Melem/s]
change: time:   [-3.9507% -3.8029% -3.6628%] (p = 0.00 < 0.05)
        thrpt:  [+3.8020% +3.9533% +4.1132%]
        Performance has improved.
testdata/direct/sveltecomponent
        time:   [820.95 us 822.71 us 824.55 us]
        thrpt:  [205.59 Melem/s 206.05 Melem/s 206.49 Melem/s]
change: time:   [+1.7673% +2.0638% +2.4178%] (p = 0.00 < 0.05)
        thrpt:  [-2.3607% -2.0220% -1.7366%]
        Performance has regressed.
testdata/buffered/sveltecomponent
        time:   [447.29 us 447.96 us 448.61 us]
        thrpt:  [377.88 Melem/s 378.42 Melem/s 378.99 Melem/s]
change: time:   [-5.3759% -5.2243% -5.0393%] (p = 0.00 < 0.05)
        thrpt:  [+5.3068% +5.5122% +5.6814%]
        Performance has improved.
testdata/direct/seph-blog1
        time:   [5.7911 ms 5.8102 ms 5.8307 ms]
        thrpt:  [63.151 Melem/s 63.375 Melem/s 63.583 Melem/s]
change: time:   [-1.7391% -1.3685% -0.9661%] (p = 0.00 < 0.05)
        thrpt:  [+0.9756% +1.3875% +1.7699%]
        Change within noise threshold.
testdata/buffered/seph-blog1
        time:   [1.3144 ms 1.3164 ms 1.3181 ms]
        thrpt:  [279.35 Melem/s 279.72 Melem/s 280.13 Melem/s]
change: time:   [-3.8583% -3.6996% -3.5186%] (p = 0.00 < 0.05)
        thrpt:  [+3.6469% +3.8417% +4.0132%]
        Performance has improved.
```
Still not the same as yours, @josephg, but is nevertheless notably different from 1.58, and does cause some regressions between jumprope-rs master and my diff. It's also interesting to compare it to the timings of the 1.58 benchmarks, because it looks like 1.59 is notably worse at optimizing in this case.
Also benched on nightly, and it's worse than 1.58 across the board.
I've been playing around with this more. When I mark str_indices's functions as `#[inline(always)]`, forcing everything to get inlined, the performance improves to be more-or-less the same as my earlier rust 1.58 results, even on nightly.
So my best guess at this point (emphasis on guess) is that the size of `str_indices::chars::count()` is teetering on the edge of what the compiler considers inlineable, so whether it gets inlined or not varies between compiler versions.

If that's the case, then what's going on is that when the early-out optimization is in jumprope-rs's wrapper function, that tiny wrapper function gets inlined and so does the early-out optimization. But when that same optimization is directly in `str_indices::chars::count()`, then the optimization getting inlined depends on whether the compiler decides to inline that whole function or not.
With that theory in mind, I tried splitting the `count_inner()` function into two parts, with the main part marked `#[inline(never)]` to prevent it from bloating the calling function.

@josephg I've put that experiment in the `encourage_small_inline` branch. On my machine, it results in jumprope-rs benchmarking results on nightly that are comparable to forcing everything to be inlined. So if you could test that branch out on your machine as well, I'd be very curious about the results.
However, that branch also results in notably worse benchmarking results for longer (100+ byte) strings in str_indices's own benchmark suite. So I don't think this is actually a good way forward.
I'm beginning to think that it may be best to just leave this optimization out of str_indices itself. Then client code can implement it when it's important, as you currently are in jumprope-rs. Of course, I like the idea of str_indices itself being optimal here, but that may not be possible at the moment in a way that actually gives maximum benefit to the client.
Oh interesting! I didn't expect that, and probably wouldn't have tried it. That makes a lot of sense to me. Thanks for investigating!
And yeah, that makes sense. I'm not actually convinced inlining the char count methods is always appropriate. They impact performance, but so do a lot of other things in the codebase! In jumprope I mark whether a leaf only has ASCII characters anyway, and skip calling these methods entirely if that's the case. And in wasm projects, code size does matter a lot.
Anyway, I'm happy to leave this in the "fuzzy compiler choices" territory. A 4% performance difference isn't crazy (especially compared to adding that wrapper around the library for buffered writes). And jumprope currently doesn't work properly under MIRI, which isn't a huge problem but it's annoying. I think fixing that should take priority.
Oh and if you want a library like this for text editors, it might also make sense to add an index by lines like you have in ropey. Lots of more important things to play around with!
Okay, sounds good!
I decided to make a compromise and remove the single-byte-string optimization, but not the more general short-string optimization. It still gives a nice 2x performance boost for very short strings, and is something I think I can generalize more easily to the other functions in str_indices as well, so that it becomes more of a general pattern in the code base.
In any case, please don't hesitate to file issues here if you run into any other performance hiccups, or optimizations you think might make sense to incorporate into str_indices. :-)
Edit: oh! And of course, thanks so much for your help on this!
> Oh and if you want a library like this for text editors, it might also make sense to add an index by lines like you have in ropey.
Do you mean jumprope or str_indices? (str_indices already has line indexing functions, so I assume you mean jumprope?)
Awesome, sounds good! And yeah, I mean in jumprope.
Hey! I'm using the str_utils code in another project and I was reading the SSE code to understand it better.
In the `ByteChunk` impl for `sse2::__m128i` the code currently does this:
…which is a neat trick, but it makes the "vertical" loop have to rebuild the accumulator every 31 iterations so you don't overflow. I'm no expert at this stuff, but some reading recommended using `PSADBW(x, 0)` ("compute sum of absolute differences") instead to accumulate into the array.

So, changing the code to this:
This yields a (modest) performance improvement on my ryzen 5800:
Is this code change correct?