cessen / str_indices

Count and convert between various ways of indexing utf8 string slices.
Apache License 2.0
16 stars 5 forks source link

Optimize lines_crlf #21

Open CeleritasCelery opened 1 year ago

CeleritasCelery commented 1 year ago

This change is not ready to be merged because it does not yet support x86. I don't know how to do the operation we need with SSE intrinsics (see #18).

This new approach adds a another method to ByteChunk that lets you shift a value across two vectors. This let's us turn a 2-pass counting algorithm into a single pass. Making this change alone resulted in a 40% speedup.

There is also some loop unrolling, but there was not a large difference in unrolling over a factor of 2.

benchmarks

aarch64 benchmarks ``` lines_crlf::count_breaks/lines_100_LF time: [6.5590 ns 6.5608 ns 6.5632 ns] thrpt: [16.035 GiB/s 16.041 GiB/s 16.045 GiB/s] change: time: [-29.627% -29.504% -29.396%] (p = 0.00 < 0.05) thrpt: [+41.634% +41.851% +42.100%] Performance has improved. lines_crlf::count_breaks/lines_100_CR time: [6.5589 ns 6.5658 ns 6.5773 ns] thrpt: [16.001 GiB/s 16.028 GiB/s 16.045 GiB/s] change: time: [-29.667% -29.542% -29.417%] (p = 0.00 < 0.05) thrpt: [+41.677% +41.928% +42.181%] Performance has improved. lines_crlf::count_breaks/lines_100_CRLF time: [10.238 ns 10.287 ns 10.389 ns] thrpt: [10.578 GiB/s 10.683 GiB/s 10.734 GiB/s] change: time: [-17.826% -17.581% -17.271%] (p = 0.00 < 0.05) thrpt: [+20.877% +21.332% +21.692%] Performance has improved. lines_crlf::count_breaks/lines_1000_LF time: [55.172 ns 55.218 ns 55.277 ns] thrpt: [19.039 GiB/s 19.059 GiB/s 19.075 GiB/s] change: time: [-58.229% -58.153% -58.074%] (p = 0.00 < 0.05) thrpt: [+138.51% +138.97% +139.40%] Performance has improved. lines_crlf::count_breaks/lines_1000_CR time: [55.204 ns 55.294 ns 55.397 ns] thrpt: [18.997 GiB/s 19.033 GiB/s 19.064 GiB/s] change: time: [-57.971% -57.894% -57.818%] (p = 0.00 < 0.05) thrpt: [+137.07% +137.50% +137.93%] Performance has improved. lines_crlf::count_breaks/lines_1000_CRLF time: [58.350 ns 58.377 ns 58.411 ns] thrpt: [18.814 GiB/s 18.825 GiB/s 18.834 GiB/s] change: time: [-57.676% -57.594% -57.517%] (p = 0.00 < 0.05) thrpt: [+135.39% +135.82% +136.27%] Performance has improved. lines_crlf::count_breaks/lines_10000_LF time: [474.85 ns 475.57 ns 476.47 ns] thrpt: [22.087 GiB/s 22.129 GiB/s 22.163 GiB/s] change: time: [-60.927% -60.850% -60.763%] (p = 0.00 < 0.05) thrpt: [+154.86% +155.43% +155.93%] Performance has improved. lines_crlf::count_breaks/lines_10000_CR time: [474.14 ns 474.34 ns 474.60 ns] thrpt: [22.174 GiB/s 22.187 GiB/s 22.196 GiB/s] change: time: [-60.732% -60.666% -60.600%] (p = 0.00 < 0.05) thrpt: [+153.81% +154.23% +154.66%] Performance has improved. lines_crlf::count_breaks/lines_10000_CRLF time: [496.71 ns 496.91 ns 497.17 ns] thrpt: [22.104 GiB/s 22.116 GiB/s 22.125 GiB/s] change: time: [-60.879% -60.813% -60.736%] (p = 0.00 < 0.05) thrpt: [+154.68% +155.19% +155.62%] Performance has improved. lines_crlf::from_byte_idx/lines_100_LF time: [6.6764 ns 6.6804 ns 6.6849 ns] thrpt: [15.743 GiB/s 15.754 GiB/s 15.763 GiB/s] change: time: [-28.329% -28.193% -28.053%] (p = 0.00 < 0.05) thrpt: [+38.991% +39.263% +39.527%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CR time: [6.6766 ns 6.6823 ns 6.6894 ns] thrpt: [15.732 GiB/s 15.749 GiB/s 15.762 GiB/s] change: time: [-28.438% -28.289% -28.135%] (p = 0.00 < 0.05) thrpt: [+39.150% +39.448% +39.739%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CRLF time: [10.398 ns 10.403 ns 10.408 ns] thrpt: [10.559 GiB/s 10.564 GiB/s 10.569 GiB/s] change: time: [-16.452% -16.284% -16.133%] (p = 0.00 < 0.05) thrpt: [+19.237% +19.451% +19.692%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_LF time: [54.964 ns 55.003 ns 55.054 ns] thrpt: [19.116 GiB/s 19.133 GiB/s 19.147 GiB/s] change: time: [-58.536% -58.452% -58.373%] (p = 0.00 < 0.05) thrpt: [+140.23% +140.68% +141.17%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CR time: [54.934 ns 54.948 ns 54.966 ns] thrpt: [19.146 GiB/s 19.153 GiB/s 19.157 GiB/s] change: time: [-58.211% -58.139% -58.062%] (p = 0.00 < 0.05) thrpt: [+138.45% +138.89% +139.30%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CRLF time: [58.496 ns 58.511 ns 58.531 ns] thrpt: [18.776 GiB/s 18.782 GiB/s 18.787 GiB/s] change: time: [-57.525% -57.453% -57.373%] (p = 0.00 < 0.05) thrpt: [+134.59% +135.04% +135.43%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_LF time: [474.25 ns 474.40 ns 474.63 ns] thrpt: [22.173 GiB/s 22.184 GiB/s 22.191 GiB/s] change: time: [-60.981% -60.917% -60.851%] (p = 0.00 < 0.05) thrpt: [+155.43% +155.86% +156.29%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CR time: [474.25 ns 474.41 ns 474.67 ns] thrpt: [22.171 GiB/s 22.183 GiB/s 22.190 GiB/s] change: time: [-60.614% -60.546% -60.479%] (p = 0.00 < 0.05) thrpt: [+153.03% +153.46% +153.90%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CRLF time: [498.51 ns 499.28 ns 500.21 ns] thrpt: [21.970 GiB/s 22.011 GiB/s 22.045 GiB/s] change: time: [-60.672% -60.594% -60.520%] (p = 0.00 < 0.05) thrpt: [+153.29% +153.77% +154.27%] Performance has improved. lines_crlf::to_byte_idx/lines_100_LF time: [6.3709 ns 6.3828 ns 6.4011 ns] thrpt: [16.441 GiB/s 16.488 GiB/s 16.519 GiB/s] change: time: [-33.821% -33.678% -33.516%] (p = 0.00 < 0.05) thrpt: [+50.412% +50.779% +51.106%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CR time: [6.3680 ns 6.3697 ns 6.3720 ns] thrpt: [16.516 GiB/s 16.522 GiB/s 16.526 GiB/s] change: time: [-35.982% -35.870% -35.759%] (p = 0.00 < 0.05) thrpt: [+55.664% +55.933% +56.205%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CRLF time: [9.9426 ns 9.9496 ns 9.9579 ns] thrpt: [11.036 GiB/s 11.045 GiB/s 11.053 GiB/s] change: time: [-23.824% -23.695% -23.559%] (p = 0.00 < 0.05) thrpt: [+30.820% +31.054% +31.275%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_LF time: [54.693 ns 54.740 ns 54.800 ns] thrpt: [19.204 GiB/s 19.225 GiB/s 19.242 GiB/s] change: time: [-34.536% -34.394% -34.244%] (p = 0.00 < 0.05) thrpt: [+52.077% +52.424% +52.755%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CR time: [54.769 ns 54.804 ns 54.846 ns] thrpt: [19.188 GiB/s 19.203 GiB/s 19.215 GiB/s] change: time: [-36.013% -35.896% -35.787%] (p = 0.00 < 0.05) thrpt: [+55.731% +55.997% +56.283%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CRLF time: [58.142 ns 58.164 ns 58.191 ns] thrpt: [18.885 GiB/s 18.894 GiB/s 18.901 GiB/s] change: time: [-35.565% -35.455% -35.343%] (p = 0.00 < 0.05) thrpt: [+54.662% +54.930% +55.194%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_LF time: [473.48 ns 473.64 ns 473.92 ns] thrpt: [22.206 GiB/s 22.219 GiB/s 22.227 GiB/s] change: time: [-39.204% -39.100% -38.995%] (p = 0.00 < 0.05) thrpt: [+63.922% +64.203% +64.484%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CR time: [473.50 ns 473.67 ns 473.95 ns] thrpt: [22.205 GiB/s 22.218 GiB/s 22.226 GiB/s] change: time: [-39.481% -39.384% -39.290%] (p = 0.00 < 0.05) thrpt: [+64.718% +64.974% +65.238%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CRLF time: [497.40 ns 497.47 ns 497.55 ns] thrpt: [22.087 GiB/s 22.091 GiB/s 22.094 GiB/s] change: time: [-39.299% -39.184% -39.061%] (p = 0.00 < 0.05) thrpt: [+64.100% +64.430% +64.743%] Performance has improved. ```
cessen commented 1 year ago

Thanks! The general idea seems good (though I haven't reviewed the code closely). I do want to keep the algorithm the same across architectures, however, just to keep the maintenance burden reasonable. So if we can't find a performant way to implement shift_across() on x86_64, we might need to leave this on the table.

CeleritasCelery commented 1 year ago

I tried a basic approach to define it like this

    #[inline(always)]
    fn shift_across(&self, n: Self) -> Self {
        unsafe {
            let n = x86_64::_mm_slli_si128(n, 1);
            let upper = core::mem::transmute::<Self, [u8; Self::SIZE]>(*self);
            let mut lower = core::mem::transmute::<Self, [u8; Self::SIZE]>(n);
            lower[0] = upper[Self::SIZE - 1];
            core::mem::transmute::<[u8; Self::SIZE], Self>(lower)
        }
    }

but that resulted in a major 4x slowdown across all benchmarks. So it is going to need to be a different approach.

CeleritasCelery commented 1 year ago

I no longer have time to work on this and I won't be submitting any more perf PR's for the immediate future. Feel free to close this if you don't want try and make this approach work.

CeleritasCelery commented 4 months ago

@cessen I happened to come across an efficient way to do this on x86. I implemented it and it works well. Benchmarks are below. Throughput has increase 20-80% depending on the benchmark with this single pass approach.

x86_64 benchmarks ``` lines_crlf::count_breaks/lines_100_LF time: [23.226 ns 23.232 ns 23.239 ns] thrpt: [4.5286 GiB/s 4.5299 GiB/s 4.5311 GiB/s] change: time: [-16.945% -16.888% -16.841%] (p = 0.00 < 0.05) thrpt: [+20.252% +20.319% +20.401%] Performance has improved. lines_crlf::count_breaks/lines_100_CR time: [23.226 ns 23.231 ns 23.236 ns] thrpt: [4.5291 GiB/s 4.5301 GiB/s 4.5311 GiB/s] change: time: [-16.881% -16.828% -16.744%] (p = 0.00 < 0.05) thrpt: [+20.111% +20.233% +20.310%] Performance has improved. lines_crlf::count_breaks/lines_100_CRLF time: [30.983 ns 30.990 ns 30.996 ns] thrpt: [3.5455 GiB/s 3.5462 GiB/s 3.5469 GiB/s] change: time: [-9.7291% -9.6790% -9.6373%] (p = 0.00 < 0.05) thrpt: [+10.665% +10.716% +10.778%] Performance has improved. lines_crlf::count_breaks/lines_1000_LF time: [177.77 ns 177.79 ns 177.81 ns] thrpt: [5.9187 GiB/s 5.9194 GiB/s 5.9201 GiB/s] change: time: [-32.601% -32.577% -32.560%] (p = 0.00 < 0.05) thrpt: [+48.281% +48.318% +48.370%] Performance has improved. lines_crlf::count_breaks/lines_1000_CR time: [177.83 ns 177.86 ns 177.89 ns] thrpt: [5.9160 GiB/s 5.9170 GiB/s 5.9179 GiB/s] change: time: [-24.170% -24.130% -24.097%] (p = 0.00 < 0.05) thrpt: [+31.747% +31.805% +31.874%] Performance has improved. lines_crlf::count_breaks/lines_1000_CRLF time: [187.96 ns 187.99 ns 188.02 ns] thrpt: [5.8450 GiB/s 5.8459 GiB/s 5.8469 GiB/s] change: time: [-25.359% -25.333% -25.313%] (p = 0.00 < 0.05) thrpt: [+33.892% +33.929% +33.974%] Performance has improved. lines_crlf::count_breaks/lines_10000_LF time: [1.6073 µs 1.6073 µs 1.6073 µs] thrpt: [6.5475 GiB/s 6.5477 GiB/s 6.5478 GiB/s] change: time: [-32.045% -31.967% -31.889%] (p = 0.00 < 0.05) thrpt: [+46.819% +46.987% +47.156%] Performance has improved. lines_crlf::count_breaks/lines_10000_CR time: [1.6073 µs 1.6075 µs 1.6077 µs] thrpt: [6.5460 GiB/s 6.5469 GiB/s 6.5474 GiB/s] change: time: [-22.743% -22.714% -22.677%] (p = 0.00 < 0.05) thrpt: [+29.327% +29.389% +29.438%] Performance has improved. lines_crlf::count_breaks/lines_10000_CRLF time: [1.6836 µs 1.6837 µs 1.6839 µs] thrpt: [6.5264 GiB/s 6.5270 GiB/s 6.5274 GiB/s] change: time: [-23.364% -23.330% -23.293%] (p = 0.00 < 0.05) thrpt: [+30.367% +30.430% +30.486%] Performance has improved. lines_crlf::from_byte_idx/lines_100_LF time: [23.461 ns 23.465 ns 23.468 ns] thrpt: [4.4843 GiB/s 4.4850 GiB/s 4.4856 GiB/s] change: time: [-17.056% -17.001% -16.947%] (p = 0.00 < 0.05) thrpt: [+20.406% +20.484% +20.564%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CR time: [23.460 ns 23.463 ns 23.466 ns] thrpt: [4.4848 GiB/s 4.4854 GiB/s 4.4860 GiB/s] change: time: [-17.053% -16.992% -16.926%] (p = 0.00 < 0.05) thrpt: [+20.375% +20.470% +20.559%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CRLF time: [31.122 ns 31.128 ns 31.135 ns] thrpt: [3.5297 GiB/s 3.5305 GiB/s 3.5312 GiB/s] change: time: [-9.9060% -9.7928% -9.7091%] (p = 0.00 < 0.05) thrpt: [+10.753% +10.856% +10.995%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_LF time: [178.06 ns 178.09 ns 178.13 ns] thrpt: [5.9081 GiB/s 5.9093 GiB/s 5.9104 GiB/s] change: time: [-32.661% -32.630% -32.605%] (p = 0.00 < 0.05) thrpt: [+48.379% +48.433% +48.503%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CR time: [178.09 ns 178.13 ns 178.16 ns] thrpt: [5.9069 GiB/s 5.9081 GiB/s 5.9092 GiB/s] change: time: [-24.232% -24.208% -24.189%] (p = 0.00 < 0.05) thrpt: [+31.907% +31.940% +31.981%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CRLF time: [188.19 ns 188.24 ns 188.30 ns] thrpt: [5.8363 GiB/s 5.8382 GiB/s 5.8395 GiB/s] change: time: [-25.091% -25.054% -24.999%] (p = 0.00 < 0.05) thrpt: [+33.331% +33.429% +33.496%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_LF time: [1.6065 µs 1.6066 µs 1.6069 µs] thrpt: [6.5494 GiB/s 6.5503 GiB/s 6.5509 GiB/s] change: time: [-32.121% -32.038% -31.949%] (p = 0.00 < 0.05) thrpt: [+46.948% +47.141% +47.322%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CR time: [1.6066 µs 1.6068 µs 1.6071 µs] thrpt: [6.5482 GiB/s 6.5497 GiB/s 6.5506 GiB/s] change: time: [-22.507% -22.465% -22.422%] (p = 0.00 < 0.05) thrpt: [+28.902% +28.974% +29.044%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CRLF time: [1.6709 µs 1.6710 µs 1.6711 µs] thrpt: [6.5761 GiB/s 6.5767 GiB/s 6.5770 GiB/s] change: time: [-24.019% -23.972% -23.934%] (p = 0.00 < 0.05) thrpt: [+31.464% +31.530% +31.611%] Performance has improved. lines_crlf::to_byte_idx/lines_100_LF time: [27.695 ns 27.703 ns 27.712 ns] thrpt: [3.7977 GiB/s 3.7989 GiB/s 3.8000 GiB/s] change: time: [-22.503% -22.464% -22.430%] (p = 0.00 < 0.05) thrpt: [+28.915% +28.972% +29.038%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CR time: [27.703 ns 27.714 ns 27.728 ns] thrpt: [3.7954 GiB/s 3.7973 GiB/s 3.7988 GiB/s] change: time: [-24.568% -24.497% -24.424%] (p = 0.00 < 0.05) thrpt: [+32.316% +32.445% +32.570%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CRLF time: [38.603 ns 38.621 ns 38.639 ns] thrpt: [2.8442 GiB/s 2.8455 GiB/s 2.8468 GiB/s] change: time: [-28.157% -28.091% -28.024%] (p = 0.00 < 0.05) thrpt: [+38.934% +39.065% +39.192%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_LF time: [220.19 ns 220.27 ns 220.37 ns] thrpt: [4.7757 GiB/s 4.7777 GiB/s 4.7794 GiB/s] change: time: [-37.749% -37.677% -37.608%] (p = 0.00 < 0.05) thrpt: [+60.276% +60.454% +60.640%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CR time: [220.16 ns 220.21 ns 220.26 ns] thrpt: [4.7779 GiB/s 4.7790 GiB/s 4.7800 GiB/s] change: time: [-41.670% -41.648% -41.626%] (p = 0.00 < 0.05) thrpt: [+71.310% +71.375% +71.437%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CRLF time: [231.20 ns 231.32 ns 231.46 ns] thrpt: [4.7480 GiB/s 4.7507 GiB/s 4.7534 GiB/s] change: time: [-39.793% -39.715% -39.640%] (p = 0.00 < 0.05) thrpt: [+65.672% +65.877% +66.094%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_LF time: [1.9426 µs 1.9427 µs 1.9427 µs] thrpt: [5.4171 GiB/s 5.4173 GiB/s 5.4174 GiB/s] change: time: [-38.454% -38.407% -38.375%] (p = 0.00 < 0.05) thrpt: [+62.272% +62.357% +62.480%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CR time: [1.9427 µs 1.9428 µs 1.9430 µs] thrpt: [5.4164 GiB/s 5.4169 GiB/s 5.4172 GiB/s] change: time: [-44.405% -44.392% -44.372%] (p = 0.00 < 0.05) thrpt: [+79.766% +79.829% +79.872%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CRLF time: [2.0376 µs 2.0381 µs 2.0386 µs] thrpt: [5.3907 GiB/s 5.3922 GiB/s 5.3933 GiB/s] change: time: [-39.446% -39.427% -39.385%] (p = 0.00 < 0.05) thrpt: [+64.977% +65.091% +65.141%] Performance has improved. ```
cessen commented 4 months ago

Awesome, thanks! I'm a little busy with other things right now, but I'll try to get this reviewed and merged soon.