Open CeleritasCelery opened 1 year ago
Thanks! The general idea seems good (though I haven't reviewed the code closely). I do want to keep the algorithm the same across architectures, however, just to keep the maintenance burden reasonable. So if we can't find a performant way to implement shift_across()
on x86_64, we might need to leave this on the table.
I tried a basic approach to define it like this
#[inline(always)]
fn shift_across(&self, n: Self) -> Self {
unsafe {
let n = x86_64::_mm_slli_si128(n, 1);
let upper = core::mem::transmute::<Self, [u8; Self::SIZE]>(*self);
let mut lower = core::mem::transmute::<Self, [u8; Self::SIZE]>(n);
lower[0] = upper[Self::SIZE - 1];
core::mem::transmute::<[u8; Self::SIZE], Self>(lower)
}
}
but that resulted in a major 4x slowdown across all benchmarks. So it is going to need to be a different approach.
I no longer have time to work on this and I won't be submitting any more perf PR's for the immediate future. Feel free to close this if you don't want try and make this approach work.
@cessen I happened to come across an efficient way to do this on x86. I implemented it and it works well. Benchmarks are below. Throughput has increase 20-80% depending on the benchmark with this single pass approach.
Awesome, thanks! I'm a little busy with other things right now, but I'll try to get this reviewed and merged soon.
This change is not ready to be merged because it does not yet support x86. I don't know how to do the operation we need with SSE intrinsics (see #18).
This new approach adds a another method to ByteChunk that lets you shift a value across two vectors. This let's us turn a 2-pass counting algorithm into a single pass. Making this change alone resulted in a 40% speedup.
There is also some loop unrolling, but there was not a large difference in unrolling over a factor of 2.
benchmarks
aarch64 benchmarks
``` lines_crlf::count_breaks/lines_100_LF time: [6.5590 ns 6.5608 ns 6.5632 ns] thrpt: [16.035 GiB/s 16.041 GiB/s 16.045 GiB/s] change: time: [-29.627% -29.504% -29.396%] (p = 0.00 < 0.05) thrpt: [+41.634% +41.851% +42.100%] Performance has improved. lines_crlf::count_breaks/lines_100_CR time: [6.5589 ns 6.5658 ns 6.5773 ns] thrpt: [16.001 GiB/s 16.028 GiB/s 16.045 GiB/s] change: time: [-29.667% -29.542% -29.417%] (p = 0.00 < 0.05) thrpt: [+41.677% +41.928% +42.181%] Performance has improved. lines_crlf::count_breaks/lines_100_CRLF time: [10.238 ns 10.287 ns 10.389 ns] thrpt: [10.578 GiB/s 10.683 GiB/s 10.734 GiB/s] change: time: [-17.826% -17.581% -17.271%] (p = 0.00 < 0.05) thrpt: [+20.877% +21.332% +21.692%] Performance has improved. lines_crlf::count_breaks/lines_1000_LF time: [55.172 ns 55.218 ns 55.277 ns] thrpt: [19.039 GiB/s 19.059 GiB/s 19.075 GiB/s] change: time: [-58.229% -58.153% -58.074%] (p = 0.00 < 0.05) thrpt: [+138.51% +138.97% +139.40%] Performance has improved. lines_crlf::count_breaks/lines_1000_CR time: [55.204 ns 55.294 ns 55.397 ns] thrpt: [18.997 GiB/s 19.033 GiB/s 19.064 GiB/s] change: time: [-57.971% -57.894% -57.818%] (p = 0.00 < 0.05) thrpt: [+137.07% +137.50% +137.93%] Performance has improved. lines_crlf::count_breaks/lines_1000_CRLF time: [58.350 ns 58.377 ns 58.411 ns] thrpt: [18.814 GiB/s 18.825 GiB/s 18.834 GiB/s] change: time: [-57.676% -57.594% -57.517%] (p = 0.00 < 0.05) thrpt: [+135.39% +135.82% +136.27%] Performance has improved. lines_crlf::count_breaks/lines_10000_LF time: [474.85 ns 475.57 ns 476.47 ns] thrpt: [22.087 GiB/s 22.129 GiB/s 22.163 GiB/s] change: time: [-60.927% -60.850% -60.763%] (p = 0.00 < 0.05) thrpt: [+154.86% +155.43% +155.93%] Performance has improved. lines_crlf::count_breaks/lines_10000_CR time: [474.14 ns 474.34 ns 474.60 ns] thrpt: [22.174 GiB/s 22.187 GiB/s 22.196 GiB/s] change: time: [-60.732% -60.666% -60.600%] (p = 0.00 < 0.05) thrpt: [+153.81% +154.23% +154.66%] Performance has improved. lines_crlf::count_breaks/lines_10000_CRLF time: [496.71 ns 496.91 ns 497.17 ns] thrpt: [22.104 GiB/s 22.116 GiB/s 22.125 GiB/s] change: time: [-60.879% -60.813% -60.736%] (p = 0.00 < 0.05) thrpt: [+154.68% +155.19% +155.62%] Performance has improved. lines_crlf::from_byte_idx/lines_100_LF time: [6.6764 ns 6.6804 ns 6.6849 ns] thrpt: [15.743 GiB/s 15.754 GiB/s 15.763 GiB/s] change: time: [-28.329% -28.193% -28.053%] (p = 0.00 < 0.05) thrpt: [+38.991% +39.263% +39.527%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CR time: [6.6766 ns 6.6823 ns 6.6894 ns] thrpt: [15.732 GiB/s 15.749 GiB/s 15.762 GiB/s] change: time: [-28.438% -28.289% -28.135%] (p = 0.00 < 0.05) thrpt: [+39.150% +39.448% +39.739%] Performance has improved. lines_crlf::from_byte_idx/lines_100_CRLF time: [10.398 ns 10.403 ns 10.408 ns] thrpt: [10.559 GiB/s 10.564 GiB/s 10.569 GiB/s] change: time: [-16.452% -16.284% -16.133%] (p = 0.00 < 0.05) thrpt: [+19.237% +19.451% +19.692%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_LF time: [54.964 ns 55.003 ns 55.054 ns] thrpt: [19.116 GiB/s 19.133 GiB/s 19.147 GiB/s] change: time: [-58.536% -58.452% -58.373%] (p = 0.00 < 0.05) thrpt: [+140.23% +140.68% +141.17%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CR time: [54.934 ns 54.948 ns 54.966 ns] thrpt: [19.146 GiB/s 19.153 GiB/s 19.157 GiB/s] change: time: [-58.211% -58.139% -58.062%] (p = 0.00 < 0.05) thrpt: [+138.45% +138.89% +139.30%] Performance has improved. lines_crlf::from_byte_idx/lines_1000_CRLF time: [58.496 ns 58.511 ns 58.531 ns] thrpt: [18.776 GiB/s 18.782 GiB/s 18.787 GiB/s] change: time: [-57.525% -57.453% -57.373%] (p = 0.00 < 0.05) thrpt: [+134.59% +135.04% +135.43%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_LF time: [474.25 ns 474.40 ns 474.63 ns] thrpt: [22.173 GiB/s 22.184 GiB/s 22.191 GiB/s] change: time: [-60.981% -60.917% -60.851%] (p = 0.00 < 0.05) thrpt: [+155.43% +155.86% +156.29%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CR time: [474.25 ns 474.41 ns 474.67 ns] thrpt: [22.171 GiB/s 22.183 GiB/s 22.190 GiB/s] change: time: [-60.614% -60.546% -60.479%] (p = 0.00 < 0.05) thrpt: [+153.03% +153.46% +153.90%] Performance has improved. lines_crlf::from_byte_idx/lines_10000_CRLF time: [498.51 ns 499.28 ns 500.21 ns] thrpt: [21.970 GiB/s 22.011 GiB/s 22.045 GiB/s] change: time: [-60.672% -60.594% -60.520%] (p = 0.00 < 0.05) thrpt: [+153.29% +153.77% +154.27%] Performance has improved. lines_crlf::to_byte_idx/lines_100_LF time: [6.3709 ns 6.3828 ns 6.4011 ns] thrpt: [16.441 GiB/s 16.488 GiB/s 16.519 GiB/s] change: time: [-33.821% -33.678% -33.516%] (p = 0.00 < 0.05) thrpt: [+50.412% +50.779% +51.106%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CR time: [6.3680 ns 6.3697 ns 6.3720 ns] thrpt: [16.516 GiB/s 16.522 GiB/s 16.526 GiB/s] change: time: [-35.982% -35.870% -35.759%] (p = 0.00 < 0.05) thrpt: [+55.664% +55.933% +56.205%] Performance has improved. lines_crlf::to_byte_idx/lines_100_CRLF time: [9.9426 ns 9.9496 ns 9.9579 ns] thrpt: [11.036 GiB/s 11.045 GiB/s 11.053 GiB/s] change: time: [-23.824% -23.695% -23.559%] (p = 0.00 < 0.05) thrpt: [+30.820% +31.054% +31.275%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_LF time: [54.693 ns 54.740 ns 54.800 ns] thrpt: [19.204 GiB/s 19.225 GiB/s 19.242 GiB/s] change: time: [-34.536% -34.394% -34.244%] (p = 0.00 < 0.05) thrpt: [+52.077% +52.424% +52.755%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CR time: [54.769 ns 54.804 ns 54.846 ns] thrpt: [19.188 GiB/s 19.203 GiB/s 19.215 GiB/s] change: time: [-36.013% -35.896% -35.787%] (p = 0.00 < 0.05) thrpt: [+55.731% +55.997% +56.283%] Performance has improved. lines_crlf::to_byte_idx/lines_1000_CRLF time: [58.142 ns 58.164 ns 58.191 ns] thrpt: [18.885 GiB/s 18.894 GiB/s 18.901 GiB/s] change: time: [-35.565% -35.455% -35.343%] (p = 0.00 < 0.05) thrpt: [+54.662% +54.930% +55.194%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_LF time: [473.48 ns 473.64 ns 473.92 ns] thrpt: [22.206 GiB/s 22.219 GiB/s 22.227 GiB/s] change: time: [-39.204% -39.100% -38.995%] (p = 0.00 < 0.05) thrpt: [+63.922% +64.203% +64.484%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CR time: [473.50 ns 473.67 ns 473.95 ns] thrpt: [22.205 GiB/s 22.218 GiB/s 22.226 GiB/s] change: time: [-39.481% -39.384% -39.290%] (p = 0.00 < 0.05) thrpt: [+64.718% +64.974% +65.238%] Performance has improved. lines_crlf::to_byte_idx/lines_10000_CRLF time: [497.40 ns 497.47 ns 497.55 ns] thrpt: [22.087 GiB/s 22.091 GiB/s 22.094 GiB/s] change: time: [-39.299% -39.184% -39.061%] (p = 0.00 < 0.05) thrpt: [+64.100% +64.430% +64.743%] Performance has improved. ```