Closed BillyDM closed 1 year ago
That is really cool, why is it faster? Is this avoiding bounds checks?
I know the making each slice the same length helps because it hints to the compiler that it only needs to check one of the slices on each iteration.
For example, without the same length, the compiler chooses to check the length of each slice individually to avoid indexing one out of range. It will compile to something like:
while x1.len() >= S::VF32_WIDTH >= S::VF32_WIDTH { // x1 is effectively checked here
/// ...
x1 = &x1[S::VF32_WIDTH..]; // no need to check since you know that x1.len() >= S::VF32_WIDTH
y1 = &y1[S::VF32_WIDTH..]; // bounds check
x2 = &x2[S::VF32_WIDTH..]; // bounds check
y2 = &y2[S::VF32_WIDTH..]; // bounds check
}
However, by setting them to the same length, the compiler is able to safely check only one of them on each iteration:
let mut x1 = &x1[..x1.len()];
let mut y1 = &y1[..x1.len()]; // bounds check
let mut x2 = &x2[..x1.len()]; // bounds check
let mut y2 = &x1[..x1.len()]; // bounds check
while x1.len() >= S:VF32_WIDTH { // x1 is effectively checked here
/// ...
// no need to check bounds since you know every slice's length is >= S::VF32_WIDTH
x1 = &x1[S::VF32_WIDTH..];
y1 = &y1[S::VF32_WIDTH..];
x2 = &x2[S::VF32_WIDTH..];
y2 = &y2[S::VF32_WIDTH..];
}
As for the iterating by index vs iterating by slices. I'm not sure. My guess is that the .step_by()
function is either inefficient, or the compiler isn't able to elid it for some reason. I'm pretty sure that with the slicing method, the compiler elids it to the optimal pointer arithmetic method you would use in C:
const x1_ptr_end = &x1[len];
for (x1_ptr = &x1[0], y1_ptr = &y1[0], x2_ptr = &x2[0], y2_ptr = &y2[0];
x1_ptr != x1_ptr_end;
x1_ptr += VF32_WIDTH, y1_ptr += VF32_WIDTH, x2_ptr += VF32_WIDTH, y2_ptr += VF32_WIDTH)
{
}
I wrote my own helper functions that simplify making each slice the same length since I use it all the time. I suppose I could also use a macro, but I don't have any experience making macros. I could also write similar helper functions to slice each slice by S::VF32_WIDTH
at the end of the loop.
#[inline]
pub fn unify<'a, T>(s1: &'a [T], s2: &'a [T]) -> (&'a [T], &'a [T]) {
let len = s1.len();
(&s1[..len], &s2[..len])
}
#[inline]
pub fn unify_mut<'a, T>(s1: &'a mut [T], s2: &'a mut [T]) -> (&'a mut [T], &'a mut [T]) {
let len = s1.len();
(&mut s1[..len], &mut s2[..len])
}
#[inline]
pub fn unify_mut_imm<'a, T>(s1: &'a mut [T], s2: &'a [T]) -> (&'a mut [T], &'a [T]) {
let len = s1.len();
(&mut s1[..len], &s2[..len])
}
#[inline]
pub fn unify_4<'a, T>(
s1: &'a [T],
s2: &'a [T],
s3: &'a [T],
s4: &'a [T],
) -> (&'a [T], &'a [T], &'a [T], &'a [T]) {
let len = s1.len();
(&s1[..len], &s2[..len], &s3[..len], &s4[..len])
}
#[inline]
pub fn unify_4_mut<'a, T>(
s1: &'a mut [T],
s2: &'a mut [T],
s3: &'a mut [T],
s4: &'a mut [T],
) -> (&'a mut [T], &'a mut [T], &'a mut [T], &'a mut [T]) {
let len = s1.len();
(
&mut s1[..len],
&mut s2[..len],
&mut s3[..len],
&mut s4[..len],
)
}
#[inline]
pub fn unify_2_mut_2_imm<'a, T>(
s1: &'a mut [T],
s2: &'a mut [T],
s3: &'a [T],
s4: &'a [T],
) -> (&'a mut [T], &'a mut [T], &'a [T], &'a [T]) {
let len = s1.len();
(&mut s1[..len], &mut s2[..len], &s3[..len], &s4[..len])
}
#[inline]
pub fn unify_2_mut_1_imm<'a, T>(
s1: &'a mut [T],
s2: &'a mut [T],
s3: &'a [T],
) -> (&'a mut [T], &'a mut [T], &'a [T]) {
let len = s1.len();
(&mut s1[..len], &mut s2[..len], &s3[..len])
}
Say if I have 2 mutable and 2 immutable slices, I would use it like this:
let (mut s1, mut s2, s3, s4) = unify_2_mut_2_imm(s1, s2, s3, s4);
BTW I learned about the iterating by slicing method here: https://users.rust-lang.org/t/how-to-zip-two-slices-efficiently/2048
Note that Rust is able to elid for i in 0..len
to pointer arithmitic, but not for i in (0..len).step_by()
for some reason.
if you want to do a PR changing the docs/readme to show the better way I'll happily merge it.
Added a pull request!
Also, I found a small mistake with the test I did, and now the optimal method completed the test in only 1670 instructions! This is now the most optimal the test function can be:
simd_runtime_generate!(
fn distance_iter_by_slicing_w_same_len(
x1: &[f32],
y1: &[f32],
x2: &[f32],
y2: &[f32]) -> Vec<f32> {
debug_assert!(x1.len() % 8 == 0);
let mut result: Vec<f32> = Vec::with_capacity(x1.len());
result.set_len(x1.len()); // for efficiency
let mut x1 = &x1[..x1.len()];
let mut y1 = &y1[..x1.len()];
let mut x2 = &x2[..x1.len()];
let mut y2 = &y2[..x1.len()];
let mut res = &mut result[..x1.len()];
// Operations have to be done in terms of the vector width
// so that it will work with any size vector.
// the width of a vector type is provided as a constant
// so the compiler is free to optimize it more.
// S::VF32_WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
while x1.len() >= S::VF32_WIDTH {
//load data from your vec into a SIMD value
let xv1 = S::loadu_ps(&x1[0]);
let yv1 = S::loadu_ps(&y1[0]);
let xv2 = S::loadu_ps(&x2[0]);
let yv2 = S::loadu_ps(&y2[0]);
// Use the usual intrinsic syntax if you prefer
let mut xdiff = S::sub_ps(xv1, xv2);
// Or use operater overloading if you like
let mut ydiff = yv1 - yv2;
xdiff *= xdiff;
ydiff *= ydiff;
let distance = S::sqrt_ps(xdiff + ydiff);
// Store the SIMD value into the result vec
S::storeu_ps(&mut res[0], distance);
x1 = &x1[S::VF32_WIDTH..];
y1 = &y1[S::VF32_WIDTH..];
x2 = &x2[S::VF32_WIDTH..];
y2 = &y2[S::VF32_WIDTH..];
res = &mut res[S::VF32_WIDTH..];
}
result
});
It seems that https://github.com/jackmott/simdeez/pull/37 and https://github.com/jackmott/simdeez/pull/36 are already merged. I checked the example on the README.md
page and in lib.rs
.
I suggest that we can close this issue for now.
In the future we might make a set of examples / tutorials: minimal, straigt-forward, fastest
In the example, you show to iterate like this:
However, this is far from optimal. This would be the optimal way:
Running the attached test (in release mode) and profiling it in callgrind, I got these results:
As you can see, using the slicing method while also making each slice the same length ran almost twice as fast (completed in 2559 instructions) as your example (completed in 4479 instructions).
Of course this will only compute all values if the length of slice
x1
is a multiple ofS:VF32_WIDTH
. So it should be encouraged to assert that all lengths are a multiple of8
(or16
if you want to future-proof for AVX512) if the user chooses to use this method. Or maybe something else can be done to allow iterating over non-multiple lengths without the user writing their code twice. simdeez_test.zip