arduano / simdeez

easy simd
MIT License
332 stars 25 forks source link

Iterating in the example is not optimal #35

Closed BillyDM closed 1 year ago

BillyDM commented 4 years ago

In the example, you show to iterate like this:

for i in (0..x1.len()).step_by(S::VF32_WIDTH) {
    let xv1 = S::loadu_ps(&x1[i]);
    let yv1 = S::loadu_ps(&y1[i]);
    let xv2 = S::loadu_ps(&x2[i]);
    let yv2 = S::loadu_ps(&y2[i]);
   /// ...
}

However, this is far from optimal. This would be the optimal way:

/// assert that the length is a multiple of the largest possible S:VF32_WIDTH (optional)
debug_assert_eq!(x1.len() % 8, 0);

/// make each slice the same length for efficiency
let mut x1 = &x1[..x1.len()];
let mut y1 = &y1[..x1.len()];
let mut x2 = &x2[..x1.len()];
let mut y2 = &x1[..x1.len()];

/// iterate by slicing
while x1.len() >= S::VF32_WIDTH {
    let xv1 = S::loadu_ps(&x1[0]);
    let yv1 = S::loadu_ps(&y1[0]);
    let xv2 = S::loadu_ps(&x2[0]);
    let yv2 = S::loadu_ps(&y2[0]);

    /// ...

    x1 = &x1[S::VF32_WIDTH..];
    y1 = &y1[S::VF32_WIDTH..];
    x2 = &x2[S::VF32_WIDTH..];
    y2 = &y2[S::VF32_WIDTH..];
}

Running the attached test (in release mode) and profiling it in callgrind, I got these results: Screenshot from 2020-07-05 17-08-44 As you can see, using the slicing method while also making each slice the same length ran almost twice as fast (completed in 2559 instructions) as your example (completed in 4479 instructions).

Of course this will only compute all values if the length of slice x1 is a multiple of S:VF32_WIDTH. So it should be encouraged to assert that all lengths are a multiple of 8 (or 16 if you want to future-proof for AVX512) if the user chooses to use this method. Or maybe something else can be done to allow iterating over non-multiple lengths without the user writing their code twice. simdeez_test.zip

jackmott commented 4 years ago

That is really cool, why is it faster? Is this avoiding bounds checks?

BillyDM commented 4 years ago

I know the making each slice the same length helps because it hints to the compiler that it only needs to check one of the slices on each iteration.

For example, without the same length, the compiler chooses to check the length of each slice individually to avoid indexing one out of range. It will compile to something like:

while x1.len() >= S::VF32_WIDTH >= S::VF32_WIDTH { // x1 is effectively checked here
    /// ...

    x1 = &x1[S::VF32_WIDTH..]; // no need to check since you know that x1.len() >= S::VF32_WIDTH
    y1 = &y1[S::VF32_WIDTH..]; // bounds check
    x2 = &x2[S::VF32_WIDTH..]; // bounds check
    y2 = &y2[S::VF32_WIDTH..]; // bounds check
}

However, by setting them to the same length, the compiler is able to safely check only one of them on each iteration:

let mut x1 = &x1[..x1.len()];
let mut y1 = &y1[..x1.len()]; // bounds check
let mut x2 = &x2[..x1.len()]; // bounds check
let mut y2 = &x1[..x1.len()]; // bounds check

while x1.len() >= S:VF32_WIDTH { // x1 is effectively checked here
    /// ...

    // no need to check bounds since you know every slice's length is >= S::VF32_WIDTH
    x1 = &x1[S::VF32_WIDTH..];
    y1 = &y1[S::VF32_WIDTH..];
    x2 = &x2[S::VF32_WIDTH..];
    y2 = &y2[S::VF32_WIDTH..];
}

As for the iterating by index vs iterating by slices. I'm not sure. My guess is that the .step_by() function is either inefficient, or the compiler isn't able to elid it for some reason. I'm pretty sure that with the slicing method, the compiler elids it to the optimal pointer arithmetic method you would use in C:

const x1_ptr_end = &x1[len];

for (x1_ptr = &x1[0], y1_ptr = &y1[0], x2_ptr = &x2[0], y2_ptr = &y2[0];
     x1_ptr != x1_ptr_end;
     x1_ptr += VF32_WIDTH, y1_ptr += VF32_WIDTH, x2_ptr += VF32_WIDTH, y2_ptr += VF32_WIDTH)
{

}
BillyDM commented 4 years ago

I wrote my own helper functions that simplify making each slice the same length since I use it all the time. I suppose I could also use a macro, but I don't have any experience making macros. I could also write similar helper functions to slice each slice by S::VF32_WIDTH at the end of the loop.

#[inline]
pub fn unify<'a, T>(s1: &'a [T], s2: &'a [T]) -> (&'a [T], &'a [T]) {
    let len = s1.len();
    (&s1[..len], &s2[..len])
}

#[inline]
pub fn unify_mut<'a, T>(s1: &'a mut [T], s2: &'a mut [T]) -> (&'a mut [T], &'a mut [T]) {
    let len = s1.len();
    (&mut s1[..len], &mut s2[..len])
}

#[inline]
pub fn unify_mut_imm<'a, T>(s1: &'a mut [T], s2: &'a [T]) -> (&'a mut [T], &'a [T]) {
    let len = s1.len();
    (&mut s1[..len], &s2[..len])
}

#[inline]
pub fn unify_4<'a, T>(
    s1: &'a [T],
    s2: &'a [T],
    s3: &'a [T],
    s4: &'a [T],
) -> (&'a [T], &'a [T], &'a [T], &'a [T]) {
    let len = s1.len();
    (&s1[..len], &s2[..len], &s3[..len], &s4[..len])
}

#[inline]
pub fn unify_4_mut<'a, T>(
    s1: &'a mut [T],
    s2: &'a mut [T],
    s3: &'a mut [T],
    s4: &'a mut [T],
) -> (&'a mut [T], &'a mut [T], &'a mut [T], &'a mut [T]) {
    let len = s1.len();
    (
        &mut s1[..len],
        &mut s2[..len],
        &mut s3[..len],
        &mut s4[..len],
    )
}

#[inline]
pub fn unify_2_mut_2_imm<'a, T>(
    s1: &'a mut [T],
    s2: &'a mut [T],
    s3: &'a [T],
    s4: &'a [T],
) -> (&'a mut [T], &'a mut [T], &'a [T], &'a [T]) {
    let len = s1.len();
    (&mut s1[..len], &mut s2[..len], &s3[..len], &s4[..len])
}

#[inline]
pub fn unify_2_mut_1_imm<'a, T>(
    s1: &'a mut [T],
    s2: &'a mut [T],
    s3: &'a [T],
) -> (&'a mut [T], &'a mut [T], &'a [T]) {
    let len = s1.len();
    (&mut s1[..len], &mut s2[..len], &s3[..len])
}

Say if I have 2 mutable and 2 immutable slices, I would use it like this:

let (mut s1, mut s2, s3, s4) = unify_2_mut_2_imm(s1, s2, s3, s4);
BillyDM commented 4 years ago

BTW I learned about the iterating by slicing method here: https://users.rust-lang.org/t/how-to-zip-two-slices-efficiently/2048

Note that Rust is able to elid for i in 0..len to pointer arithmitic, but not for i in (0..len).step_by() for some reason.

jackmott commented 4 years ago

if you want to do a PR changing the docs/readme to show the better way I'll happily merge it.

BillyDM commented 4 years ago

Added a pull request!

Also, I found a small mistake with the test I did, and now the optimal method completed the test in only 1670 instructions! This is now the most optimal the test function can be:

simd_runtime_generate!(
fn distance_iter_by_slicing_w_same_len(
    x1: &[f32],
    y1: &[f32],
    x2: &[f32],
    y2: &[f32]) -> Vec<f32> {

    debug_assert!(x1.len() % 8 == 0);

    let mut result: Vec<f32> = Vec::with_capacity(x1.len());
    result.set_len(x1.len()); // for efficiency

    let mut x1 = &x1[..x1.len()];
    let mut y1 = &y1[..x1.len()];
    let mut x2 = &x2[..x1.len()];
    let mut y2 = &y2[..x1.len()];
    let mut res = &mut result[..x1.len()];

    // Operations have to be done in terms of the vector width
    // so that it will work with any size vector.
    // the width of a vector type is provided as a constant
    // so the compiler is free to optimize it more.
    // S::VF32_WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
    while x1.len() >= S::VF32_WIDTH {
        //load data from your vec into a SIMD value
        let xv1 = S::loadu_ps(&x1[0]);
        let yv1 = S::loadu_ps(&y1[0]);
        let xv2 = S::loadu_ps(&x2[0]);
        let yv2 = S::loadu_ps(&y2[0]);

        // Use the usual intrinsic syntax if you prefer
        let mut xdiff = S::sub_ps(xv1, xv2);
        // Or use operater overloading if you like
        let mut ydiff = yv1 - yv2;
        xdiff *= xdiff;
        ydiff *= ydiff;
        let distance = S::sqrt_ps(xdiff + ydiff);
        // Store the SIMD value into the result vec
        S::storeu_ps(&mut res[0], distance);

        x1 = &x1[S::VF32_WIDTH..];
        y1 = &y1[S::VF32_WIDTH..];
        x2 = &x2[S::VF32_WIDTH..];
        y2 = &y2[S::VF32_WIDTH..];
        res = &mut res[S::VF32_WIDTH..];
    }

    result
});
verpeteren commented 1 year ago

It seems that https://github.com/jackmott/simdeez/pull/37 and https://github.com/jackmott/simdeez/pull/36 are already merged. I checked the example on the README.md page and in lib.rs. I suggest that we can close this issue for now. In the future we might make a set of examples / tutorials: minimal, straigt-forward, fastest