arduano / simdeez

easy simd
MIT License

Version 2.0.0 (master) does not build with sleef / no_std features enabled #63

Open verpeteren opened 10 months ago

verpeteren commented 10 months ago

@arduano did an overhaul of the SIMD traits. That was a big undertaking and brings many nice improvements, like operator overloading. I am trying to port my code over to the v2.0.0-dev3 (current master) branch and I noticed that there are 3 problems:

  1. it does not build with no_std (cargo build --features "no_std")

  2. it does not build with sleef (cargo build --features "sleef")

  3. it seems that only scalar is exposed. A simple test program that imports sse2, sse41 or avx2 fails with: failed to resolve: could not find 'avx2' in 'simdeez'.

Here is a simple test program (src/main.rs):

use simdeez::prelude::*;

use simdeez::avx2::*;
use simdeez::scalar::*;
use simdeez::sse2::*;
use simdeez::sse41::*;

// If you want your SIMD function to use runtime feature detection to call
// the fastest available version, use the simd_runtime_generate macro:

fn main() {
    simd_runtime_generate!(
        fn distance(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
            let mut result: Vec<f32> = Vec::with_capacity(x1.len());
            result.set_len(x1.len()); // for efficiency

            // Set each slice to the same length for iteration efficiency
            let mut x1 = &x1[..x1.len()];
            let mut y1 = &y1[..x1.len()];
            let mut x2 = &x2[..x1.len()];
            let mut y2 = &y2[..x1.len()];
            let mut res = &mut result[..x1.len()];

            // Operations have to be done in terms of the vector width
            // so that it will work with any size vector.
            // the width of a vector type is provided as a constant
            // so the compiler is free to optimize it more.
            // S::Vf32::WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
            while x1.len() >= S::Vf32::WIDTH {
                //load data from your vec into an SIMD value
                let xv1 = S::Vf32::load_from_slice(x1);
                let yv1 = S::Vf32::load_from_slice(y1);
                let xv2 = S::Vf32::load_from_slice(x2);
                let yv2 = S::Vf32::load_from_slice(y2);

                let mut xdiff = xv1 - xv2;
                let mut ydiff = yv1 - yv2;
                xdiff *= xdiff;
                ydiff *= ydiff;
                let distance = (xdiff + ydiff).sqrt();
                // Store the SIMD value into the result vec
                distance.copy_to_slice(&mut res);

                // Move each slice to the next position
                x1 = &x1[S::Vf32::WIDTH..];
                y1 = &y1[S::Vf32::WIDTH..];
                x2 = &x2[S::Vf32::WIDTH..];
                y2 = &y2[S::Vf32::WIDTH..];
                res = &mut res[S::Vf32::WIDTH..];
            }

            // (Optional) Compute the remaining elements. Not necessary if you are sure the length
            // of your data is always a multiple of the maximum S::Vf32::WIDTH you compile for (4 for SSE, 8 for AVX2, etc).
            // This can be asserted by putting `assert_eq!(x1.len(), 0);` here
            for i in 0..x1.len() {
                let mut xdiff = x1[i] - x2[i];
                let mut ydiff = y1[i] - y2[i];
                xdiff *= xdiff;
                ydiff *= ydiff;
                let distance = (xdiff + ydiff).sqrt();
                res[i] = distance;
            }

            result
        }
    );

    let x1 = vec![0.0, 1.30, 2.3, 4.0];
    let y1 = vec![0.0, 1.30, 2.3, 4.0];
    let x2 = vec![0.0, 1.30, 2.3, 4.0];
    let y2 = vec![0.0, 1.30, 2.3, 4.0];

    //distance_scalar
    //distance<S: Simd> is the generic version of your function
    let got = distance_scalar(x1.as_slice(), y1.as_slice(), x2.as_slice(), y2.as_slice());
    //distance_runtime_select
    //distance_sse2
    //distance_sse41
    //distance_avx
    //distance_avx2
    //distance_runtime_select picks the fastest of the above at runtime
}

Please advise on how I can assist with these problems.
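For readers without the crate at hand, the loop structure of the test program above (full vector-width blocks first, then a scalar remainder) can be sketched in plain Rust. This is only an illustration: WIDTH is a hypothetical stand-in for S::Vf32::WIDTH, the inner loop stands in for one SIMD operation, and nothing here is simdeez API. The buffer is zero-initialised with vec! rather than with_capacity plus set_len, since Vec::set_len is an unsafe function.

```rust
// Hypothetical stand-in for S::Vf32::WIDTH (4 lanes, as with SSE).
const WIDTH: usize = 4;

fn distance(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
    let n = x1.len();
    assert!(y1.len() == n && x2.len() == n && y2.len() == n);

    // Zero-initialised output buffer; no unsafe needed.
    let mut result = vec![0.0_f32; n];

    // Number of elements covered by complete WIDTH-sized blocks.
    let full = n - n % WIDTH;

    // Main loop: one WIDTH-sized block per iteration. In a real SIMD
    // version the inner loop would be a single vector operation.
    for start in (0..full).step_by(WIDTH) {
        for i in start..start + WIDTH {
            let dx = x1[i] - x2[i];
            let dy = y1[i] - y2[i];
            result[i] = (dx * dx + dy * dy).sqrt();
        }
    }

    // Remainder loop: the elements that do not fill a whole block.
    for i in full..n {
        let dx = x1[i] - x2[i];
        let dy = y1[i] - y2[i];
        result[i] = (dx * dx + dy * dy).sqrt();
    }

    result
}

fn main() {
    let x1 = [0.0, 3.0, 0.0, 0.0, 6.0];
    let y1 = [0.0; 5];
    let x2 = [0.0; 5];
    let y2 = [0.0, 4.0, 0.0, 0.0, 8.0];
    println!("{:?}", distance(&x1, &y1, &x2, &y2));
}
```

With five elements and WIDTH = 4, the first four go through the block loop and the last one through the remainder loop.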

arduano commented 10 months ago

Hi, you raise some valid concerns, along with some things that were intentional design decisions.

  1. Not compiling with no_std is an oversight, I will attempt to fix it soon.
  2. I haven't gotten around to porting sleef, it is a fairly weird library to include, but I know it needs to be done to maintain feature parity with the previous simdeez versions.
  3. That was an intentional design decision. One of the main complaints I see for simdeez is that it's all unsafe, which is obviously caused by the fact that simdeez has no way of knowing whether individual operations are valid or not at runtime. My approach was to hide away any simd instruction set specific calls, and just allow:
    • Scalar (as it is always supported)
    • A general function that always picks the best instruction set, guaranteeing that the calls will be safe
    • Using simd_unsafe_generate_all to generate a list of unsafe functions for each instruction set

I made the structs associated with the instruction sets private, as I can't make using them directly safe, but at the same time I want to keep the API as safe as possible (without needing unsafe blocks literally everywhere).

Does simd_unsafe_generate_all fix the problem for you?
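The general idea described above (unsafe per-instruction-set functions hidden behind one safe entry point that checks the CPU first) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not simdeez's implementation: the names sum, sum_scalar and sum_avx2 are hypothetical, and std::is_x86_feature_detected! requires std.

```rust
// Hypothetical names throughout; this is not simdeez's API.

/// Portable fallback that is always safe to call.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// AVX2-enabled variant. It is declared unsafe because the caller must
/// guarantee that the running CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // Ordinary Rust body; the attribute allows the compiler to use
    // AVX2 instructions when generating code for this function.
    xs.iter().sum()
}

/// Safe entry point: checks the CPU at runtime before dispatching.
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Any other architecture, or an x86_64 CPU without AVX2.
    sum_scalar(xs)
}

fn main() {
    let data = [1.0_f32, 2.0, 3.0, 4.0];
    println!("{}", sum(&data));
}
```

Callers only ever see the safe sum function; the unsafe call is confined to one place where the feature check has just been made.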

arduano commented 10 months ago

Oh also, the README example is outdated; please look at the examples folder for a functioning example. The sum function there just does basic SIMD addition from one vector into another, although modern CPUs can optimize that at runtime via pipelining just fine. So I also added a function there called simd_get_sum, which uses SIMD-based string parsing to parse all the numbers in the lists of characters and sum them. Naturally, all of this needs more documentation; I just paused for a bit. I will look at it again soon though (today or in the next few days).

arduano commented 10 months ago

Ok, I'm just looking at the no_std compile error. Would you be familiar with how to do NEON detection in no_std? It seems like it's possible to do x86 feature detection with no_std, but ARM doesn't seem to support it.
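On the no_std question above, one option that needs nothing from std is to decide at compile time with cfg!(target_feature = ...). This is only a sketch, and it is not runtime detection: it reports what the build was compiled for (for example via -C target-feature), not what the CPU running the binary supports. The function name simd_backend is hypothetical and is not part of simdeez.

```rust
/// Reports which SIMD backend this build was compiled for.
/// Compile-time only: `cfg!` expands to a constant `true`/`false`,
/// so this needs no runtime support and works without `std`.
pub const fn simd_backend() -> &'static str {
    if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        "neon"
    } else if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        "avx2"
    } else if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {
        "sse2"
    } else {
        "scalar"
    }
}

fn main() {
    // `main` and `println!` use std for demonstration only;
    // `simd_backend` itself does not.
    println!("compiled for: {}", simd_backend());
}
```

The trade-off is that a single binary can no longer adapt to the machine it runs on, so this may or may not fit the runtime-selection design discussed earlier in the thread.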

verpeteren commented 10 months ago

Related to the feature detection, the Rust docs mention:

arduano commented 10 months ago

Sorry, did you send the above reply correctly? The code block seems empty.