arduano / simdeez

easy simd
MIT License

Trait overhaul (almost a rewrite) #58

Closed arduano closed 1 year ago

arduano commented 1 year ago

The idea started as a way to group similar operations together into traits, which would simplify a lot of other logic such as testing and overloads. So for example, instead of:

Simd::mul_ps, Simd::mul_pd, Simd::mullo_epi32, Simd::mullo_epi64, Simd::mullo_epi16

There's now just

SimdBase::mul, implemented for each Simd primitive

That makes it much easier to implement operator overloads via macros, and lets us write much better tests too.
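To make the grouping concrete, here is a minimal sketch of the idea rather than simdeez's actual definitions. The trait shape, the `set1` constructor, the array-backed `F32x4` and `I32x4` types, and the `impl_mul_operator!` macro are all assumptions made up for illustration; the real primitives wrap intrinsic registers.

```rust
use std::ops::Mul;

/// Hypothetical stand-in for a trait like `SimdBase` (not the crate's real definition).
trait SimdBase: Copy {
    type Scalar;
    /// Broadcast one scalar into every lane.
    fn set1(value: Self::Scalar) -> Self;
    /// One `mul` per primitive replaces mul_ps / mul_pd / mullo_epi32 / ...
    fn mul(self, rhs: Self) -> Self;
}

// Array-backed placeholders for the SIMD primitives (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);
#[derive(Clone, Copy, Debug, PartialEq)]
struct I32x4([i32; 4]);

impl SimdBase for F32x4 {
    type Scalar = f32;
    fn set1(value: f32) -> Self {
        F32x4([value; 4])
    }
    fn mul(self, rhs: Self) -> Self {
        F32x4(std::array::from_fn(|i| self.0[i] * rhs.0[i]))
    }
}

impl SimdBase for I32x4 {
    type Scalar = i32;
    fn set1(value: i32) -> Self {
        I32x4([value; 4])
    }
    fn mul(self, rhs: Self) -> Self {
        // "mullo" semantics: keep the low 32 bits of each product.
        I32x4(std::array::from_fn(|i| self.0[i].wrapping_mul(rhs.0[i])))
    }
}

/// With a shared trait, one macro can generate the `*` overload for every primitive.
macro_rules! impl_mul_operator {
    ($($ty:ty),*) => {$(
        impl Mul for $ty {
            type Output = Self;
            fn mul(self, rhs: Self) -> Self {
                SimdBase::mul(self, rhs)
            }
        }
    )*};
}
impl_mul_operator!(F32x4, I32x4);

/// Generic code (including tests) can be written once against the trait.
fn square<T: SimdBase>(v: T) -> T {
    SimdBase::mul(v, v)
}
```

With this shape, `F32x4::set1(2.0) * F32x4::set1(3.0)` and `square(I32x4::set1(7))` both go through the same `mul`, so a single generic test can cover every primitive.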

Some functions weren't ported across and are still implemented in the Simd trait (to be deleted later). What those functions have in common is that their behavior varies between architectures. For example, castps_pd acts differently in scalar than in every other SIMD target. The same goes for things like shuffle_epi32, which would execute very differently depending on the lane width of the architecture.

Architecture-specific functions like that should be implemented manually by the user of simdeez if they are required, rather than by simdeez itself. Simdeez acts as a general baseline interface that's enough for like 95% of use cases.

The only breaking change in this PR should be that the intrinsic value stored in our primitives is no longer public. Instead, we need to use an unsafe function to access it. So instead of I16x8(var) it's now I16x8::from_underlying_value(var).

Another major change is that most operational functions (functions that execute on existing values, rather than reading/creating new values) are no longer unsafe. Most of them technically weren't unsafe already due to operator overloading, but it is assumed that if the value was created (unsafely) and already exists then it should be safe to perform any operations on it.
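The two changes above (a private inner value behind an unsafe constructor, and safe operations on values that already exist) can be sketched roughly as follows. This is an illustration only: the array-backed storage, the `underlying_value` accessor, and the `add_lanes` method are hypothetical names, and the safety comment describes what such a contract would presumably look like rather than quoting the crate.

```rust
mod primitives {
    /// Hypothetical stand-in for `I16x8`; the real type wraps an intrinsic register.
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct I16x8([i16; 8]); // the field is private outside this module

    impl I16x8 {
        /// # Safety
        /// In a real SIMD wrapper the caller would have to guarantee that the
        /// CPU supports the instruction set this type belongs to.
        pub unsafe fn from_underlying_value(value: [i16; 8]) -> Self {
            I16x8(value)
        }

        /// Read the wrapped value back out (assumed accessor name).
        pub fn underlying_value(self) -> [i16; 8] {
            self.0
        }

        /// Safe operation: if the value exists, it was created through the
        /// unsafe constructor, so the precondition was already checked there.
        pub fn add_lanes(self, rhs: Self) -> Self {
            I16x8(std::array::from_fn(|i| self.0[i].wrapping_add(rhs.0[i])))
        }
    }
}

use primitives::I16x8;
```

The `unsafe` block then appears once, where the value is created, for example `let a = unsafe { I16x8::from_underlying_value([1; 8]) };`, and everything after that is ordinary safe code.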

arduano commented 1 year ago

Just noticed there are some missing tests (bitshift, horizontal add), will add those first.

arduano commented 1 year ago

@jackmott You might find this interesting, but the floor/ceil operations under SSE2 that do register manipulation actually get optimized away in release mode

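For example, this function:

fn floor(self) -> Self {
    unsafe {
        // Save the current MXCSR control/status register.
        let t1 = _mm_getcsr();
        // Set bit 13 of the rounding-control field
        // (round toward negative infinity, assuming bit 14 is clear).
        let t2 = t1 | (1 << 13);
        _mm_setcsr(t2);
        let r = self.round();
        // Restore the original rounding mode.
        _mm_setcsr(t1);
        r
    }
}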

I decided to try running tests in release mode because why not, and suddenly they don't pass anymore
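One way to avoid depending on MXCSR at all is to build floor from truncation, which always rounds toward zero regardless of the current rounding mode. The sketch below shows that idea per lane in plain scalar Rust; it is not the fix used in this PR, and the function names are made up. An SSE2 version would presumably combine _mm_cvttps_epi32, _mm_cvtepi32_ps, a compare, and a subtract in the same pattern.

```rust
/// Floor of a single lane without touching the rounding mode.
fn floor_via_trunc(x: f32) -> f32 {
    // Floats with magnitude >= 2^23 have no fractional bits, so they are
    // already integers. NaN and infinities also take this early return,
    // because the comparison is false for them.
    if !(x.abs() < 8_388_608.0) {
        return x;
    }
    // `as i32` truncates toward zero; the value fits because |x| < 2^23.
    let truncated = (x as i32) as f32;
    // Truncation rounds negative non-integers up, so step back down by one.
    if truncated > x {
        truncated - 1.0
    } else {
        truncated
    }
}

/// Apply the same per-lane logic to a 4-lane value (array used as a stand-in).
fn floor_lanes(v: [f32; 4]) -> [f32; 4] {
    v.map(floor_via_trunc)
}
```

One caveat of this sketch: an input of -0.0 comes back as +0.0, which compares equal but has a different sign bit than a true floor would give.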

arduano commented 1 year ago

It appears that the original cvt algorithms are incorrect, especially for 64-bit numbers, where none of them work at all. So I'll need to either fall back to scalar for those or find a working implementation built from bit operations.

Also, I've been finding some interesting edge cases in how the actual instructions behave, e.g. _mm_cvtps_epi32(val) and _mm256_cvtps_epi32(val) return -2147483648 instead of 2147483647 when val is a large positive number, while val as i32 saturates to 2147483647.
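The mismatch is easy to reproduce in isolation. The sketch below compares the first lane of _mm_cvtps_epi32 with Rust's `as` cast, and shows one possible fix-up for large positive inputs. The helper names are hypothetical, and the non-x86_64 branch is only a rough model of the instruction (it returns i32::MIN for out-of-range and NaN inputs, and does not model round-to-nearest-even for ties).

```rust
/// First lane of the packed f32 -> i32 conversion.
#[cfg(target_arch = "x86_64")]
fn cvtps_first_lane(val: f32) -> i32 {
    use std::arch::x86_64::{_mm_cvtps_epi32, _mm_cvtsi128_si32, _mm_set1_ps};
    // SSE2 is part of the x86_64 baseline, so these intrinsics are always available.
    unsafe { _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(val))) }
}

/// Rough portable model for other targets (an assumption, not the real instruction).
#[cfg(not(target_arch = "x86_64"))]
fn cvtps_first_lane(val: f32) -> i32 {
    if val.is_nan() || val >= 2_147_483_648.0 || val < -2_147_483_648.0 {
        i32::MIN // the instruction's out-of-range result, 0x80000000
    } else {
        val.round() as i32 // ties-to-even is not modeled here
    }
}

/// One possible fix-up so large positive inputs match `val as i32`.
/// NaN is not handled: `as` yields 0 there, the instruction yields i32::MIN.
fn cvt_saturating(val: f32) -> i32 {
    if val >= 2_147_483_648.0 {
        i32::MAX
    } else {
        cvtps_first_lane(val)
    }
}
```

In a vectorized implementation the same fix-up would be a compare against 2^31 followed by a blend, applied to all lanes at once.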