Closed arduano closed 1 year ago
Just noticed there are some missing tests (bitshift, horizontal add); I'll add those first.
@jackmott You might find this interesting: the floor/ceil operations under SSE2 that manipulate the MXCSR register actually get optimized away in release mode.
For example, functions like this one:
    fn floor(self) -> Self {
        unsafe {
            let t1 = _mm_getcsr();
            let t2 = t1 | (1 << 13);
            _mm_setcsr(t2);
            let r = self.round();
            _mm_setcsr(t1);
            r
        }
    }
I decided to try running the tests in release mode (because why not), and suddenly they don't pass anymore.
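One way to avoid depending on the MXCSR rounding mode at all, as a rough sketch rather than what this PR actually does: truncate with `_mm_cvttps_epi32`, convert back, and subtract 1 wherever truncation rounded up. The name `floor_sse2` and the array-in/array-out signature are made up for illustration, and the trick is only valid for inputs whose magnitude fits in an i32. On non-x86_64 targets the sketch simply falls back to `f32::floor`.

```rust
/// Hypothetical helper: floor of four f32 lanes without touching MXCSR.
/// Only valid when every input fits in an i32; the sign of -0.0 is not preserved.
#[cfg(target_arch = "x86_64")]
pub fn floor_sse2(v: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    // SSE2 is part of the x86_64 baseline, so these intrinsics are always available here.
    unsafe {
        let x = _mm_loadu_ps(v.as_ptr());
        // Truncate toward zero, then convert back to float.
        let t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
        // Truncation rounds negative non-integers up, so t > x exactly in those lanes.
        let rounded_up = _mm_cmpgt_ps(t, x);
        // Subtract 1.0 only in the lanes where the mask is set.
        let r = _mm_sub_ps(t, _mm_and_ps(rounded_up, _mm_set1_ps(1.0)));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        out
    }
}

/// Scalar fallback so the sketch compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn floor_sse2(v: [f32; 4]) -> [f32; 4] {
    v.map(f32::floor)
}
```

Because nothing here reads or writes control-register state, there is no side effect for the optimizer to lose track of.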
It appears that the original cvt algorithms are fairly incorrect, especially for 64-bit numbers, none of which work correctly at all. So I'll need to either fall back to scalar for them or find an actual working implementation built from bit operations.
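The scalar fallback option could look roughly like this. It is a minimal sketch, not the code in this PR: the function name and the two-lane array type are placeholders, and it assumes round-to-nearest with ties away from zero (what `f64::round` does), which differs from the hardware default of ties-to-even.

```rust
/// Hypothetical scalar fallback: convert two f64 lanes to i64 one lane at a time.
pub fn cvt_f64_to_i64_scalar(lanes: [f64; 2]) -> [i64; 2] {
    // `as` saturates at the i64 bounds and maps NaN to 0, so no lane can trigger UB.
    [lanes[0].round() as i64, lanes[1].round() as i64]
}
```

It is slower than a vectorized version, but it is correct by construction for every input, which makes it a useful reference to test any bit-operation implementation against.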
Also, I've been finding some interesting undefined behavior edge cases in actual instruction implementations. For example, _mm_cvtps_epi32(val) and _mm256_cvtps_epi32(val), where val is a large positive number, produce -2147483648 instead of 2147483647, while val as i32 produces 2147483647.
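A small sketch to reproduce the mismatch described above. The function names `cvt_hw` and `cvt_cast` are made up for illustration. On x86_64 the first one calls the real SSE2 intrinsic; on other targets it is only an emulation of the out-of-range result so the example still compiles.

```rust
/// Hypothetical helper: convert one f32 through the SSE2 instruction and read lane 0.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
pub fn cvt_hw(val: f32) -> i32 {
    use std::arch::x86_64::{_mm_cvtps_epi32, _mm_cvtsi128_si32, _mm_set1_ps};
    // SSE2 is part of the x86_64 baseline, so these intrinsics are always available here.
    unsafe { _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(val))) }
}

/// Emulation for other targets (an assumption, not real hardware behavior):
/// out-of-range or NaN inputs yield 0x8000_0000, i.e. i32::MIN.
#[cfg(not(target_arch = "x86_64"))]
pub fn cvt_hw(val: f32) -> i32 {
    if val.is_nan() || val >= 2147483648.0 || val < -2147483648.0 {
        i32::MIN
    } else {
        val.round() as i32
    }
}

/// Plain Rust cast: saturates at the i32 bounds.
pub fn cvt_cast(val: f32) -> i32 {
    val as i32
}
```

With `val = 3.0e9`, `cvt_hw` gives i32::MIN while `cvt_cast` gives i32::MAX, so any test comparing a SIMD conversion against a scalar reference needs to pick one behavior for out-of-range inputs.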
The idea started as a way to group similar operations together into traits, which would simplify a lot of other logic such as testing and overloads. So for example, instead of Simd::mul_ps, Simd::mul_pd, Simd::mullo_epi32, Simd::mullo_epi64 and Simd::mullo_epi16, there's now just SimdBase::mul, implemented for each Simd primitive.
And that lets us implement overloads via macros much more easily, and lets us create much better tests too.
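To illustrate the shape of this, here is a toy sketch. It is not the crate's real definition: the trait below only borrows the SimdBase name, and the single-lane types `F32x1` and `I32x1`, the macro `impl_mul_operator` and the helper `square` are all invented for the example.

```rust
/// Toy stand-in for the trait: one `mul` instead of one method per element type.
pub trait SimdBase: Copy {
    fn mul(self, rhs: Self) -> Self;
}

/// Hypothetical single-lane "primitives" standing in for real vector types.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x1(pub f32);
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x1(pub i32);

impl SimdBase for F32x1 {
    fn mul(self, rhs: Self) -> Self {
        F32x1(self.0 * rhs.0)
    }
}

impl SimdBase for I32x1 {
    // Integer multiply keeps the low bits, like the mullo family.
    fn mul(self, rhs: Self) -> Self {
        I32x1(self.0.wrapping_mul(rhs.0))
    }
}

/// One macro generates the `*` overload for every primitive.
macro_rules! impl_mul_operator {
    ($($ty:ty),*) => {
        $(
            impl std::ops::Mul for $ty {
                type Output = Self;
                fn mul(self, rhs: Self) -> Self {
                    SimdBase::mul(self, rhs)
                }
            }
        )*
    };
}
impl_mul_operator!(F32x1, I32x1);

/// Generic code (and generic tests) can now be written once for all primitives.
pub fn square<T: SimdBase>(x: T) -> T {
    x.mul(x)
}
```

The same pattern extends to the other operators, and a generic test helper like `square` can be instantiated for every primitive instead of being copied per type.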
Some functions weren't ported across and are still implemented in the Simd trait (to be deleted later). The main thing in common between those functions is that their behavior varies between different architectures. For example, castps_pd acts differently in scalar as opposed to every other simd target. Same with things like shuffle_epi32, which would execute very differently depending on the lane width of the architecture.
Architecture-specific functions like that should be implemented manually by the user of simdeez if they are required, rather than by simdeez itself. Simdeez acts as a general baseline interface that's enough for roughly 95% of use cases.
The only breaking change in this PR should be the fact that the intrinsic value stored in our primitives is no longer public; instead, we need to use an unsafe function to access it. So instead of I16x8(var) it's now I16x8::from_underlying_value(var).
Another major change is that most operational functions (functions that execute on existing values, rather than reading/creating new values) are no longer unsafe. Most of them technically weren't unsafe already due to operator overloading, but the assumption is that if a value was created (unsafely) and already exists, then it should be safe to perform any operations on it.
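A minimal sketch of that safety model, under loud assumptions: this toy I16x8 wraps a plain array instead of a real intrinsic type, the accessor `underlying_value` is a made-up name, and only from_underlying_value matches a name mentioned above.

```rust
/// Toy stand-in for a primitive. In the real crate the field would hold an
/// intrinsic vector type and would be private to the defining module.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I16x8([i16; 8]);

impl I16x8 {
    /// # Safety
    /// The caller asserts that the instruction set this type represents is
    /// available on the running CPU.
    pub unsafe fn from_underlying_value(v: [i16; 8]) -> Self {
        I16x8(v)
    }

    /// Hypothetical accessor used here only to inspect results.
    pub fn underlying_value(self) -> [i16; 8] {
        self.0
    }
}

/// Operations on existing values are safe: holding an I16x8 is treated as
/// proof that someone already went through the unsafe constructor.
impl std::ops::Add for I16x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o = o.wrapping_add(*r);
        }
        I16x8(out)
    }
}
```

The single unsafe block ends up at the point where a value is created, and everything downstream can stay in safe code.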