Add f32xN::recip and f32xN::recip_sqrt

Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.

https://docs.rs/wide

zlib License

Add f32xN::recip and f32xN::recip_sqrt #71

Closed RazrFalcon closed 3 years ago

RazrFalcon commented 3 years ago

Closes #69

RazrFalcon commented 3 years ago

I'm getting different rounding depending on the platform. How should I compare those numbers?

Lokathor commented 3 years ago

I think, if the result is approximately equal to the expected value, within a low tolerance, then it's probably "good enough".

probably 0.0000001 or something? I don't think a tolerance as low as f32::EPSILON will work.
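A minimal sketch of the tolerance-based comparison suggested here, assuming each SIMD result has already been unpacked into a plain `[f32; 4]`. The helper name `lanes_approx_eq` and the sample values are made up for illustration and are not part of `wide`.

```rust
/// Hypothetical test helper (not part of `wide`): returns true when every
/// lane of `actual` is within `tol` of the matching lane of `expected`.
/// A NaN in either array makes the comparison fail for that lane.
fn lanes_approx_eq(actual: [f32; 4], expected: [f32; 4], tol: f32) -> bool {
    actual
        .iter()
        .zip(expected.iter())
        .all(|(a, e)| (a - e).abs() <= tol)
}

fn main() {
    let expected = [0.5_f32; 4];
    // Made-up approximate output, off by about 1e-4 in every lane.
    let actual = [0.4999_f32; 4];

    // A loose tolerance accepts the approximation...
    println!("tol 1e-3:    {}", lanes_approx_eq(actual, expected, 1e-3));
    // ...while a tolerance as tight as f32::EPSILON rejects it.
    println!("tol EPSILON: {}", lanes_approx_eq(actual, expected, f32::EPSILON));
}
```

The right tolerance depends on how accurate the underlying instruction is, so it is left as a parameter rather than hard-coded.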

RazrFalcon commented 3 years ago

The current results are significantly different.

i586 - 0.49987793 vs 0.5
i686 - 0.70703125 vs 0.7069092
x86_64 - 0.70703125 vs 0.7069092

Lokathor commented 3 years ago

Seems to be within about 0.00013 in all cases. Is that enough precision?

RazrFalcon commented 3 years ago

Is it OK that the results are not identical?

Lokathor commented 3 years ago

I'm confused, are all three cases using the same input value?

I'd understand if the "actual" output changed, but why is the "expected" output changing?

RazrFalcon commented 3 years ago

No, the first one is different only on i586, not on the others... I should add some kind of hack there too.

My question is: does your library guarantee exact results for SIMD and scalar code? Or is it even possible with SIMD?

So the results above are:

println!("{:?}", (1.0_f32 / 2.0).sqrt()); // 0.70710677

use core::arch::x86_64::*;
unsafe {
    let n = _mm_rsqrt_ps(_mm_set1_ps(2.0));
    println!("{:?}", std::mem::transmute::<__m128, [f32; 4]>(n)); // [0.7069092, 0.7069092, 0.7069092, 0.7069092]
}

PS: The _mm_rsqrt_ps returns 0.7069092 on Rust Playground, but 0.70703125 locally...
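Both values are plausible outputs of the same instruction: Intel documents the maximum relative error of `_mm_rsqrt_ps` as less than 1.5 * 2^-12, and different CPUs may return different estimates inside that bound. The scalar sketch below checks the two values reported here against that bound. The constant and function names are illustrative only.

```rust
/// Upper bound on the relative error that Intel documents for `_mm_rsqrt_ps`.
const RSQRT_MAX_REL_ERR: f32 = 1.5 / 4096.0; // 1.5 * 2^-12

/// Relative error of `approx` with respect to `exact`.
fn rel_err(approx: f32, exact: f32) -> f32 {
    ((approx - exact) / exact).abs()
}

fn main() {
    // Scalar reference value for 1/sqrt(2).
    let exact = 1.0 / 2.0_f32.sqrt();

    // The two estimates reported in this thread (Playground and local machine).
    for approx in [0.7069092_f32, 0.70703125_f32] {
        let err = rel_err(approx, exact);
        println!(
            "{:?}: relative error {:e}, within documented bound: {}",
            approx,
            err,
            err < RSQRT_MAX_REL_ERR
        );
    }
}
```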

Lokathor commented 3 years ago

Results are not assured to be exact, no. Particularly with recip and recip_sqrt, the idea is basically that you're giving up some accuracy for speed.

Lokathor commented 3 years ago

For example:

// actual division, slower
let c = a / b;

// mul with reciprocal is intended to be faster than div, but less accurate.
let d = a * b.recip();
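To make the trade-off concrete, here is a scalar sketch. `approx_recip` is a hypothetical stand-in that mimics a roughly 12-bit reciprocal estimate by truncating the mantissa of the exact result. It is not how any real instruction or `wide` computes the value, but it shows the kind of error that multiplying by an estimate introduces.

```rust
/// Hypothetical stand-in for a hardware reciprocal estimate: computes 1/x
/// exactly, then clears the low 11 mantissa bits so that only about 12 bits
/// of precision remain. Real instructions round differently.
fn approx_recip(x: f32) -> f32 {
    f32::from_bits((1.0 / x).to_bits() & !0x7FF)
}

fn main() {
    let (a, b) = (10.0_f32, 3.0_f32);

    // Actual division: correctly rounded.
    let exact = a / b;
    // Multiplication by a low-precision reciprocal: close, but not identical.
    let fast = a * approx_recip(b);

    let rel_err = ((fast - exact) / exact).abs();
    println!("a / b               = {:?}", exact);
    println!("a * approx_recip(b) = {:?}", fast);
    println!("relative error      = {:e}", rel_err);
}
```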

RazrFalcon commented 3 years ago

I see. The _mm_rsqrt_ps doc indeed mentions approximation.

RazrFalcon commented 3 years ago

Kinda done. In the end I've removed the Inf/NaN tests, because the results are too random.

Lokathor commented 3 years ago

There should be a test for Inf, I think, though I'd buy an argument that NaN can just return NaN again or whatever.

RazrFalcon commented 3 years ago

The problem is that I'm getting different results depending on the target again... Not sure if I've messed something up.

Lokathor commented 3 years ago

What are the results on which targets?

RazrFalcon commented 3 years ago

~~Looks like it's a software_sqrt bug.~~

~~software_sqrt((1.0 / -f32::INFINITY) as f64) as f32 returns -0 instead of NaN.~~

~~UPD: or not... (1.0 / -f32::INFINITY).sqrt() also returns -0. But _mm_rsqrt_ps returns NaN.~~

Whatever... the formula is 1/sqrt(n), not sqrt(1/n)...
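The two formulas agree for positive finite inputs but diverge at the edges, which is what produced the -0 versus NaN mismatch. A small scalar sketch follows; the function names are illustrative and not taken from `wide`.

```rust
/// 1 / sqrt(n): the formula a reciprocal square root is meant to compute.
fn one_over_sqrt(n: f32) -> f32 {
    1.0 / n.sqrt()
}

/// sqrt(1 / n): equal for positive finite n, but not for edge cases.
fn sqrt_of_recip(n: f32) -> f32 {
    (1.0 / n).sqrt()
}

fn main() {
    for n in [4.0_f32, f32::INFINITY, f32::NEG_INFINITY, -0.0_f32] {
        println!(
            "n = {:?}: 1/sqrt(n) = {:?}, sqrt(1/n) = {:?}",
            n,
            one_over_sqrt(n),
            sqrt_of_recip(n)
        );
    }
}
```

For `-f32::INFINITY`, `sqrt(n)` is NaN, so `1/sqrt(n)` is NaN, whereas `1/n` is -0 and the square root of -0 is -0.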

RazrFalcon commented 3 years ago

Done.

RazrFalcon commented 3 years ago

Ping?

Lokathor commented 3 years ago

ah! sorry, missed the first message I guess.

RazrFalcon commented 3 years ago

Thanks. No problem.

Lokathor commented 3 years ago

released wide-0.6.1

RazrFalcon commented 3 years ago

tiny-skia uses wide for SIMD handling now. This change removed a lot of unsafe code. And since wide already supports some 256-bit types, it looks like a good opportunity to try them out.