WebAssembly / relaxed-simd

Relax the strict determinism requirements of SIMD operations.

Implementation-dependent reciprocal [sqrt] approximation instructions #4

Open stoklund opened 7 years ago

stoklund commented 7 years ago

The proposal in WebAssembly/simd#1 includes reciprocal and reciprocal square-root approximation instructions.

The corresponding scalar instructions are mentioned in the future features design document.

These instructions are available in the ARM, Intel, and MIPS SIMD instruction sets. They are typically implemented as table-driven, piecewise-linear approximations on the interval [0.5, 1) or [0.25, 1), then extended to the full exponent range. However, the exact nature of the approximation is implementation-dependent, so different ISAs will produce different results.

The approximations are either used as-is for low-precision graphics or physics code, or they are used as starting points and refined with one or more steps of Newton-Raphson.
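For reference, the refinement step in question is the standard Newton-Raphson update for 1/√x, which roughly doubles the number of correct bits per application. A minimal scalar sketch in C (illustrative only, not part of the proposal):

```c
/* One Newton-Raphson step for y ~= 1/sqrt(x):
   y' = y * (1.5 - 0.5 * x * y * y).
   Each application roughly doubles the number of correct bits. */
static double rsqrt_nr_step(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}
```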

jfbastien commented 7 years ago

My main objection to doing this early is that the results differ per platform, which WebAssembly has tried very hard to avoid. These instructions have a history of providing great perf when precision doesn't matter, so I see them as somewhat special.

I think we need to quantify these gains across "tier 1" ISAs, in the context where they're usually employed. We then need to also document what kind of results they return in different implementations so that our spec can provide some bounds (cf. JavaScript's original sin / cos for how not to do this).

stoklund commented 7 years ago

Fortunately, these functions are better behaved than sin/cos, so it should be easy to specify a quite tight maximum relative error along with an allowance for weird results near 0. For example, Intel promises |Relative Error| ≤ 1.5 × 2^−12, with subnormal inputs mapped to ±∞.

There should also be less room for software shenanigans since straight up computing 1/x or 1/sqrt(x) in one or two instructions is much faster than computing sin(x).

billbudge commented 7 years ago

A concern I have is that f64x2 is not supported well on platforms other than Intel. Implementations would be forced to lower these opcodes to the equivalent scalar ops, probably leading to zero or negative performance improvement. So I'm wondering if the f64x2 type should be included.

jfbastien commented 7 years ago

@billbudge (hi Bill!) is this specifically about f64x2 approximation instructions, or about f64x2 in general?

billbudge commented 7 years ago

Hi JF. Yes, f64x2 in general. We think it will be hard enough to get clear performance wins with f32x4, so we're somewhat skeptical about f64x2, especially on platforms like ARM.

jfbastien commented 7 years ago

In ARMv8, A64 supports f64x2. I'm happy with this, but I agree that this type is related to the "portable performance" bar I suggested.

I think it's a separate issue from approximation instructions though. Could you file it?

lemaitre commented 6 years ago

As long as the constraint about the reproducibility of results across clients is not softened, this question has no answer: it is not possible to propose such a function that gives the exact same result across architectures. SSE, Neon, Altivec, AVX512: all of those ISAs have different instructions giving different results.

But if this constraint disappears, then I think the best way would be to have something like this:
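A hypothetical scalar model of that idea, with invented names (per the later discussion: the required accuracy is embedded in the operation, and the engine expands it into the hardware estimate plus however many NR steps the target needs):

```c
#include <math.h>

/* Hypothetical model, names invented: the caller states the accuracy
   it needs, and the implementation picks the lowering. */
static float f32_rsqrt_approx(float x, int accuracy_bits /* immediate */) {
    float y = 1.0f / sqrtf(x);   /* stand-in for a ~12-bit HW estimate */
    /* Each Newton-Raphson step roughly doubles the correct bits. */
    for (int bits = 12; bits < accuracy_bits; bits *= 2)
        y = y * (1.5f - 0.5f * x * y * y);
    return y;
}
```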

gnzlbg commented 6 years ago

From the ISAs with a reciprocal square-root estimate, which one has the largest error?

Some (most?) ISAs also provide instructions that perform single Newton-Raphson iterations for this operation, so an alternative here would be to just specify an upper bound on the error, and use the reciprocal square-root instruction of the ISA when it satisfies the upper bound, and add one or more NR-iterations to that result as required in ISAs where this is not the case.
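On x86, for example, that lowering could be the ~12-bit SSE estimate plus one refinement step, which lands at roughly 23 correct bits. A sketch using SSE intrinsics (illustrative only):

```c
#include <xmmintrin.h>

/* ~12-bit hardware estimate tightened by one Newton-Raphson step:
   y' = 0.5 * y * (3 - x * y * y), giving roughly 23 correct bits. */
static __m128 rsqrt_f32x4_refined(__m128 x) {
    __m128 y     = _mm_rsqrt_ps(x);
    __m128 xyy   = _mm_mul_ps(x, _mm_mul_ps(y, y));
    __m128 half  = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}
```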

lemaitre commented 6 years ago

From the ISAs with a reciprocal square-root estimate, which one has the largest error?

I would say Neon, but I don't remember well. Most documents I have found don't specify the precision of this instruction, and some give a maximum error of 1/4096 (12 bits), which would be the same as x86. But in my experience it's less accurate than Intel's: either ARM is less accurate than 12 bits, or Intel is more accurate...

Some (most?) ISAs also provide instructions that perform single Newton-Raphson iterations for this operation

Only Neon supports such an instruction. With other ISAs, you need to implement the NR iteration yourself with a bunch of MUL/FMA.

an alternative here would be to just specify an upper bound on the error, and use the reciprocal square-root instruction of the ISA when it satisfies the upper bound, and add one or more NR-iterations to that result as required in ISAs where this is not the case.

I tend to think that's the way to do it. Actually, with what I proposed, the accuracy requirement was not fixed but embedded within the instruction.

Some extra thoughts: reciprocal (square root) on double precision is not available on SSE and AVX (but is available on Neon 64, VSX, and AVX512). It is possible to do a fast and rough estimation in that case, though.

AVX512 now has instructions that are more accurate for those operations: 14 and 28 bits.

gnzlbg commented 6 years ago

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc? The 32-bit instruction also has an error of 1/4096 like the SSE instructions.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.


Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.
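That fallback is a two-instruction sequence on most ISAs; with SSE, for instance (illustrative):

```c
#include <xmmintrin.h>

/* Conforming-but-slow fallback: full-precision sqrt, then divide. */
static __m128 rsqrt_f32x4_exact(__m128 x) {
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}
```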

gnzlbg commented 6 years ago

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

lemaitre commented 6 years ago

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc?

On 64 bits, there is xvrsqrtesp for F32 and xvrsqrtedp for F64 (both are available through the C intrinsic vec_rsqrte). I would assume they have the same precision, but I haven't put any effort into confirming that.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.

I've never used MIPS, so I'll take your word for it.

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

I have to disagree with your conclusion: one way to implement the fast reciprocal for F64 when you don't have a dedicated instruction is to convert your input to F32, call the fast reciprocal on F32, and convert back to F64. Doing that, you "inherit" the precision from F32. By setting the F64 accuracy requirement slightly higher than for F32, you clearly forbid such an implementation.
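A sketch of that conversion trick for the reciprocal, using SSE2 intrinsics (illustrative; the rsqrt case is analogous):

```c
#include <emmintrin.h>

/* Fast f64 reciprocal via the f32 estimate: narrow, estimate, widen.
   Inherits the ~12-bit f32 accuracy and the f32 range limits. */
static __m128d recip_f64x2_approx(__m128d x) {
    __m128 narrow = _mm_cvtpd_ps(x);    /* two f64 lanes -> two f32 lanes */
    __m128 est    = _mm_rcp_ps(narrow); /* ~12-bit 1/x estimate */
    return _mm_cvtps_pd(est);           /* widen back to f64 */
}
```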

Actually, that's why I would recommend the precision not be fixed: the instruction could take an immediate to specify the required accuracy, and when the assembly is generated, the fast reciprocal instruction would be used plus enough NR iterations to achieve that accuracy on the actual hardware.

So users would not write NR themselves; they would just specify that they want, for instance, a 20-bit accurate result, and that's it. However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up for debate.

Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.

I totally agree with you on that point. By the way, if the target precision is high enough, that implementation can be the fastest one possible.

gnzlbg commented 6 years ago

I have to disagree with your conclusion: one way to implement the fast reciprocal for F64 when you don't have a dedicated instruction is to convert your input to F32, call the fast reciprocal on F32, and convert back to F64. Doing that, you "inherit" the precision from F32. By setting the F64 accuracy requirement slightly higher than for F32, you clearly forbid such an implementation.

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is 3.4 × 10^38 while the largest representable f64 is 1.80 × 10^308. There is a whole range of f64s that just are not representable as f32s. At that point, why expose the version for f64 at all? I'd argue that it would be better to just not expose it, and let users do the truncation to f32 however they see fit (INFINITY, clamp, etc.).

Actually, that's why I would recommend the precision not be fixed: the instruction could take an immediate to specify the required accuracy, and when the assembly is generated, the fast reciprocal instruction would be used plus enough NR iterations to achieve that accuracy on the actual hardware.

This is actually a really interesting idea.

However, I think the accuracy requirement should be an immediate and not a runtime argument,

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic with different immediates); that's what the compiler would have to generate if it accepted a run-time argument anyway (although the compiler might be able to be a bit more clever about the switch, since it might know which options are available for a particular target).
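A sketch of that dispatch, reusing the invented f32_rsqrt_approx from the sketch further up (hypothetical; each branch passes a distinct compile-time immediate):

```c
/* Invented operation from the sketch above; accuracy is an immediate. */
float f32_rsqrt_approx(float x, int accuracy_bits);

/* Run-time accuracy request mapped onto the immediate form: each
   branch is a call site with a constant the compiler can lower. */
static float rsqrt_runtime(float x, int bits) {
    if (bits <= 12) return f32_rsqrt_approx(x, 12);
    if (bits <= 24) return f32_rsqrt_approx(x, 24);
    return f32_rsqrt_approx(x, 28);
}
```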

lemaitre commented 6 years ago

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is 3.4 × 10^38 while the largest representable f64 is 1.80 × 10^308. There is a whole range of f64s that just are not representable as f32s. At that point, why expose the version for f64 at all? I'd argue that it would be better to just not expose it, and let users do the truncation to f32 however they see fit.

I never said that was the only way to implement it. It's true that this specific implementation does not support inputs outside the range of F32.

Another way to implement it is with some bit hacks: https://en.wikipedia.org/wiki/Fast_inverse_square_root. The article presents only the F32 bit hack (which is no longer worthwhile on its own), but the same can be done for F64, and there it is still worthwhile, depending on the accuracy you need. It requires more NR iterations, but nothing fancy. This technique supports the whole range of F64 (except subnormals).
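For reference, a double-precision version of that hack (a sketch: 0x5FE6EB50C7B537A9 is the magic constant commonly cited for f64; two NR steps shown, add more for higher accuracy):

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root for f64 via the bit hack, then NR steps. */
static double fast_rsqrt_f64(double x) {
    uint64_t i;
    memcpy(&i, &x, sizeof i);               /* type-pun f64 -> u64 */
    i = 0x5FE6EB50C7B537A9ULL - (i >> 1);   /* initial guess */
    double y;
    memcpy(&y, &i, sizeof y);               /* back to f64 */
    y = y * (1.5 - 0.5 * x * y * y);        /* NR step 1 */
    y = y * (1.5 - 0.5 * x * y * y);        /* NR step 2 */
    return y;
}
```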

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic with different immediates); that's what the compiler would have to generate if it accepted a run-time argument anyway (although the compiler might be able to be a bit more clever about the switch, since it might know which options are available for a particular target).

Agreed. I don't see any valid use case for a runtime precision requirement anyway.

ngzhian commented 3 years ago

So users would not write NR themselves; they would just specify that they want, for instance, a 20-bit accurate result, and that's it. However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up for debate.

This is an interesting area to explore. Implementing the NR step seems simple enough on non-NEON backends. I would need some help figuring out how many steps of NR to perform to achieve the specified accuracy:

From https://www.felixcloutier.com/x86/rcpps, the relative error is |Relative Error| ≤ 1.5 × 2^−12, and from https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division: "the number of correct digits in the result roughly doubles for every iteration". So if the user wants 28 bits of accuracy for a float, how many rounds do I need?

(Sorry I know nothing about numerical analysis, any pointers here will be helpful)
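A rough answer from the doubling rule: starting from b0 correct bits, k steps give about b0 * 2^k, so k = ceil(log2(target / b0)). From 12 bits, a 28-bit target needs 2 steps; note, though, that an f32 only carries 24 significand bits, so getting 28 correct bits would require running the iteration in f64. A tiny helper (illustrative):

```c
#include <math.h>

/* Steps needed if each Newton-Raphson step doubles the correct bits:
   nr_rounds(12, 28) == 2, nr_rounds(12, 24) == 1. */
static int nr_rounds(double initial_bits, double target_bits) {
    return (int)ceil(log2(target_bits / initial_bits));
}
```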

Maratyszcza commented 2 years ago

Here is an old unpublished paper of mine. While its main topic is not related to reciprocal (sqrt) approximation operations, it does contain probably the best description and semi-analytical model of the subject in Section 4.

ngzhian commented 2 years ago

As discussed in https://github.com/WebAssembly/meetings/blob/main/simd/2022/SIMD-02-18.md#aob (search for reciprocal), this is likely out of scope, adding a label to indicate so.