JuliaLang / julia

The Julia Programming Language

fma operation? #6330

Closed StefanKarpinski closed 9 years ago

StefanKarpinski commented 10 years ago

LLVM has a floating-point fma operation:

http://llvm.org/docs/LangRef.html#llvm-fmuladd-intrinsic

Does it make sense to have an fma function? It makes sense for floating-point, can be emulated with integers and other numeric types with a simple fallback, and could allow matrix code doing A*B + C to avoid temporaries.
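A minimal sketch of what such a fallback might look like; the name fma_fallback is purely illustrative, and the matrix method leans on LinearAlgebra.mul!(C, A, B, α, β), which accumulates A*B*α + C*β in place:

using LinearAlgebra

# Generic fallback: plain multiply-add semantics for any numeric type.
fma_fallback(x, y, z) = x*y + z

# Matrix specialization: A*B + C without materializing a temporary for A*B.
function fma_fallback(A::AbstractMatrix, B::AbstractMatrix, C::AbstractMatrix)
    D = copy(C)                  # keep the caller's C intact
    mul!(D, A, B, true, true)    # accumulate A*B into D, giving A*B + C
end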

simonbyrne commented 10 years ago

I think we need some sort of interface, but I'm not sure what would be best. @stevengj mentioned that LLVM automatically rewrites multiply/add expressions when hardware fma is available, but it would be useful to have some control over this: in fact this is one of the recommended "should"s of the IEEE 754 standard (section 10.4).

In particular, it would be useful to:

stevengj commented 10 years ago

Note that if you go this route, you should really have access to four intrinsics:

fma(x, y, z) = x * y + z
fms(x, y, z) = x * y - z
fnma(x, y, z) = -x * y + z
fnms(x, y, z) = -x * y - z

(Plus a lot more if you go to SIMD versions, since there are vector instructions that allow you to interleave fma and fms within the vector, for example.)

The C99 standard screwed up in providing just one. We had to roll our own in FFTW. @matteo-frigo may know more about the current state of compiler enlightenment with these sorts of instructions.
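(A sketch of the other three variants in terms of a single correctly rounded fma, assuming such a primitive is available; negating inputs adds no extra roundings, although on hardware with dedicated fms/fnma/fnms instructions one would want the backend to fold the negations away.)

fms(x, y, z)  = fma(x, y, -z)     #  x*y - z
fnma(x, y, z) = fma(-x, y, z)     # -x*y + z
fnms(x, y, z) = fma(-x, y, -z)    # -x*y - z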

The right thing would be for LLVM (and Julia) to preserve an FMA when it occurs explicitly in the code, so that you can get the desired FMA just by parenthesizing your code in the desired way. As I mentioned on the mailing list, gcc didn't always do this, but I don't know where LLVM stands. In theory, I guess there should also be a way to prevent FMAs, although I've never had a practical need for this.

StefanKarpinski commented 10 years ago

Oh gawd. Do not want.

matteo-frigo commented 10 years ago

Just to make it more interesting, here are a couple of issues with FMA, in addition to those pointed out by stevengj.

First, the FMA implemented by PowerPC and by Intel differ in (at least) one subtle way. PowerPC computes +/-(a*b +/- c), whereas Intel implements +/-(a*b) +/- c. The two differ when the result is -0; for example, -(+0*+0 - +0) = -0, whereas +0 - +0*+0 = +0. So a pedantic (I dare not say correct) compiler cannot translate the PowerPC expression into the x86 expression.

Second, common-subexpression elimination makes things interesting. For example, FFTs tend to compute pairs of expressions of the form a +/- b*c, which reduces to two fma's. However, a sufficiently smart compiler will tend to compute t = b*c; a +/- t, which is three flops. Historically in fftw, we were able to defeat gcc-3.x with some hackish workaround, but gcc-4.[0-2] had a more formidable optimizer that was always isolating the common subexpression no matter what. I haven't tried more recent gcc's or llvm in this respect, but that's something to watch for.
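A small Julia sketch of the pattern described above (names are purely illustrative):

# The butterfly a ± b*c written as two fused operations:
butterfly_fma(a, b, c) = (fma(b, c, a), fma(-b, c, a))

# What an aggressive CSE pass tends to produce instead: the product is
# rounded once into t, and the pair then costs three separate flops.
function butterfly_cse(a, b, c)
    t = b*c
    (a + t, a - t)
end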

eschnett commented 10 years ago

OpenCL has two multiply-add operations, fma and mad. The former (fma) specifies that the fused, more accurate semantics is required; unless the equivalent of -ffast-math is in effect, this can be expensive if the hardware does not support it. (fma() also exists in C.) The latter (mad) is simply a combined multiply-add operation that simplifies the compiler's task of generating efficient code; it will only translate to a hardware instruction if this is faster than a multiply-add sequence.

Which of these do we want for Julia? Both? fma() is clearly useful -- if you need the increased accuracy, you need it. mad() is a nice convenience to have, in particular for types (matrices?) where the compiler otherwise would not be able to synthesize the combined operation.
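A minimal Float64 illustration of the accuracy point (the exact product of the two factors differs from 1.0 by far less than one ulp, so an unfused multiply loses that difference entirely):

x = 1.0 + 2.0^-29
y = 1.0 - 2.0^-29        # exact product is 1 - 2^-58

x*y - 1.0                # 0.0: the product has already been rounded to 1.0
fma(x, y, -1.0)          # -2^-58 exactly, recovered by the fused operation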

simonbyrne commented 10 years ago

I've been thinking about this recently as well. From my own limited experience, there are 3 common situations:

1) Use them if more efficient; if not, fall back on separate instructions: e.g. matrix multiplication, Horner polynomial evaluation.
2) Do not use them at all: e.g. complex multiplication, the quadratic discriminant.
3) If they're available in hardware, do one thing, otherwise do something else: e.g. double-double arithmetic.

I'm not sure where the situation mentioned by @matteo-frigo falls, or if it deserves another category.

(1) is perhaps the most common:

(2) is the current behaviour. If we were following either of the first two or the macro approach above, we might want to consider an @fmanosubst macro for disabling @fmasubst in subblocks.

(3) I'm not sure of the best way to handle this. We could have a global const NATIVEFMA::Bool indicating whether or not the instruction is available, as well as the fma function proposed by @eschnett. The mulpi_ext function from #8056 could then be written as:

function mulpi_ext(x::Float64)
    m    = 3.141592653589793      # pi rounded to Float64
    m_hi = 3.1415926218032837     # m_hi + m_lo gives pi to roughly twice Float64 precision
    m_lo = 3.178650954705639e-8

    y_hi = m*x
    if NATIVEFMA
        # with fma(): recover the rounding error of m*x directly
        y_lo = fma(m_lo, x, fma(m_hi, x, -y_hi))
    else
        # without fma(): Dekker/Veltkamp-style splitting of x into high and low parts
        u = 134217729.0*x         # 0x1p27 + 1
        x_hi = u - (u - x)
        x_lo = x - x_hi
        y_lo = x_hi*m_lo + (x_lo*m_hi + ((x_hi*m_hi - y_hi) + x_lo*m_lo))
    end
    Double64(y_hi, y_lo)
end

Admittedly, it would be a bit messy, but it would mostly be restricted to library code. This also assumes that the compiler is able to replace fma(A,B,-C) with a fused multiply-subtract (I think this is a valid transformation?).

eschnett commented 10 years ago

Isn't your case (3) the same as (1)? Whether an instruction is available in hardware or needs to be open-coded as a sequence of instructions only affects performance, hence this is a choice about how fast things should run.

The fma() operation (without intermediate rounding) is part of the IEEE standard; we should provide it in any case.

Your explicit code transformation is equivalent to @fastmath or -ffast-math. There are typically a few similar IEEE features where the programmer may want to specify "I didn't think about associativity here; feel free to re-associate for performance" or "I don't care about inf/nan/denormal semantics here; feel free to optimize".

I'd go for expressing semantics only in the code, with muladd() as a convenience function and fma() as per IEEE standard. Optimizations and allowed transformations come later, either via macros or via compiler options.
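A sketch of the two contracts side by side (the generic definition shown here is only illustrative):

# fma(x, y, z): IEEE 754 fusedMultiplyAdd, i.e. x*y + z with a single rounding,
# whether or not the hardware has an instruction for it.

# muladd(x, y, z): "compute x*y + z however is fastest"; the compiler may fuse
# or not, so no extra accuracy is promised. A generic fallback could be:
muladd_sketch(x, y, z) = x*y + z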

simonbyrne commented 10 years ago

Isn't your case (3) the same as (1)? Whether an instruction is available in hardware or needs to be open-coded as a sequence of instructions only affects performance, hence this is a choice about how fast things should run.

The distinction is that in (1) you're happy for the fallback to be simply a*b+c (i.e. you just want the fma for speed) whereas in (3) you're not (i.e. you're using the fma for the extra accuracy). In some sense, this mirrors the distinction between your proposed muladd and fma functions.

The other point is that in (3) you probably want to write the fallback separately, as software fma's are woefully inefficient: e.g. see the openlibm code.

+1 for a @fastmath macro.
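A sketch of how such a macro might be used (the polynomial example and the exact set of allowed rewrites are hypothetical here):

# Inside the annotated expression the compiler would be free to contract
# multiply-add pairs (and apply other fast-math rewrites).
horner3(x, c0, c1, c2, c3) = @fastmath c0 + x*(c1 + x*(c2 + x*c3))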

simonbyrne commented 9 years ago

I think this issue can now be closed, as #9855 covers the indicator issue.