Closed: StefanKarpinski closed this issue 9 years ago.
I think we need some sort of interface, but I'm not sure what would be best. @stevengj mentioned that LLVM automatically rewrites multiply/add expressions when hardware fma is available, but it would be useful to have some control over this: in fact this is one of the recommended "should"s of the IEEE 754 standard (section 10.4).
In particular, it would be useful to control which multiplication gets fused in an expression like `a*b + c*d`: part of the problem here is that Julia ignores unnecessary parentheses, so it is impossible to specify which of the multiplications should be fused.

Note that if you go this route, you should really have access to four intrinsics:
```julia
fma(x, y, z)  =  x*y + z
fms(x, y, z)  =  x*y - z
fnma(x, y, z) = -x*y + z
fnms(x, y, z) = -x*y - z
```
(Plus a lot more if you go to SIMD versions, since there are vector instructions that allow you to interleave fma and fms within the vector, for example.)
The C99 standard screwed up in providing just one. We had to roll our own in FFTW. @matteo-frigo may know more about the current state of compiler enlightenment with these sorts of instructions.
The right thing would be for LLVM (and Julia) to preserve an FMA when it occurs explicitly in the code, so that you can get the desired FMA just by parenthesizing your code in the desired way. As I mentioned on the mailing list, gcc didn't always do this, but I don't know where LLVM stands. In theory, I guess there should also be a way to prevent FMAs, although I've never had a practical need for this.
Oh gawd. Do not want.
Just to make it more interesting, here are a couple of issues with FMA, in addition to those pointed out by stevengj.
First, the FMAs implemented by PowerPC and by Intel differ in (at least) one subtle way. PowerPC computes `+/-(a*b +/- c)`, whereas Intel implements `+/-(a*b) +/- c`. The two differ when the result is -0; for example, `-(+0*+0 - +0) = -0`, whereas `+0 - +0*+0 = +0`. So a pedantic (I dare not say correct) compiler cannot translate the PowerPC expression into the x86 expression.
Second, common-subexpression elimination makes things interesting. For example, FFTs tend to compute pairs of expressions of the form `a +/- b*c`, which reduce to two fma's. However, a sufficiently smart compiler will tend to compute `t = b*c; a +/- t`, which is three flops. Historically in fftw, we were able to defeat gcc-3.x with some hackish workaround, but gcc-4.[0-2] had a more formidable optimizer that was always isolating the common subexpression no matter what. I haven't tried more recent gcc's or llvm in this respect, but that's something to watch for.
OpenCL has two fma operations, `fma` and `mad`. The former (`fma`) specifies that the fused, more accurate semantics is required; unless the equivalent of `-ffast-math` is in effect, this can be expensive if the hardware does not support it. (`fma()` also exists in C.) The latter (`mad`) is simply a combined multiply-add operation that simplifies the compiler's task of generating efficient code; it will only translate to a hardware instruction if this is faster than a multiply-add sequence.
Which of these do we want for Julia? Both? `fma()` is clearly useful -- if you need the increased accuracy, you need it. `mad()` is a nice convenience to have, in particular for types (matrices?) where the compiler otherwise would not be able to synthesize the combined operation.
I've been thinking about this recently as well. From my own limited experience, there are 3 common situations:

1. Use them if more efficient; if not, fall back on separate instructions: e.g. matrix multiplication, Horner polynomial evaluation.
2. Do not use them at all: e.g. complex multiplication, the quadratic discriminant.
3. If they're available in hardware, do one thing, otherwise do something else: e.g. double-double arithmetic.

I'm not sure where the situation mentioned by @matteo-frigo falls, or if it deserves another category.
(1) is perhaps the most common: a `-ffast-math`-style command-line option would be another approach, though it would be a nightmare for writing reliable code, and difficult to test (all unit tests would have to be run twice). A `mad` function seems like a reasonable approach, though it could reduce readability. Another option would be an `@fmasubst` macro that would allow the compiler to substitute freely within certain blocks. As I mentioned above, it would be nice to have tools to determine if substitution occurred, though these don't need to be part of the core language.

(2) is the current behaviour. If we were following either of the first two or the macro approach above, we might want to consider an `@fmanosubst` macro for disabling `@fmasubst` in subblocks.

(3) I'm not sure of the best way to handle this. We could have a global `const NATIVEFMA::Bool`, indicating whether or not the instruction is available, as well as the `fma` function proposed by @eschnett. The `mulpi_ext` function from #8056 could then be written as:
```julia
function mulpi_ext(x::Float64)
    m    = 3.141592653589793
    m_hi = 3.1415926218032837
    m_lo = 3.178650954705639e-8

    y_hi = m*x
    if NATIVEFMA
        # with fma()
        y_lo = fma(m_lo, x, fma(m_hi, x, -y_hi))
    else
        # without fma(): split x via Veltkamp/Dekker splitting
        u = 134217729.0*x # 134217729 = 0x1p27 + 1
        x_hi = u-(u-x)
        x_lo = x-x_hi
        y_lo = x_hi*m_lo + (x_lo*m_hi + ((x_hi*m_hi - y_hi) + x_lo*m_lo))
    end
    Double64(y_hi, y_lo)
end
```
Admittedly, it would be a bit messy, but it would mostly be restricted to library code. This also assumes that the compiler is able to replace the `fma(A, B, -C)` with a fused multiply-subtract (I think that this is a valid transformation?).
Isn't your case (3) the same as (1)? Whether an instruction is available in hardware or needs to be open-coded as a sequence of instructions only affects performance, hence this is a choice about how fast things should run.
The fma() operation (without intermediate rounding) is part of the IEEE standard; we should provide it in any case.
Your explicit code transformation is equivalent to @fastmath or -ffast-math. There are typically a few similar IEEE features where the programmer may want to specify "I didn't think about associativity here; feel free to re-associate for performance" or "I don't care about inf/nan/denormal semantics here; feel free to optimize".
I'd go for expressing semantics only in the code, with muladd() as a convenience function and fma() as per IEEE standard. Optimizations and allowed transformations come later, either via macros or via compiler options.
> Isn't your case (3) the same as (1)? Whether an instruction is available in hardware or needs to be open-coded as a sequence of instructions only affects performance, hence this is a choice about how fast things should run.
The distinction is that in (1) you're happy for the fallback to be simply `a*b + c` (i.e. you just want the fma for speed), whereas in (3) you're not (i.e. you're using the fma for the extra accuracy). In some sense, this mirrors the distinction between your proposed `muladd` and `fma` functions.
The other point is that in (3) you probably want to write the fallback separately, as software fma's are woefully inefficient: e.g. see the openlibm code.
+1 for a `@fastmath` macro.
I think this issue can now be closed, as #9855 covers the indicator issue.
LLVM has a floating-point fma operation:
http://llvm.org/docs/LangRef.html#llvm-fmuladd-intrinsic
Does it make sense to have an `fma` function? It makes sense for floating-point, can be emulated for integers and other numeric types with a simple fallback, and could allow matrix code doing `A*B + C` to avoid temporaries.