JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

The gfortran benchmarks should use the "-march=native -ffast-math -funroll-loops" options #24568

Closed. certik closed this issue 6 years ago

certik commented 6 years ago

Specifically, when I compare

gfortran -O3 perf.f90 && ./a.out
gfortran -O3 -march=native -ffast-math -funroll-loops perf.f90 && ./a.out

on my machine, I get a 2x speedup on the iteration_pi_sum benchmark. I haven't tested the C code, but a similar speedup might be possible there as well.

Note: if you use a recent gfortran compiler (e.g. I just tested this with 7.2.0), you can just use:

gfortran -Ofast perf.f90

And it will turn on the proper options for you, and produce optimal results.

Keno commented 6 years ago

If we do that, we should allow julia the same liberties of course ;).

StefanKarpinski commented 6 years ago

Fast math mode is not functionally identical, so that's changing the algorithm (and potentially the result). Unrolling loops is fair game, but most compilers should consider doing that anyway.

certik commented 6 years ago

@StefanKarpinski I checked on my machine and indeed -ffast-math seems to be solely responsible for bringing the iteration_pi_sum benchmark down from 20 (ms?) to 10 ms.

Regarding your point about changing the algorithm --- it all depends on what you want the benchmarks to show. I use Fortran every day. I use -Ofast. Intel's ifort, I believe, does the same thing, in fact by default. As such, your interpretation of the benchmarks is not applicable to my work, since I use the best optimization options that the compiler allows.

Maybe a good compromise would be to mention that you are not using -ffast-math in gfortran (because you feel that's not fair to Julia), even though it can deliver up to a 2x speedup on some benchmarks.

Btw., do you have a real-life example where -ffast-math fails? I've been recommending it to everyone, so if you have a counterexample, please let me know.

My final thought is that a fair benchmark would be to simply compare against LLVM-based compilers. That way everybody is using the same backend, and thus we are benchmarking how well each language can be optimized into LLVM IR. I think that would be meaningful. Well, except for the fact that there is currently no production LLVM Fortran compiler, but eventually I believe that will change. At least for the C language, though, I think it makes sense to use clang with exactly the same LLVM version as Julia.

Since there is no Fortran LLVM compiler yet, I think the next best thing is to let both LLVM and gfortran do their best. This includes -ffast-math for gfortran, and an equivalent option for Julia/LLVM. Otherwise you need to prove that none of the dozen or so gfortran options triggered by -ffast-math are used by LLVM.

Keno commented 6 years ago

I don't care either way, but julia has ffast-math as well of course, so either we turn it on everywhere or we turn it off everywhere. As for real world examples, ffast-math turns on ffinite-math-only, which is a big problem for people who use float NaNs as missing data.

certik commented 6 years ago

@Keno do you know if that's how Pandas and NumPy do it? That would be a good argument against doing it by default. For numerical computing, I don't use NaNs --- except in a Debug build when I want it to fail as soon as a NaN would be generated. I use the following options in gfortran in Debug mode: -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999. And I always use both Debug and Release modes, both in Fortran and in C++. In Julia, do most people just use one mode?

Keno commented 6 years ago

In julia fastmath is a local annotation, so the user can specify whether they are ok with non-IEEE behavior or not in a particular place in the code base.
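For instance, a minimal sketch of what that opt-in looks like (illustrative only; the function and data here are made up): only the annotated loop may be reassociated or otherwise evaluated with non-IEEE semantics, while the rest of the program keeps strict IEEE behavior.

function mysum(xs)
    s = 0.0
    @fastmath for x in xs
        s += x        # the compiler may reassociate/vectorize this reduction
    end
    return s
end

mysum(rand(1000))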

StefanKarpinski commented 6 years ago

Fundamentally "fast math" does not compute the same thing, so we could either turn this on in every language or turn it on in no languages. Many languages don't support fast math at all, which is definitely a mark against them for high-performance computation, but if we do it in some but not others, then we're not comparing apples to apples. This hurts languages like Julia, Fortran and C where this is an option in the comparison, but those are also the languages that are kicking everyone else's butts, so I think we can afford it.

oscardssmith commented 6 years ago

One possibility would be to report two different results for pisum for languages with fastmath. This would acknowledge the existence of a good feature, while not being an unfair comparison.

romeric commented 6 years ago

I believe comparing the performance of different languages/compilers under -ffast-math mode would be fundamentally flawed. While at the -O1/-O2/-O3 levels there is still a set of floating point guidelines to follow (IEEE), there is no such rule that compilers have to follow under -ffast-math, and one cannot expect all compilers to do the same optimisation when IEEE floating point compliance is abandoned.

Moreover, there is no fast-math mode for languages like Python, MATLAB/Octave, R etc. NumPy/SciPy are typically built with the same flags that the CPython itself was built with and under most environments that is -O2.

The actual reason that the pisum benchmark runs twice as fast is that under -ffast-math the compiler emits an AVX vectorised version of the code instead of an SSE version, and that is precisely the reason for the 2X speed-up. Clang/GCC do the same optimisation for the C version of the code as well; this is easy to check by looking at the generated assembly. I think the reason this optimisation does not kick in under -O3 is that the compiler cannot guarantee the same statistical distribution of the error while working with wider vector lanes.
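For reference, the pisum kernel is roughly of this shape (a sketch, not the exact benchmark source); the sequential reduction into s is what the compiler can only vectorise with AVX once it is allowed to reassociate the additions:

function pisum()
    s = 0.0
    for j = 1:500               # repeated to give a measurable runtime
        s = 0.0
        for k = 1:10000
            s += 1.0 / (k * k)  # IEEE fixes the order of these additions
        end
    end
    return s
end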

certik commented 6 years ago

So I think your argument is that you want to set the playground rules, and those rules are IEEE floating point arithmetic. My argument is that the IEEE rules are not the relevant rules, and I asked above whether somebody can provide an example where -ffast-math produces incorrect code from the computational/numerical perspective. Instead of talking abstractly, let's discuss a particular example where -ffast-math fails from any practical perspective, e.g. that some numerical algorithm fails because of it (or some convergence rate changes, etc.).

If such an example does not exist, then you can give up IEEE, since it does not affect any good numerical algorithm in practice. That is my argument.

StefanKarpinski commented 6 years ago

Here's a simple example of computing the error of a floating-point addition:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    double a = atof(argv[1]);
    double b = atof(argv[2]);
    if (fabs(a) < fabs(b)) { double t = b; b = a; a = t; }
    double e = ((a + b) - a) - b;
    printf("addition error for `%g + %g`: %g\n", a, b, e);
    return 0;
}

IEEE versus fast math:

$ gcc -O3 tmp/fastmath.c -o tmp/ieeemath
$ gcc -O3 -ffast-math tmp/fastmath.c -o tmp/fastmath

$ tmp/ieeemath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 3.66

$ tmp/fastmath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0

It's extremely naïve to think that things at least this subtle don't happen in real numerical codes.

certik commented 6 years ago

@StefanKarpinski thanks for taking the time to figure out an example. Here is the same example that you can try in Fortran with, say, the Intel compiler:

program fastmath
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, b, t, e
a = get_float_arg(1)
b = get_float_arg(2)
if (abs(a) < abs(b)) then
    t = b; b = a; a = t
end if
e = ((a + b) - a) - b
print "('addition error for `', es9.2, ' + ', es9.2, '`: ', es9.2)", a, b, e

contains

    real(dp) function get_float_arg(i) result(r)
    integer, intent(in) :: i
    integer, parameter :: maxlen=32
    character(len=maxlen) :: arg
    integer :: s
    if (.not. (0 < i .and. i <= command_argument_count())) then
        error stop "get_float_arg: `i` must satisfy `0 < i <= arg_count`."
    end if
    call get_command_argument(i, arg, status=s)
    if (s == -1) then
        error stop "get_float_arg: Argument too long, increase `maxlen`."
    else if (s > 0) then
        error stop "get_float_arg: Argument retrieval failed."
    else if (s /= 0) then
        error stop "get_float_arg: Unknown error."
    end if
    read(arg, *, iostat=s) r
    if (s /= 0) then
        error stop "get_float_arg: Failed to convert an argument to a float."
    end if
    end function

end

And run it as:

$ ifort fastmath.f90 -o intelmath
$ ./intelmath 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  0.00E+00

Gfortran here behaves as gcc:

$ gfortran fastmath.f90 -o gfortranmath
$ ./gfortranmath 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  3.66E+00
$ gfortran -Ofast fastmath.f90 -o gfortranmathfast
$ ./gfortranmathfast 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  0.00E+00

The Intel compiler has the expected behavior, since in Fortran the compilers have the freedom to rewrite an expression to an equivalent mathematical form (based on real numbers, not floating point), so ((a + b) - a) - b and 0 are equivalent. Now, strictly speaking, here is what the Fortran 2008 Standard says:

7.1.5.2.4 Evaluation of numeric intrinsic operations ...

  1. Once the interpretation of a numeric intrinsic operation is established, the processor may evaluate any mathematically equivalent expression, provided that the integrity of parentheses is not violated.
  2. Two expressions of a numeric type are mathematically equivalent if, for all possible values of their primaries, their mathematical values are equal. However, mathematically equivalent expressions of numeric type may produce different computational results.

In other words, the expression as used above:

e = ((a + b) - a) - b

has only one way to interpret it, and the Intel compiler indeed broke the "integrity of parentheses". That is confirmed by an example in the Standard, which says that the expression (X + Y) + Z has a "forbidden alternative form" X + (Y + Z). However, if you change it to:

e = a + b - a - b

Then the ifort as well as gfortran results above do not change, but now the Standard allows evaluating the mathematically equivalent expression 0 instead, and it even gives an example of an expression X - Y + Z that has an "allowable alternative form" X - (Y - Z), so a + b - a - b can be rewritten to a+b-(a+b) and that can be rewritten to 0.

The conclusion is that the Fortran Standard encourages this kind of rewriting; even the Standard says not to break parentheses, but in practice compilers do break them. Intel Fortran is a good, solid, reputable compiler, and it does this by default, as you can see above. In light of the above I must strongly disagree that this is "extremely naïve". It might not be what you want to do in Julia, but it's not naïve.

In fact, if you use icc, Intel breaks this by default even in your own C example above, parentheses and all:

$ icc fastmath.c 
$ ./a.out 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0

Now that we have discussed the standard and the compiler behavior, the more important problem with the above example is that when you write numerical code, you should not rely on such behavior. Besides the fact that the Intel compiler breaks it by default both in C and in Fortran, the more important problem is that such code is unreliable. As an example where you seem to depend on such behavior in Julia (https://github.com/JuliaMath/Roots.jl/pull/12), consider this blog post that the PR is based on:

http://www.shapeoperator.com/2014/02/22/bisecting-floats/

where it is claimed that instead of using abs(fn(x1)) > tol || abs(fn(x2)) > tol one should use x1 < (x1 + x2)/2 < x2, and even that is later replaced with casting floats to integers and back. First of all, one should never cast a float to an integer like that, as it is neither portable nor readable. Second, the version x1 < (x1 + x2)/2 < x2 is not portable either: it is reasonable to assume that the Intel compiler will keep improving, and since the Fortran standard encourages rewriting using mathematically equivalent expressions, the compiler knows that x1 < x2, so it can trivially replace x1 < (x1 + x2)/2 with .true., and the above loop will never terminate. Rather, the correct solution is to use tolerances, but instead of tolerances on the values of the function, in the above case one should simply use x2 - x1 > tol (see the sketch below). Then one has to set a problem-dependent tolerance, but one has to do that anyway. In fact, many times one does not want to go to machine accuracy; a given problem might only require, say, 1e-6 or even less.
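Concretely, a bisection loop terminated on the interval width looks roughly like this (a sketch only, with a caller-supplied function f and tolerance tol --- not the Roots.jl code):

function bisect(f, x1, x2; tol = 1e-10)
    f1 = f(x1)                  # assumes f(x1) and f(x2) have opposite signs
    while x2 - x1 > tol         # terminate on the interval width, not on f's values
        xm = (x1 + x2) / 2
        fm = f(xm)
        if sign(fm) == sign(f1)
            x1, f1 = xm, fm
        else
            x2 = xm
        end
    end
    return (x1 + x2) / 2
end

bisect(x -> x^2 - 2, 1.0, 2.0)  # ≈ sqrt(2)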

When numerical code is written like that (e.g., x2 - x1 > tol), it does not depend on IEEE conventions; in fact it does not even depend on the details of the floating point implementation. One always has to do a convergence study (in the case of bisection it's trivial, but for more complicated algorithms the values become noisy once you reach convergence, say with a mesh refinement, and then you set the tolerance to be safely above the noise). To see an example of what I am talking about, look at Fig. 1 in arXiv:1209.1752, where the error is plotted against various convergence parameters, and it always behaves the same: it drops, then it hits a certain level and stays there --- that's when it's converged. One can plot the bisection error in the same way. In both cases, if I set the tolerance below this minimal error level, then the algorithm will never terminate. Such is the nature of numerical computing. Typically the tolerance is problem dependent, though one can try to use a relative error, or even both absolute and relative errors at the same time, to mitigate that a bit, but in the end it's going to be problem dependent.

Coming to your example above, I claim that one should never write a numerical algorithm that depends on such a behavior. You are subtracting two large numbers, so by definition you lost all the accuracy. The result can be anything, and one should never rely on such a result in any way.

Based on my reasoning above, the conclusion is: when writing numerical code, write it mathematically and assume that the compiler can use any mathematically equivalent expression; at the same time, avoid numerical constructs which lose accuracy in floating point (such as subtracting two large numbers), never assume any particular representation of floating point (though you can and should assume how many significant digits can be represented), and compare floating point numbers using a tolerance. There might be a few more rules that I forgot.

I welcome anyone to find holes in my arguments. I would be most interested to see where my reasoning is incorrect.

PallHaraldsson commented 6 years ago

@certik, I do not want to contradict @Keno's "In julia fastmath is a local annotation" --- yes, but it is not only local? There is also the julia flag:

--math-mode={ieee,fast} Disallow or enable unsafe floating point optimizations (overrides @fastmath declaration)

@romeric "The actual reason that the pisum benchmark runs twice faster is that under -ffast-math the compiler emits an AVX vectorised version of the code instead of a SSE version"

I believe you should get that with Julia, with either the local or the global setting; you can at least try. Do Fortran and C only have this kind of global blunt instrument? It's good for performance testing, but you really should enable it locally?

It would be good to indicate at least in a footnote, that you can get faster in Julia and some other languages, even if numerically less accurate, and point out Julia's advantages.

Keno commented 6 years ago

The math-mode command line flag is quite problematic, since there's a bunch of code in Base and elsewhere that assumes NaNs are possible. E.g. https://github.com/JuliaLang/julia/issues/21375

StefanKarpinski commented 6 years ago

Now, when we have the standard and compiler behavior discussed, the more important problem with the above example is that when you write numerical code, you should not rely on such behavior.

Without reliable associativity as guaranteed by IEEE, you cannot write important and useful algorithms like Kahan summation (the error term is always zero mathematically). Julia can and does use this kind of code for correct answers in many places in the standard library. Perhaps your numerical code does not rely on a correct implementation of floating-point associativity, but insisting that everyone else has a moral obligation not to rely on it strikes me as a rather myopic position. If everyone thought like this, we'd still live in the dark bad old days before IEEE. If the compiler is allowed to reassociate summation arbitrarily, you can literally get any possible answer when summing a small, fixed set of floating-point numbers. Numerical code goes from being fully well-defined with IEEE to being completely implementation-specific without it – even for something as basic as adding a bunch of numbers. What results you get depends on what your compiler chooses to do. And it can change from compiler to compiler and version to version. That's a fairly awful situation to be in for anyone who cares about reproducibility.
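For concreteness, here is the textbook form of compensated (Kahan) summation, as a sketch: the compensation c is identically zero under real-number algebra, so a compiler licensed to use "mathematically equivalent" rewrites is licensed to delete the entire correction.

function kahan_sum(xs)
    s = 0.0
    c = 0.0                 # running compensation; mathematically always 0
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y     # recovers the low-order bits lost in s + y
        s = t
    end
    return s
end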

Regarding Intel compilers: I consider not following IEEE behavior without opt-in to be a bug. This may be part of why their compilers have such a reputation for being fast, yet everyone actually uses gcc and clang (although historical cost likely has a great deal to do with it as well). This is probably also a large part of why libraries like openlibm cannot be safely compiled with icc.

All that said, I'm pretty tired of arguing about this, so feel free to make a PR adding a "fast math" versions of the Fortran, C and Julia benchmarks. (Please do not make a PR just changing the compiler options for Fortran from IEEE to fast math.)

StefanKarpinski commented 6 years ago

Also: your argument conflates numerical instability with incorrect answers due to invalid optimizations. These are not the same thing at all. Tricky as round-off issues are, they are predictable, and a skilled numerical analyst can tell you exactly how much round-off error can occur in a given computation. It seems like you may be falling into the common trap of imagining that floating-point operations introduce some sort of unpredictable error or fuzziness. And if that's the case, then why not give the compiler license to produce different fuzzy results? But this is a fundamental misunderstanding of floating-point arithmetic: it is completely predictable and always gives the correct answer rounded to the representable precision. Yes, the rounding introduces error, but in a well-defined, predictable way. There is no gremlin introducing random mistakes in the last digit. When using -ffast-math, there is a gremlin introducing random mistakes – the compiler.

certik commented 6 years ago

@StefanKarpinski the summation example is clever, but ultimately it's just a curiosity, since even Julia can't sum it correctly:

julia> foldl(+, (sumsto(10.)))
10.0

julia> sum(sumsto(10.))
1.99584030953472e292

julia> sum_kbn(sumsto(10.))
9.9792015476736e291

and the returned array of numbers are large and small, all over the place:

julia> sumsto(10.)
2046-element Array{Float64,1}:
  1.79769e308 
  4.94066e-324
  ⋮           
  2.4948e291  
  4.9896e291  
 -1.79769e308 
  2.0         
  8.0         

So that breaks my rule above "avoid numerical constructs which lose accuracy", since we are summing large and small numbers, and as such it is something to be avoided when writing numerical code.

The openlibm case, however, is a very good example that does apply. So I opened an issue (https://github.com/JuliaLang/openlibm/issues/169) with the icc failures; let me have a look at what is going on. For example, the exp2 test gives a completely wrong answer with icc:

Failure: Test: exp2 (0.7) == 1.6245047927124710452
Result:
 is:          1.62362532701732886764e+00   0x1.9fa5e8d07f29e0000000p+0
 should be:   1.62450479271247094637e+00   0x1.9fdf8bcce533d0000000p+0
 difference:  8.79465695142078729418e-04   0x1.cd17e3304f8000000000p-11
 ulp       :  3960761376927.0000
 max.ulp   :  0.0000
Maximal error of `exp2'
 is      :  3960761376927.0000 ulp
 accepted:  0.0000 ulp

That's unacceptable. We would have to find exactly where the problem is, before we can draw conclusions.

As a workaround, one can always turn off the optimization that broke it. In the openlibm case, that's -fp-model=precise, which I think turns on IEEE compliance. Then all tests pass with Intel (https://github.com/JuliaLang/openlibm/issues/169#issuecomment-344413072). That should make icc usable for you, since you like IEEE compliance anyway. It's unsatisfactory for me, so I'll have a look at what is going on if I have time.

But this exp2 case is a good example of -ffast-math breaking people's code, and I wanted to have one. Thank you for that.

I personally agree with Intel, and their decision happens to be in line with my guidelines above on how to write numerical code. And for special cases, one can always turn it off; I never argued there should be no way to turn it off.

I have submitted PRs to Julia before, but I will not submit this one, because you @StefanKarpinski seem "pretty tired of arguing about this" (using your own words), and I did not appreciate how I was treated in this thread. You @StefanKarpinski attacked me ad hominem in almost every reply of yours ("extremely naïve", "myopic", "your numerical code vs everybody else", etc.), and I am very tolerant, but this is enough even for me. I have no interest in discussing with people who attack me personally instead of discussing just the technical details. I came here with good intentions to improve the benchmarks, and I have improved your benchmarks before. In fact, I met you @StefanKarpinski at the SciPy conference and also at the Google Mentor Summit. I had beers with you. I am very sad. But we live in a free country, and I will always defend your freedom of speech. I will not discuss this online anymore. However, if we ever meet in person again, I am happy to buy you a beer so that we can continue this discussion in person. I will now try to forget the attacks and just think about the technical details, where you provided good arguments, even if I disagree with your conclusion.

I am closing this issue, since, as @StefanKarpinski explained, for Julia it makes sense to stick with IEEE by default. I initially thought that maybe the -ffast-math option had been omitted by mistake, but based on the discussion here it is clear you have thought about this a lot and reached a conclusion not to use it. That is fine with me. I think you should still mention this somewhere in the text accompanying the benchmarks.

StefanKarpinski commented 6 years ago

I'm sorry that you feel personally attacked – that certainly was not my intention. We have fundamentally different views here. I apologize for any adjectives I used that you felt were insulting. From my perspective, this is the hundredth debate I've participated in about why language X (in this case Fortran) should be allowed to do something special that makes it faster on these benchmarks. It's quite tiring after the tenth such discussion or so. Our principles regarding these benchmarks have always been simple and consistent: every language must perform the same computation.

Turning on -ffast-math or the equivalent in some languages but not others violates that fundamental principle because it means some languages are computing something different than others. I made this point early on, but you insisted that it was wrong to write numerical code that requires IEEE math and demanded examples where compilers not following IEEE made a practical difference – which I've provided several of. You continue to have the perspective that code that relies on IEEE math is wrong. I continue to believe that a compiler which violates IEEE without explicitly being asked to do so is wrong – and is not computing what it was asked to compute.

In the particular case of openlibm, writing correct algorithms without the guarantees of IEEE will end up being less efficient than just turning IEEE arithmetic on. If you are in the business of writing mathematical code that gives consistent, predictable, correct answers, IEEE is not the enemy of performance; it is precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.

I hope that I get a chance to buy you a beer when we cross paths in the future and apologize in person for offending you – I got carried away with the argument and with having a similar debate for the hundredth time. For what it's worth, this point of contention is far more interesting and substantive than the previous ones have been.

certik commented 6 years ago

@StefanKarpinski no offense taken, and apology accepted. I didn't know you had had hundreds of these debates before. I maintain some large open source projects too, so I know that some users or developers keep bringing up the same issue over and over again --- the same issue for me, but for each of the people who brought it up it was their first time. As was the case here: I had never had such a debate before.

Just a note that I didn't say to allow it for Fortran and not for Julia. Either both or none. With -ffast-math, it might make sense to print the final accuracy/error, since things are not well defined anymore. The task then becomes "give me a given (say, 1e-14) accuracy". For openlibm, if it depends on IEEE, then just turn on IEEE in icc, and then it does the right thing. I've been looking into openlibm; something fishy is going on in the exp2 routine if giving up IEEE completely breaks it. Once I figure it out, I'll try to rewrite it "my way" (as you say), and we can compare performance. If the performance is worse, then you have a point, and I'll admit that IEEE makes sense (in this particular case). I'll also check that I am getting the correct answer, up to a few ulp. I understand that is not enough for openlibm, where the answer must have 0 ulp error to be IEEE compliant. But that's a different question; for me a few ulp of error is completely fine. But a 1e-4 error is unacceptable.

This debate made me think that I should write up the rules that I follow when writing numerical code, and put it up for scrutiny. I thought they are well known, but I don't think they are.

StefanKarpinski commented 6 years ago

Just a note that I didn't say to allow it for Fortran and not for Julia. Either both or none.

The issue with this is that the comparison isn't just between Julia and Fortran. C, Fortran and Julia all support "fast math" modes, but other languages don't and we need to apply the principle of "the same computation" to all the languages in the benchmark. That's why it would be ok to add fast math versions, but the basic cross-language comparison has to be IEEE – so that the comparison is fair. We could turn "fast math" on in just C, Fortran and Julia and they would look even better by comparison to other languages, but we're already clobbering them, so that doesn't seem very sporting.

certik commented 6 years ago

@StefanKarpinski I understand this argument and I agree one should compare exactly the same computation. My own view of the benchmarks is that Julia already won against Python or Matlab, so I don't care about those, the case is settled. I do, however, care about the Fortran / Julia benchmark.

Anyway, back to the IEEE issue in openlibm. So I figured out why exp2 fails: the failure is caused by icc rewriting z = x - ((x+redux)-redux) into 0, which was exactly your example! Funny. I said about your code that one should never do that in practice, and I still think that. So what is going on here? Here are the relevant definitions:

#define TBLBITS 8
#define TBLSIZE (1 << TBLBITS)
static const double
    redux    = 0x1.8p52 / TBLSIZE;

And here is the actual original code, that fails in icc:

STRICT_ASSIGN(double, t, x + redux);
t -= redux;
z = x - t;

The t variable is also used for some other stuff, I removed that from this listing, as icc compiles that part properly.

I will explain my reasoning in detail, so that you or others can scrutinize it. As I said, I welcome anybody to find holes in my arguments.

This code is hackish and unreadable. Look at it and tell me in 5s what this is doing. Maybe you can, but I certainly can't tell. Here is what this is doing, which took me a long time to figure out:

z = x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE;

Now this is a whole different issue -- this is a common pattern, I usually use it the other way round as X = X - L*floor(X/L), which periodically shifts particles to the [0, L]^3 box. In the way it's written above, L=1/TBLSIZE, so we are periodically shifting "x" into a z in a box of [0, 1/TBLSIZE]. The 0.5 in there I think just shifts the box to [-1/TBLSIZE/2, +1/TBLSIZE/2]. This is consistent with the mathematical formulation from the C file:

 * Method: (accurate tables)
 *
 *   Reduce x:
 *     x = 2**k + y, for integer k and |y| <= 1/2.
 *     Thus we have exp2(x) = 2**k * exp2(y).
 *
 *   Reduce y:
 *     y = i/TBLSIZE + z - eps[i] for integer i near y * TBLSIZE.
 *     Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z - eps[i]),
 *     with |z - eps[i]| <= 2**-9 + 2**-39 for the table used.

First of all, there is a mistake, it should be x = k + y. Second of all, you can see the 1/TBLSIZE compartment in there. The original code is using redux to do the floor together with the TBLSIZE scaling. If you think about the value of redux in binary:

In [6]: "{0:b}".format(int(float.fromhex("0x1.8p52")))
Out[6]: '11000000000000000000000000000000000000000000000000000'

You can see that adding to it performs the floor; since redux is divided by TBLSIZE=256, that just removes some zeros from the end, so by adding and then subtracting redux you essentially do the same thing as floor(x*TBLSIZE + 0.5) / (double)TBLSIZE. One can verify this in, e.g., Python:

In [7]: redux = float.fromhex("0x1.8p52") / 256

In [8]: "{0:b}".format(int(redux))
Out[8]: '110000000000000000000000000000000000000000000'

In [45]: x = -0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[45]: (-0.0007812499999999556, -0.0007812499999999556)

In [46]: x = 0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[46]: (0.0007812499999999556, 0.0007812499999999556)

In [47]: x = 0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[47]: (0.0007812499999999556, 0.0007812499999999556)

In [48]: x = -0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[48]: (-0.0007812499999999556, -0.0007812499999999556)

In [49]: x = 1.123; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[49]: (0.0019062499999999982, 0.0019062499999999982)

In [50]: x = -1.123; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[50]: (-0.0019062499999999982, -0.0019062499999999982)

In [51]: x = -0.5; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[51]: (0.0, 0.0)

In [52]: x = 0.5; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[52]: (0.0, 0.0)

But the x-((x+redux)-redux) version is unreadable and relies on IEEE, so it fails in icc. On the other hand, the version x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE corresponds to the math; it is mathematically correct, so it will be correct whether or not your floating point numbers follow IEEE, and it will also work in icc. And it does. And that is my whole point --- one should not write stuff like x-((x+redux)-redux), which is not portable and is mathematically equal to 0. One should translate the math into code, which in this case is x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE. This is correct mathematically, and in the code. This is the Fortran philosophy: make the code look like the math. Formula translation (that's where the name Fortran comes from).

Ok, I am not here to preach, but I want to put this point across, as it is often misunderstood. Even if you don't agree, it's worth understanding the idea.

There is still one problem. The new version is much slower than the original. Well, let's work on that. So what is TBLSIZE? It happens to be 256. Ok, what if it is something else, like 250? Well, then the equivalence does not work anymore:

In [53]: redux = float.fromhex("0x1.8p52") / 250

In [56]: x = 0.71; x-((x+redux)-redux), x-math.floor(x*250+0.5)/250
Out[56]: (-0.0009375000000000355, -0.0020000000000000018)

I don't know, but I suspect the original might not work unless TBLSIZE is a power of two. My new version works just fine for any number. In the C code, TBLSIZE is always a power of two, so no problem there. So for multiplying and dividing a floating point number by a power of two, there is an ldexp function:

z = x - ldexp(floor(ldexp(x, TBLBITS)+0.5), -TBLBITS);

We now use TBLBITS; see the definition in the C code above. This is about as much as we can do while keeping this mathematically correct. A good compiler, however, should have a very fast (and thus platform and floating point implementation dependent) implementation of the ldexp and floor functions. Instead of floor(a+0.5) we could also have used round(a), so that we don't have to do the +0.5 addition. Either way, the compiler can now generate very efficient code. I think a good compiler might be able to optimize this to be as fast as the original. I haven't checked what assembly icc generates. The redux trick is essentially a way to implement this in a platform dependent way, and the compiler should do that. If not, and we really need this for speed, one can always use some macros and have two implementations --- one general, but perhaps slower sometimes, and one platform dependent. In fact, the openlibm library is already using exactly this approach with the STRICT_ASSIGN macro. If anyone objects to this particular point, that a compiler can't generate code as efficient as the original, let me know; I can look into this further. Update: I implemented a solution that is as fast as the original, but readable and working with icc, in https://github.com/JuliaLang/openlibm/pull/170.

Anyway, the final version now works with icc.

You can see all the details at https://github.com/JuliaLang/openlibm/issues/169. I spent the whole evening on this. But this is a great example, and I think it proves my point that I was making above. One should write mathematical code, and then use -ffast-math.

One final disclaimer: I am not claiming that there can't be an example where my approach does not work. All I am claiming is that so far every example that anybody ever showed me can either be rewritten into mathematical form + -ffast-math, or it is not relevant in practice.

StefanKarpinski commented 6 years ago

Great analysis and thanks for the PR.

As a counter-example, consider my original problem: how do you find the error of a single floating-point addition when the compiler is free to re-associate the operations arbitrarily?

vtjnash commented 6 years ago

I'm pretty sure the code was written to be fast, not readable. But it's also definitely not a hack to make use of standards-defined behavior in a language. C compilers like to make it very clear that they'll use any undefined behavior in the C standard to optimize / delete / rewrite your code. However, I think the flip side of that is that a compiler that intentionally miscompiles standards-compliant code is not a C compiler, and is thus disqualified from being used in a benchmark.

write it mathematically, and assume that you can use any mathematically equivalent expressions, and at the same time, avoid numerical constructs which lose accuracy in floating point

Also, you've been arguing that the compiler should be allowed to use "mathematically equivalent expressions", but it is still constrained by floating point precision, so the transforms being made by icc are not equivalent. For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size. That could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results. I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

certik commented 6 years ago

@StefanKarpinski you do it exactly as I have done in the PR. In C, you can use volatile, but as they pointed out, that spills the register, so one would have to either write this in assembly, or provide a C implementation and compile that particular file with IEEE semantics, to avoid writing assembly. In openlibm, you essentially extended these IEEE semantics to the whole code, so the whole code must be compiled like that, but that's against my philosophy.

@vtjnash, @StefanKarpinski this discussion made me realize that there are two schools.

In most of this thread, I think @StefanKarpinski assumed that there is only the IEEE school, and that the Fortran school is "extremely naïve", "myopic", or that there is no such school at all --- that it is only "my code", while everybody else uses the IEEE school. I am not offended in any way; I am just trying to explain what happened in this thread. Doing this made me realize a few things about the IEEE school that I hadn't realized before, and I am also trying to figure out the rules of the Fortran school (and write them down later!). The IEEE school has pretty clear rules. The Fortran school, it turns out, also has very firm rules, but they are not written down anywhere very clearly. The rules are clearly there, though. That's why I will attempt to write them down.

The two schools are not equivalent, and that's what's causing the disagreement in this thread. They are almost equivalent, but they stress different things. It's like approaching things from two different angles: if you go far enough, the two angles can almost meet, but they never become 100% equivalent.

Let's first answer this one:

how do you find the error of a single floating-point addition when the compiler is free to re-associate the operations arbitrarily?

So the Fortran school stresses portability, and the code needs to look like the math. However, the Fortran school also stresses speed. That's why we give up IEEE semantics, and that's why Intel gives it up by default. Note that Intel initiated the IEEE standard in the first place (as I learned from @StefanKarpinski's link above, great link btw), yet they still give it up by default. You might call that ironic, but it's not irony; it's just two schools, and you need to support both. Anyway, you need speed in Fortran, so, as an example, in the PR above we ended up with the ldexp(floor(ldexp(x, p)+0.5), -p) operation, and we needed to make it as fast as possible. The Fortran school relies a lot on the compiler, and the compiler can do a lot --- unlike with IEEE semantics, in the Fortran school we are limited in what we can rely on, and as such the compiler has great freedom to optimize the code. So in this case, the compiler knows exactly what we are doing, since ldexp(floor(ldexp(x, p)+0.5), -p) is mathematically well defined, and a good compiler can do amazing things with it. But let's say icc can't produce optimal code for this particular example. Then we can provide an efficient implementation of it, as I have done in the PR, and simply create a function ldexp_floor_ldexp(x, p) that does this quickly, but in a platform specific way, using the IEEE-specific redux trick.

You can see that the schools can get quite close, but are not exactly equivalent. In the Fortran school, if the compiler can't quite optimize the platform independent way of doing things, you might need to help it by providing such intrinsics. However, in practice, the -ffast-math really is fast and faster than not using it. So the Fortran school gets the speed this way.

Now to @vtjnash's excellent comments:

However, I think the flip side of that is that a compiler that intentionally miscompiles standards-complient code is not a C compiler, and thus disqualified from being used as a benchmark.

Yes, a Fortran school compiler can't be used for an IEEE school benchmark. No contradiction here.

Also, you've been arguing that the compiler should be allowed to use "mathematically equivalent expressions", but it is still constrained by floating point precision, so the transforms being made by icc are not equivalent.

In the IEEE school they are not equivalent. They are however equivalent in the Fortran school. No contradiction here. The Fortran school is different than the IEEE school, and the advantage of the Fortran school is precisely that you can now use such mathematically equivalent expressions to greatly speedup the code, but at the same time, the Fortran school requires you to write your code in a platform independent way (or more specifically in a floating point implementation independent way) without relying on IEEE semantics. It's a trade off.

For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size.

Indeed. So in Fortran school you can do that, but in IEEE school you can't.

That could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results.

It could and it matters for the IEEE school. But it does not matter for the Fortran school, because you write the algorithm in a way not to rely on such things.

I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

And this is the core of the argument, the key difference between the IEEE school and the Fortran school. In the IEEE school, you actually don't know in practice what exactly the code is doing. Yes, it is well specified, but you don't know in practice, because you can't tell whether my reimplementation is still 100% equivalent or not. In other words, you can't tell whether x-((x+redux)-redux) and x-math.floor(x*256+0.5)/256 are equivalent. Both are well specified in IEEE, but you can't tell in practice, because, as you said, it "could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results." In other words, the floor and the division may change things slightly compared to how x-((x+redux)-redux) does it. Absolutely! That is the key. Now, regarding the author's expressed intent: the two schools differ here too. The IEEE school says that his intent was exactly x-((x+redux)-redux), period. The Fortran school says that his intent was to implement the math, as described by the comment in the code. So he needed to put the numbers into boxes, which is mathematically done using the x-math.floor(x*256+0.5)/256 idiom, and so that is what the intent was. So in Fortran, that's how you write it. And then you leave it to the compiler to produce correct mathematical code (i.e., the correct result with the best accuracy it can get).

I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

Again, a huge difference between the schools. In the IEEE school, indeed, you must require an IEEE compiler, otherwise the code will completely fail, as exemplified by the openlibm example. In the Fortran school, the code itself exactly represents your mathematical intent, so the compiler can't ignore it. It must obey it, and by definition it is also free to use any other mathematically equivalent representation to implement it. So icc can't miscompile my code. (Yes, it can if there are bugs in the compiler, but let's say there are no bugs.) That is the key thing to understand: in the Fortran school, the icc or ifort compilers can't miscompile code written in the Fortran school.

@StefanKarpinski also mentioned that he considers the default icc behavior a bug. It is indeed a bug in the IEEE school. But it is not only not a bug in the Fortran school, it is the desired feature!

Anyway, sorry for the long comments, but things are becoming very clear in my head. I think I nailed it. I had never realized, nor written down explicitly like this, that there are two schools, IEEE and Fortran, and that they differ so much. At the same time, in practice they overlap a lot. But they still differ, as I have shown above, and a given piece of code can be a bug in one school but a feature in the other, and vice versa.

vtjnash commented 6 years ago

Which is mathematically done using the x-math.floor(x*256+0.5)/256 idiom

Mathematically, perhaps, but they aren't equivalent in hardware on real floating point numbers. That's why the IEEE standard was created: to replace the wild west of "fortran school" implementation-defined behaviors. In particular, the rewrite assumes that x*256 + 0.5 is performed in arbitrary-precision arithmetic and won't overflow (false for extra-large numbers, which return Inf), and that adding 0.5 will be computed exactly (possibly false for small numbers, for example those near 0.5/256).

he considers the default icc behavior a bug

It is a bug. It is a clear violation of both the Fortran and C standards. There may be times when it is unclear if a compiler is taking advantage of UB when doing an optimization. But this is not UB. I don't expect that they'll fix this, since it'll impact their benchmark results. But it does call into question the validity of their performance claims, since they have been found to not be doing an apples-to-apples comparison.

certik commented 6 years ago

@vtjnash yes, in IEEE school I 100% agree with your comment. You are simply logically following the IEEE school's assumptions and I agree with your conclusions.

In the Fortran school, you don't care how this is implemented in hardware. It is not wild west (that's the only thing in your comment I disagree with --- the Fortran school is well defined too, or can be well defined, as I am discovering the rules myself). The x-math.floor(x*256+0.5)/256 operation is mathematically well defined. Period. Now the question is how to actually compile it. You have freedom here, and I think it's precisely this freedom that you refer to as "wild west". But I object to that, since this freedom does not mean producing incorrect results: you must produce the best result you can on the given hardware. The link that @StefanKarpinski posted, that was truly wild west, since he described that if you didn't multiply by 1.0, it would return an incorrect answer. Well, that is against the Fortran school, so I object to the claim that they were following the Fortran school, using my definition above; that was not the Fortran school, just incorrect behavior. Yes, how exactly x-math.floor(x*256+0.5)/256 is implemented is not fully specified in the Fortran school, but the Fortran school has some requirements --- the result must be numerically as accurate as you can get. And if, say, icc does not give as good an accuracy, then that's a bug. Now, in practice you don't get the precise IEEE rounding, so you might say you lose accuracy. Yes, you do, but only a little bit, and that is allowed by the Fortran school. So instead of 1e-15 you sometimes get only 1e-14; I think that is fine with the Fortran school. What is not fine with the Fortran school, however, is if instead of 1e-15 you get 1e-6 or 1e-4, or even 0 accuracy. That's unacceptable.

vtjnash commented 6 years ago

as I am discovering the rules myself

Having implementation-defined behaviors and having actual rules (standards) are not interchangeable. That's how we ended up with the mess that is the 386 and the horribly broken C standards written to try to formalize and accommodate it.

You must produce the best result you can, on the given hardware

But your implementation doesn't specify that, and doesn't do that. It specifies precisely that it should compute x * 256 + 0.5 using double-precision IEEE-format rounding and overflow. If you want the mathematically correct answer, then you need to use a mathematically correct library or language (such as Mathematica, SymPy, or Nemo).

it must be numerically as accurate as you can get

Then it must not be compiled with the fast-math flag. That flag explicitly discards accuracy in the interest of improving performance.

but only a little bit

There's actually no bound on how much accuracy might be lost in fast-math mode. You might instead get 0 accuracy. This depends on you checking the assembly output for your compiler on your code to see if it made a replacement that is not valid for your problem domain.

StefanKarpinski commented 6 years ago

It should be noted that every -ffast-math optimization example in this thread can be explicitly implemented by the programmer by straightforward mathematical simplification. If I really wanted ((x + y) - x) - y to be simplified to zero, I could just do that and write 0 instead. The compiler is not doing anything that the programmer couldn't do. On the other hand, in -ffast-math mode, there is no way to express that you want to compute the error of float addition short of using volatile (with a huge performance hit) or "write it in assembly". If "write it in assembly" is the answer, I think you've lost the argument. This was precisely my point above when I wrote:

IEEE is not the enemy of performance, it is precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.

The optimization that -ffast-math does in this case is easy to ask for explicitly, but the behavior that IEEE guarantees is impossible to express in -ffast-math mode. These are not symmetrical positions: IEEE mode is strictly more expressive than unconstrained "fast math".

There is, however, a class of optimizations that it's quite hard to express explicitly with strict IEEE semantics – namely the kinds of reassociations required for SIMD loop optimization: the compiler needs to be free to assume that it can reassociate reductions across loop iterations so that it can emit vectorized code instead of sequential code. Julia has a @simd annotation specifically to explicitly allow this transformation. But throwing the additional expressiveness of IEEE semantics out with the SIMD bathwater doesn't seem like a good tradeoff at all.
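A sketch of that annotation in use: the permission to reorder the reduction is scoped to this one loop, rather than granted globally by a compiler flag.

function simd_sum(xs::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]   # @simd allows reassociating this reduction across iterations
    end
    return s
end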

certik commented 6 years ago

I wrote a long post again, then deleted it, as I don't want to waste your time. I am discovering the rules of the Fortran school myself; I just know there are rules, and it's a matter of discovering them. Here is a better definition for @vtjnash of the IEEE vs Fortran school rules:

This brings us to @StefanKarpinski's point: how do you force a Fortran school compiler to do the IEEE operation ((x + y) - x) - y, or the Kahan summation (if the compiler is free to vectorize the loop)? The short answer is that you can't. The longer answer I gave above already: you simply build building blocks. You don't need assembly (I just gave it as an option); you can use an IEEE school compiler to compile a given function exactly the way you want, then expose it to Fortran (or a Fortran school language of your choice, which can be C compiled with -ffast-math) and use it. The point is that you don't need these very often. I've never needed Kahan summation, and I only needed ((x + y) - x) - y once in my life, in the openlibm implementation. And even there I would not use ((x + y) - x) - y; I would simply use x-math.floor(x*256+0.5)/256 instead, since I expect -ffast-math to deliver enough performance that I can afford to lose a little bit of it by doing x-math.floor(x*256+0.5)/256 instead of ((x + y) - x) - y.

@StefanKarpinski, @vtjnash are you claiming that it's impossible or hard to write the exp2 function in readable Fortran and still be faster than the openlibm version (because in Fortran one can use -ffast-math, while in openlibm you can't)?

Or perhaps you believe that it can be done, but that it's impossible to deliver guaranteed accuracy, and so it's better to sacrifice a little bit of speed for that? Well, I did use icc with an equivalent of -ffast-math on exp2, and it passed all your tests. So I don't think it's really losing any accuracy at all: at the end of the day, the floating point operations are done in hardware either way, and the actual accuracy is determined by the polynomial in exp2 and the way the numerical algorithm is constructed; the numerical error of the polynomial gives 0.503 ulp as the maximum error in the final answer (https://github.com/JuliaLang/openlibm/blob/b2520473c2b2f755ddf385f366fee097f68b71ba/src/s_exp2.c#L312), and so it will perform similarly with -ffast-math.

I have done similar experiments in the past. For example, I implemented fast double precision modified Bessel functions in Fortran and compared them against SymPy (see pages 152-156 in my Ph.D. thesis). I haven't seen any changes with -ffast-math in terms of accuracy, but it did get a lot faster. The absolute error is always < 1e-16, and the relative error always < 1e-15. It uses a minimax polynomial, similarly to the exp2 function. It does not use argument reduction, which would probably make it even faster / more accurate.

Perhaps your argument is that with -ffast-math I cannot guarantee the accuracy, while with IEEE you can guarantee the exp2 accuracy: no matter what platform or compiler, as long as it is IEEE compliant, the total error will not change and will stay below 0.503 ulp. In my case it might in principle get a bit bigger, though in practice it will stay at the 1e-15 level --- but yes, strictly speaking I can't guarantee even that, while I think you can with IEEE. Is that your point?

vtjnash commented 6 years ago

So if the user writes ((x + y) - x) - y, then the compiler can generate code for 0 instead, because 0 is represented very well in hardware, and you get 1e-16 accuracy

You get infinite error in this case (x / 0 = Inf). The original expression can compute a non-zero value. A compiler that assumes that floating point math is done in infinite precision is wrong. In practice, yeah it probably doesn't make a difference over the problem domain for most programs.

x-math.floor(x*256+0.5)/256 instead

This is not an equivalent expression for all x. I assume the original author used a precise bit-twiddling formula because it would return the intended answer for all values of x on any standards-compliant compiler.

but in practice it will stay on the 1e-15 level

I don't understand where the stated "1e-15 level" comes from. There's no limit on how far a fast-math result could diverge from the IEEE-compliant result. I refer you to the above example again, where the error was infinite.

vtjnash commented 6 years ago

but it did get a lot faster

Stefan's point is that with an IEEE compliant compiler, you could have gotten the exact same speedup simply by improving your code. For example, I see you have a lot of divisions in the original code. Those are slow, so the fast math flag will replace those with a lower-precision operation which is much faster. You could have achieved much the same result just by reordering the code slightly to make better use of the hardware. For example, instead of x**12/518918400, write x**12 * (1/518918400).
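A sketch of doing that rewrite by hand, so it does not depend on a fast-math flag (the constant is taken from the quoted expression; the function names are made up):

const INV_FACT = 1 / 518918400          # reciprocal folded once, at definition time

poly_term_slow(x) = x^12 / 518918400    # a division on every call
poly_term_fast(x) = x^12 * INV_FACT     # multiply by the precomputed reciprocal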

StefanKarpinski commented 6 years ago

I don't want to offend, @certik, but from this discussion it's not clear to me whether you're aware that all hardware these days implements IEEE 754 floating-point arithmetic. At a hardware level, there is nothing else. The only question is whether your compiler generates machine code that matches what you wrote or generates something else that it thinks you may have intended.

certik commented 6 years ago

@vtjnash can you point me to one x for which x-math.floor(x*256+0.5)/256 is not equivalent to x-((x+redux)-redux)? I know you have mentioned a few times that it's not equivalent, but I haven't been able to find an x that breaks it. You can use Python and define redux = float.fromhex("0x1.8p52") / 256.

I don't understand your argument about how changing ((x + y) - x) - y to 0 gives an infinite error. Perhaps you meant that in the IEEE school, ((x + y) - x) - y is equal to some (in general nonzero) eps, and so substituting 0 for eps gives an infinite relative error (eps-0)/0 = inf? Yes, that is true for IEEE, but in this context I was using the Fortran school, and there ((x + y) - x) - y is equal to 0.

Regarding your other point, both you and @StefanKarpinski (based on his last comment) seem to think that the only thing -ffast-math does is reorder operations, so that you don't need -ffast-math at all: you can simply tell the compiler to order things the way you want, and it will be just as fast. Am I right?

I don't think that's correct at all. But I need to study this in more detail to give an answer.

oscardssmith commented 6 years ago

If x is the largest finite Float64 (floatmax(Float64)) and y > eps(x), then x + y = Inf, so in IEEE arithmetic ((x + y) - x) - y is Inf, while in the Fortran-school simplification it is 0.
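
A quick check of that case in Julia (using floatmax; on older Julia versions the same value is called realmax):

julia> x = floatmax(Float64); y = 1e300;

julia> ((x + y) - x) - y   # x + y overflows to Inf, and Inf minus any finite number stays Inf
Inf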

simonbyrne commented 6 years ago

@vtjnash can you point me to one x, when x-math.floor(x*256+0.5)/256 is not equivalent to x-((x+redux)-redux)? I know you mentioned a few times it's not equivalent, but I haven't been able to find x that breaks this.

How about x = 0.0019531249999999998?

julia> x = 0.0019531249999999998
0.0019531249999999998

julia> x-floor(x*256+0.5)/256
-0.001953125

julia> redux = 0x1.8p52/256
2.6388279066624e13

julia> x-((x+redux)-redux)
0.0019531249999999998

This is a common problem with trying to "roll your own" round function by using floor(y + 0.5): if y is the floating-point number just below 0.5, then y + 0.5 rounds up to exactly 1.0, so the result is incorrectly rounded to 1 instead of 0.
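
For example (a minimal sketch; prevfloat(0.5) is the Float64 just below 0.5):

julia> y = prevfloat(0.5)
0.49999999999999994

julia> floor(y + 0.5)   # y + 0.5 rounds up to exactly 1.0, so floor gives 1 instead of 0
1.0

julia> round(y)
0.0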

vtjnash commented 6 years ago

@certik Here's a script that will enumerate all of the cases where your code is broken. It should only take a couple minutes to run.

f1(x) = x - floor(x * 256 + 0.5) / 256                      # the floor(y + 0.5) version
f2(x) = (redux = 0x1.8p52 / 256; x - ((x + redux) - redux)) # the bit-twiddling version
for i = UInt32(1):typemax(UInt32)
    fi = reinterpret(Float32, i)
    f1(fi) == f2(fi) || println(i)   # print every bit pattern where the two disagree
end

Here's an example (pulled at random):

julia> fi = reinterpret(Float32, UInt32(1187508885))
25593.291f0

julia> f1(fi)
-0.001953125

julia> f2(fi)
0.001953125

Note that it couldn't even get the sign right.

certik commented 6 years ago

OK, so one difference at the hardware level is all the trapping modes: -fno-trapping-math (and -fsignaling-nans and so on). Is there no performance hit when you keep all these options on for IEEE? I thought there was.

Another difference is the handling of denormal numbers, which -ffast-math disables for performance: https://en.wikipedia.org/wiki/Denormal_number#Disabling_denormal_floats_at_the_code_level. What I have read is that there is a performance hit if you allow them.

@StefanKarpinski can you comment on that?
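
(As an aside, Julia has a code-level switch for exactly this; a minimal sketch, flushing subnormal results to zero for the current thread:)

julia> set_zero_subnormals(true)   # ask the hardware to flush subnormals to zero (FTZ/DAZ on x86)
true

julia> 2.0^-1000 * 2.0^-30         # the exact result 2^-1030 is subnormal, so it gets flushed
0.0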

StefanKarpinski commented 6 years ago

Regarding your other point, it seems that both you and @StefanKarpinski (based on his last comment) seem to think that the only thing that -ffast-math does is to reorder operations, and so that you don't have to use -ffast-math at all, and simply tell a compiler to order things the way you want, and it will be as fast. Am I right?

In clang it permits the compiler to make the following invalid but "reasonable" assumptions:

  - floating-point arithmetic is associative and can be freely reordered and contracted;
  - NaNs, infinities and signed zeros never occur;
  - division can be replaced by multiplication with the reciprocal;
  - floating-point operations never trap and math functions never set errno.

There's no magic – all the compiler is doing is a bit of algebraic simplification with these rules. But the programmer can also use these same assumptions to simplify expressions explicitly – often resulting in a faster and/or more correct computation. But not always, which is why letting the compiler do it for you blindly is risky.
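
In Julia the same trade-off can be made locally with the @fastmath macro; a minimal sketch (whether the fast version actually folds to zero depends on what LLVM chooses to do):

f_ieee(x, y) = ((x + y) - x) - y             # compiled exactly as written
f_fast(x, y) = @fastmath ((x + y) - x) - y   # LLVM is allowed to simplify this to 0.0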

certik commented 6 years ago

Another issue is at the code level: for IEEE, you have to handle the NaNs and infinities with lots of if statements, e.g.: https://github.com/JuliaLang/openlibm/blob/b2520473c2b2f755ddf385f366fee097f68b71ba/src/s_exp2.c#L347. I can see at least 2 if statements that get executed every time, and you have to keep them for IEEE conformance. But if you drop that requirement, you can remove them completely, which speeds up the code, since the compiler can then inline better.

StefanKarpinski commented 6 years ago

Denormals are a different story. You're generally in a bad situation if your computation has hit denormals at all, so going straight to zero can be justifiable. I believe that you only pay the performance cost for denormals if you actually encounter them, but I may be wrong.

yuyichao commented 6 years ago

trapping math

I'm pretty sure modern hardware doesn't do that by default, so no.

what I read about that is that there is a performance hit if you allow them.

AFAICT only Intel has that issue, and it's not handled by generating different code.

yuyichao commented 6 years ago

I believe that you only pay the performance cost for denormals if you actually encounter them, but I may be wrong.

That's correct, and it's basically Intel-only (not even all of x86, IIRC). There's also nothing you can do in the code to ignore denormal numbers at the instruction level.

simonbyrne commented 6 years ago

Another issue is on the code level --- for IEEE, you have to handle the NaNs and infinities by tons of if statements, e.g.:

You only have to do that if you're writing an IEEE-correct math library. In most cases you can simply ignore them and they'll propagate correctly, or at worst give you a NaN result.
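
For instance (a small sketch of the "just let it propagate" case):

julia> x = 0.0 / 0.0    # a NaN produced somewhere in a computation
NaN

julia> 2x + 1           # no explicit checks needed; NaN propagates through the arithmetic
NaN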

yuyichao commented 6 years ago

by tons of if statements

Those tend to be very short if statements that have only a minor effect on code size. They also mostly only matter for code that must never deviate from IEEE behavior (those checks exist purely for predictability). If you don't have NaNs, which is the assumption you have to make to enable ffast-math without UB, those branches will be predicted very well, so they won't hurt performance much.

StefanKarpinski commented 6 years ago

for IEEE, you have to handle the NaNs and infinities by tons of if statements

This has nothing to do with whether your compiler is in IEEE mode or not. You don't have to write code with checks like that if you don't care about handling non-finite values. Of course, in -ffast-math mode, even if you write that code, the compiler will assume that NaNs and Infs don't occur and completely delete those branches. The resulting code might be a bit faster, but it won't give the right answer for non-finite values, even though you explicitly wrote code to handle them. In short: IEEE mode never forces you to write such checks, while fast-math mode silently deletes them if you do.

vtjnash commented 6 years ago

so that speeds up the code

It's cool that Julia makes it really easy to test things like this. So let's do it:

Profile.@profile for i = 1:10^8; exp(1 / i); end
julia> Profile.print()
      5   ./special/exp.jl:73; exp(::Float64)
      21  ./special/exp.jl:74; exp(::Float64)
      18  ./special/exp.jl:83; exp(::Float64)
      6   ./special/exp.jl:89; exp(::Float64)
      20  ./special/exp.jl:93; exp(::Float64)
      4   ./special/exp.jl:127; exp(::Float64)
      34  ./special/exp.jl:132; exp(::Float64)
      253 ./special/exp.jl:133; exp(::Float64)
      677 ./special/exp.jl:134; exp(::Float64)

So this test contributes between 1-2% of the cost of this function.
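
(One note for anyone re-running the snippet above on a newer Julia: the profiler was later moved into a standard library, so it has to be loaded first.)

using Profile   # needed on Julia 0.7 and later; in 0.6 Profile lived in Base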

certik commented 6 years ago

@simonbyrne, @vtjnash good points --- one has to use the round function. Python didn't have it, so that's why I simulated it with floor, and it does indeed fail in corner cases. Well, try this:

f2(x) = (redux = 0x1.8p52 / 256; x - ((x + redux) - redux))
f3(x) = x - round(x*256)/256   # round() avoids the floor(y + 0.5) corner cases
julia> for i = UInt32(1):typemax(UInt32)
           fi = reinterpret(Float32, i)
           f3(fi) == f2(fi) || println(i)
       end

So far it didn't find anything.

yuyichao commented 6 years ago

Also, I'm not really sure what is being argued here any more. No one disagrees that fast-math can give big performance improvements when used properly. However, in order to be predictable, which I strongly believe is a property a language/implementation should have, there are basically only two ways one can do it:

  1. Follow IEEE,
  2. Follow math, exactly.

(2) isn't implementable on current hardware due to hardware limitations, so the only way to achieve predictability is (1), following IEEE, which is as close to the mathematical definition as the hardware allows. That's why languages should use (1) as the default, and since fast-math is not (2), it can be offered as a non-default option/mode whose behavior is unpredictable by construction. It also means (as Jameson mentioned earlier) that fast-math should generally be used with care. Not necessarily at the instruction level, but you do need a fair bit of knowledge of what the compiler can do if you want to actually predict the output.

vtjnash commented 6 years ago

@certik it probably helps to check the top of the range first, since that's where we know most of the failures will occur:

julia> for i = typemax(Int32):Int32(-1):Int32(1)
           fi = reinterpret(Float32, i)
           f1(fi) === f2(fi) || println(hex(i))
       end

certik commented 6 years ago

@vtjnash good point. Here is an improved version of the script:

julia> for i = typemax(Int32):Int32(-1):Int32(1)
           fi = reinterpret(Float32, i)
           f3(fi) == f2(fi) || println(i, " ", reinterpret(Float32, UInt32(i)))
       end

So far it prints a bunch of NaNs, but we don't care about them.
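
A small follow-up sketch: filtering out the NaN bit patterns explicitly, so only genuine mismatches get printed (assuming f2 and f3 as defined above).

julia> for i = typemax(Int32):Int32(-1):Int32(1)
           fi = reinterpret(Float32, i)
           isnan(fi) && continue
           f3(fi) == f2(fi) || println(i, " ", fi)
       end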