JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

The gfortran benchmarks should use the "-march=native -ffast-math -funroll-loops" options #24568

Closed certik closed 6 years ago

certik commented 6 years ago

Specifically, when I compare

gfortran -O3 perf.f90 && ./a.out
gfortran -O3 -march=native -ffast-math -funroll-loops perf.f90 && ./a.out

on my machine, then I get 2x speedup on the iteration_pi_sum benchmark. I haven't tested the C code, but a similar speedup might be possible.

Note: if you use a recent gfortran compiler (e.g. I just tested this with 7.2.0), you can just use:

gfortran -Ofast perf.f90

And it will turn on the proper options for you, and produce optimal results.
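
For reference, here is a rough Julia sketch (from memory, not the exact benchmark source) of what the iteration_pi_sum kernel computes; the @fastmath macro opts a single block into the same class of relaxations that -ffast-math applies globally:

function pisum()
    s = 0.0
    for _ in 1:500
        s = 0.0
        @fastmath for k in 1:10_000
            s += 1.0 / (k * k)
        end
    end
    return s    # ≈ 1.644834, i.e. π²/6 minus the truncation error of stopping at k = 10000
end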

certik commented 6 years ago

@yuyichao thanks, that's the kind of technical discussion I am happy to keep being involved in.

Yes, one thing that I like about IEEE is that when you write code, however you write it, as long as the order of operations is given, you only have to test it once: since IEEE mandates that the results don't change, you can trust it afterwards.

However, the issue with this approach is that you still have to test it somehow, and as @StefanKarpinski said, you can't test all double precision numbers. So you have to have a solid testing that you trust. Once you do, I think you can test a particular compiler with -ffast-math and see if it works. I agree with you that you don't have guarantees in theory.

My claim about this was that in practice, -ffast-math does not lose accuracy, or give incorrect results with any of the compilers (gcc, Intel, ...). Otherwise that would be a bug in the compiler. Given the nature of my claim, it can't be "proven", but it can be disproven if you give me an example of code that follows the Fortran school guidelines that gets miscompiled. This led us to openlibm and exp2, and I have shown that if you adapt the code to follow my guidelines, it will not get miscompiled. So this eliminates exp2 as a counterexample to my claim. If you find another counterexample where -ffast-math miscompiles valid code that is not IEEE dependent, let me know; I am happy to look at it.

What you want, however, is more than a lack of counterexamples. What you want is a proof. I can't give you the proof that you want. All I can say is that if you write code, say in Fortran, using Fortran's floating-point primitives, and do not rely on any kind of IEEE-specific behavior, then if that code gets miscompiled by -ffast-math, it's a bug in the compiler.

Regarding IEEE checking for inf/nan --- @StefanKarpinski claimed above that sticking to IEEE will not incur a performance hit, because those branches will get predicted correctly. But that is not what I observed in practice, as there was a significant speedup from commenting out the if checks:

-ffast-math + round256(x):
0.431
-ffast-math + round256(x) + no inf/nan checks:
0.375

Again, this is just one example, I agree.

But if you claim something, and then I find a counter example, then the ball is in your court to show why my example does not apply, or you should retract the claim. Just like if you find an example of a valid code that gets miscompiled, the ball is in my court to show that either the code is not following the Fortran school guidelines, or it's a bug in a compiler, or that my claim is wrong.

yuyichao commented 6 years ago

Is BLAS not a coherent, reliable school of thought in numerical computing? Come on.

BLAS is completely different: it doesn't mandate a particular order, but it only allows a well-defined set of transformations. fast-math OTOH does not have this property.

It is pretty easy to prove that the error under fast-math is unbounded. AFAICT, the compiler is free to replace x with x + b - b. Will the compiler do that in simple cases? Probably not. Will the compiler effectively do that in more complicated cases? Absolutely, and just yesterday I saw a case where the compiler does something very similar due to register pressure (to avoid a stack spill, probably).

However, the issue with this approach is that you still have to test it somehow, and as @StefanKarpinski said, you can't test all double precision numbers. So you have to have a solid testing that you trust. Once you do, I think you can test a particular compiler with -ffast-math and see if it works. I agree with you that you don't have guarantees in theory.

No, the moment you add -ffast-math, the test you've done is only valid for that particular implementation and NOTHING else. Without fast-math, the test you've done on one compiler can be applied to every conforming compiler.

it can't be "proven"

And that's the single thing that is fundamentally required.

if you give me an example of code that follows the Fortran school guidelines that gets miscompiled

Logically, failing to give a counterexample says nothing about the validity of the approach. I totally agree it's unlikely, but I've also seen (or at least heard of, since I'm not old enough) countless cases where people who assumed something was good enough, essentially due to compiler/hardware limitations at the time, got completely screwed. That's why a well-behaved program needs to be based on proofs, and not just on the claim that "things seem to work on one implementation right now".

certik commented 6 years ago

@yuyichao I essentially agree with your last comment (for me and lots of other people, the level of proof that you need differs, but I have no issues with what you wrote). Btw, if -ffast-math changes x to x+b-b, then it's a compiler bug. Can you show me the example you found in practice? I would like to have a look.

mbauman commented 6 years ago

I particularly like this SO question: Can -ffast-math be safely used on a typical project?.

The push-back you're getting is simply that -ffast-math isn't safe to have on in a scientific library. You always need to know how the computer is going to do arithmetic. With -ffast-math, you effectively need to know two sets of rules: the basic IEEE semantics, and also how much math any given compiler has learned. The particularly challenging part is that the latter rules aren't well-defined! You aren't re-writing algorithms to be IEEE-independent, you're writing them to be dependent upon both IEEE and the current set of relaxations that your compiler's -ffast-math option provides.

My claim about this was that in practice, -ffast-math does not lose accuracy, or give incorrect results with any of the compilers (gcc, Intel, ...). Otherwise that would be a bug in the compiler.

Folks in this thread have given many counter-examples. I particularly like this example from the above SO question:

float X, XMin, Y;
if (X < XMin)
{
    Y= 1 / (XMin - X);
    XMin= X;
}

Without looking, can you see how this code might possibly raise an error under -ffast-math?

JeffBezanson commented 6 years ago

if -ffast-math changes x to x+b-b, then it's a compiler bug

What set of rules disallows this, but allows other transformations? Is it written down somewhere?

certik commented 6 years ago

@mbauman yes, the problem here is that it can sometimes divide by zero. The issue is that you have to be very careful: just a condition a < b is not enough to ensure that 1 / (b-a) is safe. The reason has to do with denormal numbers, I think --- I made some numerical examples about exactly this issue. Let me try to look them up.
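
For concreteness, a minimal Julia sketch of this failure mode, with assumed values (plain Julia keeps IEEE subnormals; -ffast-math on x86 typically enables flush-to-zero via the FTZ/DAZ bits, so the last step is described in a comment):

x    = 3.0e-308
xmin = 4.0e-308        # xmin > x, so the C snippet above takes the branch
d    = xmin - x        # ≈ 1.0e-308, a subnormal number
issubnormal(d)         # true
1 / d                  # ≈ 1.0e308 under IEEE semantics: large but finite and well defined
# With flush-to-zero, d is treated as 0.0 and the same division becomes 1/0.0 == Inf,
# i.e. a division by zero even though the code checked x < xmin first.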

StefanKarpinski commented 6 years ago

Ah, so it's relative error then? Here's a case where the relative error of left-to-right summation is unacceptable by your definition:

julia> v = sort!(randn(2^20));

julia> (foldl(+, v) - foldr(+, v))/sum(v)
2.1892709224229008e-12

This is just one set of numbers – this can be arbitrarily bad. How do you know if your data are problematic or not? You claim that pairwise summation is an "acceptable algorithm" – and use it to check correctness – but under -ffast-math there is no such algorithm since the compiler is free to optimize it to left-to-right summation, which is an "unacceptable algorithm".
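
For readers following along, pairwise summation is just a particular association of the same additions; here is a minimal sketch of the idea (a hypothetical helper, not Base.sum itself):

function pairwise_sum(v, lo = firstindex(v), hi = lastindex(v))
    # small base case: plain left-to-right summation
    hi - lo < 128 && return foldl(+, view(v, lo:hi))
    mid = (lo + hi) >>> 1
    # sum each half recursively, then combine; only the association of + differs
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid + 1, hi)
end

Under rules that allow arbitrary reassociation, nothing distinguishes this from foldl(+, v).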

StefanKarpinski commented 6 years ago

Worse still, here's a dirt simple case with infinite error:

julia> v = rand(2^20); v = sort!([v;-v]);

julia> (foldl(+, v) - foldr(+, v))/sum(v)
-Inf

certik commented 6 years ago

@StefanKarpinski in your last example you are making the mistake of using a relative error when the answer is around 0. You have to be careful there. When comparing two floating point numbers, in practice I use relative error if the answer is away from 0, but you can't use it around zero, so there I typically just use the absolute error.

The true answer is 0. Do you agree? Well, the summation algorithms got:

julia> foldl(+, v)
3.540501225529624e-11

julia> foldr(+, v)
-3.540501225529624e-11

So there the accuracy is around 1e-11. Here is how you can look at it:

julia> a = foldl(+, v)
3.540501225529624e-11

julia> b = foldr(+, v)
-3.540501225529624e-11

julia> a1 = 1 + a
1.000000000035405

julia> b1 = 1 + b
0.999999999964595

julia> (a1-b1)/b1
7.081002451309951e-11

The shift by 1 is arbitrary, you can shift the original input by 1 to get the same results:

julia> v = v + 1
julia> a = foldl(+, v)
2.097152000000031e6

julia> b = foldr(+, v)
2.0971519999999602e6

julia> (a-b)/b
3.386180225106792e-14

There you actually get even better relative error. Not sure why at the moment.

certik commented 6 years ago

@StefanKarpinski regarding pairwise summation --- how can -ffast-math optimize it out? The loops are explicitly given. I see, when it could somehow optimize out the recursive iterations, say in 10 years? I would need to think about it. I am not an expert in summation --- in practice I always use the naive algorithm, and assume it is completely independent. Even in the examples you posted, Julia's sum() gives results comparable to foldl(+, v) and foldr(+, v). What sorts of problems actually benefit in practice from using the pairwise algorithm, where the naive one would otherwise give incorrect results?

certik commented 6 years ago

@JeffBezanson:

What set of rules disallows this, but allows other transformations? Is it written down somewhere?

Other than my attempt to write them down above in this thread, I am not aware of anything. I agree it would be very nice to write the rules down, and that is why I was engaging in this thread so much: so that I can figure out the rules and write them down. Because you need to know how to write code properly if you want to use -ffast-math. Until this thread I didn't realize the need to write them down.

StefanKarpinski commented 6 years ago

You're missing the actual point which is that even for a modest amount of very reasonable data, this algorithm has completely unbounded error:

julia> v = rand(2^20); v = sort!([1e-9;v;-v]);

julia> abs((foldl(+, v) - sum(v))/sum(v))
0.05028216044108073

julia> abs((foldr(+, v) - sum(v))/sum(v))
0.05028216044108073

julia> v = rand(2^20); v = sort!([1e-12;v;-v]);

julia> abs((foldl(+, v) - sum(v))/sum(v))
Inf

julia> abs((foldr(+, v) - sum(v))/sum(v))
Inf

regarding pairwise summation --- how can -ffast-math optimize it out?

Because that is exactly what -ffast-math gives the optimizer permission to do. It could (and this is not as crazy as it sounds) see that you are calling a function that it knows computes a sum, and replace the function call to a different function which does left-to-right summation. Under the -ffast-math rules, that's a perfectly valid thing to do.

The loops are explicitly given.

Doesn't matter. The compiler does not guarantee preserving loops at all. In fact, a decent compiler will completely eliminate loops whenever it can.

I see, when it could somehow optimize out the recursive iterations, say in 10 years?

Today – easily. Just look for calls to the pairwise_sum function and replace them with calls to the lefttoright_sum function. This is a perfectly valid program transformation under -ffast-math rules. And aggressively optimizing compilers do actually do this sort of thing.

I would need to think about it. I am not an expert in summation --- in practice I always use the naive algorithm, and assume it is completely independent.

It is, in fact, usually fine. But that's not my point. The point is that your definition of what is and is not an "acceptable" algorithm does not make any sense. Absent that, and absent any limits on what the compiler is allowed to do with your floating-point code, you are left with only one thing: what this particular version of your compiler happens to emit when compiling your code. If it's ok, great, but that's just because you got lucky. Don't ever upgrade your compiler.

yuyichao commented 6 years ago

I agree it would be very nice to write the rules down

It's not just "very nice to" --- it is a must if you want to claim it's predictable.

StefanKarpinski commented 6 years ago

Because you need to know how to write code properly if you want to use -ffast-math. Until this thread I didn't realize the need to write them down.

What everyone else on this thread is essentially saying is that there are no such rules that are possible. Not that you haven't thought about them hard enough – they do not exist. If the compiler is allowed to algebraically simplify any floating point code as if it were math, you are completely screwed – there's almost nothing you can actually guarantee about non-trivial numerical codes.

certik commented 6 years ago

@StefanKarpinski You can't use relative accuracy around 0, as I wrote. Perhaps you missed my comment. The data you came up with v = rand(2^20); v = sort!([1e-9;v;-v]); looks reasonable to me, and as such any summation algorithm works just fine. They are all around 0:

julia> sum(v)
9.313225746154785e-10

julia> foldl(+, v)
1.0034779673873118e-9

julia> foldr(+, v)
1.0919978254975149e-9

I don't see what you are trying to prove here at all. As far as I am concerned, there is no problem here.

JeffBezanson commented 6 years ago

there are no such rules that are possible

That is my belief as well. I don't believe it is possible to write down a formal set of rules like "((x+y)-x)-y => 0 is allowed, x => x+b-b is not" that guarantees any kind of accuracy bound. There certainly might be research questions lurking around there, but for some version of the claim I'm fairly sure it's impossible.

certik commented 6 years ago

@JeffBezanson I would need to get in touch with some compiler writers and ask them what sorts of rules they follow. I don't know.

But what I and others assume is pretty simple: if you sum values, they should not mix very small and very large magnitudes, otherwise you lose accuracy significantly. So if you write code like x+b, and later subtract b, you assume that b is not 2^52 while x = 1. The compiler can then simplify this to just "x". The other way round, to "simplify" x to x+b-b is not allowed, because the compiler can only assume you know about "x", not about "b", which can be arbitrary, and thus you would lose accuracy. If you write ((x+y)-x)-y, then the compiler assumes that x and y are well conditioned together, and thus this simplifies to 0.

The rules can perhaps be as simple as "don't introduce new variables" and "assume the expression as written by the user is well conditioned; don't screw it up", or something like that.

If the rules were arbitrary, or if -ffast-math working was just an accident, as you all seem to assume, then it should be easy for you to find a simple counterexample of perfectly valid code that completely breaks. You have been unable to do so.

You keep saying there can't be any rules, yet there is no counterexample. So I actually think there are rules.

Anyway, we reached a point where I don't know how else I can contribute to the discussion. You will keep not using -ffast-math, I will keep using -ffast-math, and everybody will be happy.

If anyone finds an example which breaks with -ffast-math, please let me know.

StefanKarpinski commented 6 years ago

@JeffBezanson I would need to get in touch with some compiler writers and ask them what sorts of rules they follow. I don't know.

You mean like all the compiler writers with extensive numerical programming expertise on this thread? The ones who believe that this is impossible? Yes, they might be good people to ask.

certik commented 6 years ago

@StefanKarpinski Stefan, what sorts of rules does gcc follow when implementing -ffast-math, and especially how does it decide about expression rewriting? I am serious.

vtjnash commented 6 years ago

"simplify" x to x+b-b is not allowed

This would be a profitable simplification for example on the following code, since it reduces register pressure (it only needs to preserve y and b over "<do other stuff>", instead of all of x, y, and b), thus making this code faster, while not violating any fast-math rules:

y = x + b
use(y)
<do other stuff>
use(x, y, b)

certik commented 6 years ago

@vtjnash can you also write how the compiler could rewrite it? I am not sure which way you meant it.

vtjnash commented 6 years ago

the rewrite is to replace x with x + b - b:

y = x + b
use(y)
<do other stuff>
use(y - b, y, b)
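
A small numeric sketch, with assumed values, of what this rewrite can cost: recovering x as y - b only works when x and b have comparable magnitude.

x = 1.0e-8
b = 1.0e8
y = x + b        # the exact sum is not representable; the low bits of x are lost here
y - b            # 1.4901161193847656e-8, not the original 1.0e-8 (about 49% relative error)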

mbauman commented 6 years ago

This isn't entirely dependent upon the compiler. There's also an interplay with the operations a CPU provides.

What built-in functions are used by GCC depends on the mode. By default, built-in functions are only used when they provide the semantics guaranteed by the C99 standard. However, with -funsafe-math-optimizations, substituted functions may have less precision or be restricted to a smaller domain.

Draft/Wiki: Semantics of Floating Point Math in GCC

What precision does fsin(x) have under -ffast-math? Who knows! Depends upon the processor you're using and if they've decided to enable this transform. Yes, yes, I know you're going to state that the problem is that calling sin(x) with an unreduced argument is poorly-defined. What about 1/sqrt(x)? Is 12 bits of accuracy ok for -ffast-math?

Edit: Also note that this means that what the bounds/expectations are for -ffast-math today will almost certainly change under the CPUs of tomorrow.

certik commented 6 years ago

@mbauman good comments.

Regarding fsin, that seems like a bug in the CPU. There, I guess the solution is to either not use the instruction or work around it somehow.

Regarding sqrt(x), as Tobias said in that thread:

I think that even with -ffast-math 12 bits accuracy is not ok. There is the possibility of doing another newton iteration step to improve accuracy, that would be ok for -ffast-math. We can, though, add an extra flag -msserecip or however you'd call it to enable use of the instructions with less accuracy.

And I agree with him. I met Tobias once actually, he is a great guy. I should ask him what he thinks about -ffast-math.

I think if one wanted to reduce accuracy for speed, then one can switch from double precision to single precision. So -ffast-math should not reduce accuracy like that.
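
For reference, the "another newton iteration step" Tobias mentions is a one-line fix-up; a minimal sketch (not the actual GCC code):

# One Newton-Raphson step for r ≈ 1/sqrt(x); it roughly doubles the number of correct
# bits, which is how a ~12-bit rsqrt estimate gets pushed toward full single precision.
refine_rsqrt(x, r) = r * (1.5 - 0.5 * x * r * r)

refine_rsqrt(2.0, 0.70)      # ≈ 0.707, much closer to 1/sqrt(2) ≈ 0.7071 than the input 0.70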

mbauman commented 6 years ago

In a hypothetical future, some new CPU could provide a super-fast and approximate exp2 operation. With -ffast-math, your compiler could happily call that instead of the one in your libm. That implementation that you just showed was "just fine in practice" and ffast-math compatible? It just got totally clobbered by something with unknown error bounds and domain because you used -ffast-math when trying to call it.

Of course, this is a little circuitous because you typically aren't implementing the standard libm. Your own functions aren't going to get clobbered entirely, but the behavior of functions they call and operations they perform can and do change for various instruction sets at the whims of compiler teams as they find it advantageous and acceptable under -ffast-math. Will 12 bits of accuracy for this new builtin be deemed ok? Almost certainly not — it seems they decided to add an extra Newton-Raphson step to hit ~2ULPs. But I don't know where they'll draw the line… it'll probably be dependent upon how big of a performance win it is. Will that change in precision — whatever it is — push your algorithm outside of your own acceptable bounds? Will you still be inside its supported domain? I don't know, and with -ffast-math enabled this (and other changes) can shift under your feet.

So I hope you can appreciate why we consider -ffast-math to be dangerous when developing a library that strives for precise results with stable tolerances. This is just one small aspect of what that flag might do. While you may have a theory for how you think it should work, in practice I don't think it forms a coherent school of thought for algorithm design. It's just a bunch of independently chosen trade-offs that happen to be faster than the alternatives on our current batch of CPU instruction sets.

certik commented 6 years ago

@mbauman thanks for the input. I read the whole IEEE standard. I have a few questions:

I have tons of other questions that kind of build on top, but if you wouldn't mind answering these, that would be extremely helpful.

yuyichao commented 6 years ago

that should be "correctly rounded"

No it doesn't? It's "Recommended correctly rounded functions".

In other words, it is up to you or me to decide whether the numerical values are acceptable. The standard does not say what is acceptable.

AFAICT it just means that such an option can be provided. C/Fortran/Julia all do. It doesn't seem to say anything about how those options should be used, or anything that helps define what fast-math is.

Section 11. says that a language should allow reproducible results, and the standard says how to achieve it (i.e. not using value changing optimizations...), but my reading of it is that this is not the only mode of operation. It just has to be available for you to use if you want it.

I take it that you agree fast-math is unpredictable now since the way to achieve reproducibility is to not use it.

ihnorton commented 6 years ago

@certik I think you are misunderstanding the point of contention here. No one at all is arguing against the existence of -ffast-math, or your use of it, or questioning the correctness of your particular results upon having done so. You are exactly the kind of expert user for whom these options are intended. The actual point of contention is whether this should be the default or considered an expert option. The Julia position is:

we consider -ffast-math to be dangerous when developing a library that strives for precise results with stable tolerances.

And this is exactly the same position that is taken by gfortran: notice that you must opt-in because they do not enable -ffast-math at any standard optimization level.

It is also exactly the same position that is taken by NumPy:

fast-math allows the compiler to do all kinds of dangerous shenanigans, like numerical unstable complex divisions, use of very inaccurate reciprocal approximations (rcp/rsqrt), reorderings and ignoring of error codes.
It should only be used in special cases where accuracy really does not matter or the impact of it has been evaluated. It should never be the default of a build system.
unsafe-math-optimizations is the safer option, but should still not be the default.

Even the current default unroll-loops is questionable (and we do have a bug about it).

certik commented 6 years ago

@ihnorton I appreciate your comment. For a library like openlibm, if you choose to provide IEEE guarantees (and you did), then I fully agree with you that you have to be very careful with -ffast-math. Also, if you want bitwise reproducibility, i.e. getting exactly the same floating point numbers, you can't use -ffast-math. I mean, that is obvious. I don't think I ever claimed otherwise in this thread.

The contention is about my claim that it is possible to write numerical algorithms in a way that makes it safe to use -ffast-math in practice; people say it's "impossible" and they question my use of it, sometimes with personal attacks. Several people on this thread also argue a straw man --- a position I never stated or maintained --- which makes it pretty much impossible to hold any kind of discussion.

However, there is still one thing that I would like to discuss --- and that is your own position. Because the way I understood it from this thread is very different to what I understand from the IEEE standard.

Does the standard say that exp2 must be correctly rounded? @yuyichao says it doesn't, that it only "recommends" it. However, section 11 says that reproducible operations are those from section 5 and sections 9.2 and 9.3, and section 9.2 lists the exp2 function. But if the exp2 function is not correctly rounded, then user code that uses it is not reproducible, since if I switch the libm library, I get different results. I guess the question is what "reproducible" means. If I keep using exactly the same openlibm version, then it is reproducible. But if I keep using exactly the same compiler version and system, then even -ffast-math is reproducible. I thought that reproducibility means that I can use any IEEE compliant system, but can swap compilers or standard libraries (in which case -ffast-math is not reproducible). What if somebody improves the openlibm library so that exp2 rounds correctly for all x? Then the user's results change.

What exactly does IEEE mandate on the exp2 function in openlibm?

JeffBezanson commented 6 years ago

To me the point is that openlibm exp2 at least has a spec like "gives an answer within 1 ulp of the correctly-rounded answer". -ffast-math doesn't come with any similar statement.

Above you mentioned an idea I think is pretty interesting: have the compiler do an analysis that decides which transformations are safe using the assumption that the user's code is numerically stable --- basically using the programmer as an oracle to tell you the relative magnitudes of numbers. It seems like such a technique might have been explored at some point. However, I'm pretty sure gcc et al. don't do anything like that. I could be wrong though.

certik commented 6 years ago

@JeffBezanson the problem is that this spec is not transferable I think --- if my library depends on openlibm, and I figure out a maximum error of my algorithm, the minute I switch (or upgrade!) openlibm, I have to redo the maximum error, because the new libm library will have a different spec, perhaps 2 ulp, or perhaps 1 ulp but for different x, which influences my algorithm in different ways. In other words, I have to redo the maximum error analysis. But if you have to redo your maximum error every time you change the libraries you depend on, then what's the point in the first place? How is it different from my using -ffast-math? Strictly speaking, I also have to redo the maximum error every time I switch compilers (even compiler versions), so it's more tedious, but it's not fundamentally different. The only thing that is different, it seems, is that the new openlibm will at least have some reasonable bound, like 2 ulp instead of 1 ulp, and so my code will hopefully keep working, while theoretically -ffast-math can screw things up much more.

But I was led to believe in this thread that IEEE guarantees you much more than this. Update: IEEE does guarantee more than this; the problem is that openlibm is not IEEE compliant (if I understood the standard correctly). If openlibm were IEEE compliant, i.e. correctly rounding, then I could depend on it: my algorithm's error bounds would not change if I switched or upgraded openlibm to another IEEE compliant library. That's a guarantee that I like. But as it is now, IEEE only guarantees the error bounds of openlibm itself no matter where you compile it, not of my code in general, and I am tied to a particular openlibm version.

mbauman commented 6 years ago

I introduced the libm-function-swapping behaviors of ffast-math as yet another example of crazy things it can do (and a plausible way that its behaviors may change), but you're right it's slightly different because the non-ffast-math behaviors aren't guaranteed by the standard.

Let's hammer down a perspective here: are you using libm or are you implementing it? I shifted things around in that last example, but for the vast majority of this thread folks have been talking about how you might implement a numerically precise function within a library like libm. This is probably at the root of the mismatch.

If you're just using libm on your own machine, then it's up to you if you want to flip that -ffast-math switch. You can check and see if the results stay the same or are close enough for your compiler and processor. Do that across a sufficiently wide range of inputs, and you can be fairly well satisfied.

But if you're providing a library like libm, you don't have that luxury. The library needs to be precise across the entire domain and on many platforms. Let's see, does gcc on ARM do comparisons with an 80bit intermediate result under ffast-math? What about PPC? How does the existence of fma change the error guarantees here? Is the new version of clang sufficiently smart enough to see that I'm trying to access the error term of a subtraction and "helpfully" cancel it out for me? Might my carefully-crafted pairwise summation algorithm get re-associated into a left-to-right summation? Does that new intel processor have a faster-but-lower-accuracy fused-add-then-rdivide op that may screw things up here? Those are all questions that have well-defined answers from the standard if you don't use ffast-math. Good luck figuring out all that for all the supported permutations of compilers (and compiler versions) and architectures on which folks will compile your library. That's the impossible part.

It's just so much easier to live in a standards-defined world, and incrementally opt into optimizations by hand.

certik commented 6 years ago

are you using libm or are you implementing it?

I am using it! I think you nailed it, @mbauman. Throughout this thread I have been talking from the perspective of developing a numerical algorithm, either as an application or a library, which however uses tons of building blocks like openlibm (no IEEE guarantees) and BLAS (no IEEE guarantees, see my quotes above).

So it's all nice that you use the IEEE guarantees while developing openlibm. But by the time it gets to me, you have destroyed the IEEE property, because exp2 is not IEEE compliant. If I use a different libm, everything changes in my code.

And with BLAS, which does not guarantee the order of operations, I can't guarantee anything at all. As you said, my code has to work across platforms, across all libm implementations, across all BLAS and LAPACK implementations.

So how am I supposed to use IEEE to guarantee results like you can do for openlibm?

If you're just using libm on your own machine, then it's up to you if you want to flip that -ffast-math switch.

Except you can't! Because the openlibm maintainers said they will not support -ffast-math, and as shown above, I had to submit a PR to make it even produce correct results with -ffast-math. So I am stuck with the slower version of openlibm.

yuyichao commented 6 years ago

are you using libm or are you implementing it? All this thread I am talking about a perspective of developing a numerical algorithm, either as an application or a library, that however is using tons of building blocks like openlibm (non IEEE guarantees), BLAS (non IEEE guarantees, see my quotes above). So it's all nice that you use the IEEE guarantees while developing openlibm. But by the time it gets to me, you have destroyed the IEEE property, because exp2 is not IEEE compliant. If I use a different libm, everything changes in my code.

I believe most libm implementations don't have that. However, that's not what matters at all: even when a libm returns an error, it is usually bounded. When you switch libm implementations, you should make sure that the new implementation is accurate enough. It is fundamentally different from what fast-math does, since fast-math is a global transformation and it is impossible to put a bound on what it can produce. Therefore, given the status of libm implementations, you might not be able to get exactly reproducible results, but the (bound of the) result is still predictable/reasonable. As I said many times, I'm totally fine with transformations that change results as long as they are well defined.

Except you can't! Because the openlibm maintainers said they will not support -ffast-math, and as shown above, I had to submit a PR to make it even produce correct results with -ffast-math. So I am stuck with the slower version of openlibm.

This is wrong. As others have said, fast-math won't do anything that you can't do directly. So all you need to do is figure out what transformations the flag does and do that directly in the source.
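
As one concrete illustration of "doing it directly" (a sketch in Julia, since that is what most examples in this thread use), the same opt-in can be made per expression rather than globally:

a, b, c = 1.1, 2.2, 3.3
y1 = a * b + c              # exact IEEE order: multiply, round, add, round
y2 = muladd(a, b, c)        # explicitly permits a fused multiply-add, much like -ffp-contract
y3 = @fastmath a * b + c    # opts just this one expression into fast-math style rewrites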

certik commented 6 years ago

This is wrong. As others have said, fast-math won't do anything that you can't do directly. So all you need to do is figure out what transformations the flag does and do that directly in the source.

@yuyichao I don't believe that. At this point, I would like to see some code from you guys showing me how you can match the performance of my PR here: https://github.com/JuliaLang/openlibm/pull/171. There were two things I did there: I used -ffast-math, and I got rid of the NaN/Inf handling.

I claimed that -ffast-math can maintain good accuracy and increase performance, while keeping the code nice and readable. Talk is cheap, so I showed you the code for how to do that. Now it's your turn to show me the code for what you claim: keep the IEEE guarantees, but match the performance of my PR.

If it's possible at all --- and I don't think it's easy --- you would have to invest significant effort to match the speed of my PR. But I will be happy if you can prove me wrong.

certik commented 6 years ago

Regarding the IEEE compliance of openlibm, I have asked here, as stackexchange is a better forum to discuss this well-defined question. @yuyichao I would be interested if you provided an answer there.

yuyichao commented 6 years ago

I used -ffast-math. You claim it is possible to match that speed with some options. Please show me. The only person who helped was @simonbyrne (again, thank you Simon!). He said that maybe it is the -ffp-contract=fast, so I tried that and it wasn't it. I don't know what else it could be.

Sure, that can be worked on, but that is totally unrelated to using fast-math. See below (which repeats what I've said many times above).

I got rid of the NaN/Inf handling

This is strictly wrong, so doing that is not on the table at all, no matter whether it has any performance impact.

so I showed you the code for how to do that. Now it's your turn to show me the code for what you claim: keep the IEEE guarantees, but match the performance of my PR.

No, you didn't. I'll repeat that you can never show, no matter how much code you write and no matter how many tests you've done, that fast-math can produce reliable results. The only way to do that is to give a proof that is based on a solid specification.

If you want to prove that fast-math is safe to use, showing you how to do what we claim is completely unrelated. For the task of improving openlibm performance in general, making sure the code can run as fast as what you currently get with fast-math is the right thing to do. That's also why I don't want to talk about what you can achieve with fast-math on a particular compiler since that's completely off topic.

certik commented 6 years ago

@yuyichao you keep arguing a position I don't hold. I didn't say reliable (as in based on IEEE standard or a solid specification), I said "good accuracy". So you keep arguing that -ffast-math is not reproducible or not following the IEEE standard --- but of course I agree with you on that. But that's not what I am saying. What I am saying is that -ffast-math provides "good accuracy" (on a particular platform for a particular code/compiler/system, etc.). So how do you check for "good accuracy"?

Well, that is actually my question for you: how do you check the accuracy of exp2 in openlibm? The way I did it in https://github.com/JuliaLang/openlibm/pull/171 was to run 100,000 random numbers from the allowed range and check exp2 against a quadruple precision computation in Fortran, and I found that it is rounded correctly in 99.96% of cases and is only 1 ulp off in the rest --- which is very impressive! If you write a library like openlibm, based on the IEEE standard, how do you check your implementation to satisfy "Accuracy: Peak error < 0.503 ulp for normalized results."? Do you just run it like I did, or is there some math way to prove this peak error?
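
(For comparison, roughly the same random-sampling check can be written in Julia against a BigFloat reference instead of quad precision Fortran; the helper below is made up for illustration.)

function ulp_error(x::Float64)
    y   = exp2(x)                 # implementation under test
    ref = exp2(big(x))            # high-precision reference via BigFloat/MPFR
    Float64(abs(big(y) - ref) / eps(Float64(ref)))   # error in units in the last place
end

xs = 1022 .* (2 .* rand(100_000) .- 1)   # random points roughly in (-1022, 1022)
maximum(ulp_error, xs)                   # ≤ 0.5 means correctly rounded on this sample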

JeffBezanson commented 6 years ago

the problem is that this spec is not transferable I think --- if my library depends on openlibm, and I figure out a maximum error of my algorithm, the minute I switch (or upgrade!) openlibm, I have to redo the maximum error, because the new libm library will have a different spec

This is a misdirection. The thing that matters is that there is a spec. No compiler can or does make a statement like "using -ffast-math will give results within 1000 ulps of the original program". Furthermore, accuracy standards tend to be adopted pretty widely --- more than one libm implementation offers 1ulp accuracy. And every libm should document pretty clearly what the error bounds are. That's a totally different world from -ffast-math, where you can't even get a concise description of what it does, let alone an error bound.

It's reasonable to say "-ffast-math provides good accuracy with this compiler on this program", because that can indeed be empirically validated. But -ffast-math overall cannot be said to have any particular accuracy, and there's no way to implement it that promises any kind of retention of accuracy.

certik commented 6 years ago

It's reasonable to say "-ffast-math provides good accuracy with this compiler on this program", because that can indeed be empirically validated. But -ffast-math overall cannot be said to have any particular accuracy, and there's no way to implement it that promises any kind of retention of accuracy.

@JeffBezanson absolutely. I agree. I never said anything to the contrary in this thread I hope.

certik commented 6 years ago

Can you guys please answer my questions from: https://github.com/JuliaLang/julia/issues/24568#issuecomment-345136849. They didn't get answered.

ViralBShah commented 6 years ago

Perhaps this discussion is best moved to discourse? Seems quite a bit off-topic by now.

certik commented 6 years ago

@ViralBShah ok, let's move it there. I have many unanswered questions in this thread, and they got lost. I would like those to be answered.

@JeffBezanson here is a list of the -ffast-math transformations that gcc does: https://gcc.gnu.org/wiki/FloatingPointMath, since you were asking about it.

JeffBezanson commented 6 years ago

Of course the existence of -ffast-math doesn't violate the IEEE standard. I don't think anybody claimed it does. Only when you turn the flag on do you get non-IEEE arithmetic.

The IEEE spec isn't a sacred text. It's just a leading example of a description of operations that makes it possible to understand what a program does. "Openlibm gives answers within 1ulp" is another such description. Since that description is concise and comprehensible, you can for example easily decide that it's not good enough, and use crlibm instead. But -ffast-math is a fuzzy ball of transformations whose effects are hard to reason about. The more specific flags like -ffinite-math-only are far better since they're easier to reason about, and if two compilers provide it they probably do the same thing.

certik commented 6 years ago

@JeffBezanson thank you. I will ask more well defined questions on Discourse.

simonbyrne commented 6 years ago

If you write a library like openlibm, based on the IEEE standard, how do you check your implementation to satisfy "Accuracy: Peak error < 0.503 ulp for normalized results."? Do you just run it like I did, or is there some math way to prove this peak error?

There are 2 ways: a) you exhaustively test all such values. This isn't too hard for a single-argument Float32 function, but is typically too expensive for multiple-argument or Float64 functions (though it has been attempted). Note that this is really only valid for a particular binary: if you recompile with a different compiler or different flags you have to redo it all over again.

b) You prove it using numerical error analysis. This requires some assumptions: the main one being that elementary arithmetic operations -- +,-,*,/, sqrt -- obey IEEE properties. A good example of this is Table-Driven Implementation of the Exponential Function in IEEE Floating-Point Arithmetic by Peter Tang. Unfortunately this tends to be rather tedious, with lots of potential pitfalls. There has been some great work recently on formal proofs using automated theorem provers (e.g. http://www.cl.cam.ac.uk/~jrh13/papers/tang.html), but it is still far from easy.
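
A minimal sketch of approach a) for a single-argument Float32 function, using a BigFloat reference (slow but straightforward; the function name and structure here are illustrative only):

function max_ulp_error_exhaustive(f, ref)
    worst = big(0.0)
    for bits in UInt32(0):typemax(UInt32)    # every Float32 bit pattern
        x = reinterpret(Float32, bits)
        isfinite(x) || continue
        exact = ref(big(x))                  # high-precision reference value
        r = Float32(exact)
        isfinite(r) || continue              # skip results that overflow Float32
        worst = max(worst, abs(big(f(x)) - exact) / eps(r))
    end
    return worst
end

# e.g. max_ulp_error_exhaustive(exp2, exp2) --- exhaustive, so expect a very long run time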

certik commented 6 years ago

@simonbyrne thanks. Yes, it's a lot of work either way. What I meant is --- what work you actually did with openlibm to guarantee the accuracy? I think you did neither a) nor b).

The reason I am asking is that people in this thread told me they can't trust my results based on -ffast-math. So I am asking what actual work you did so that I can trust your results from openlibm, i.e. so that I can trust that exp2 is truly at most 1 ulp off for any input x.

yuyichao commented 6 years ago

you keep arguing a position I don't hold. I didn't say reliable (as in based on IEEE standard or a solid specification), I said "good accuracy".

"good accuracy" for what??? You only show that it provide good accurracy for a few existing compiler implementations, and that's exactly what I mean by unreliable/unpredictable.

So you keep arguing that -ffast-math is not reproducible or not following the IEEE standard --- but of course I agree with you on that. But that's not what I am saying. What I am saying is that -ffast-math provides "good accuracy" (on a particular platform for a particular code/compiler/system, etc.). So how do you check for "good accuracy"?

As I said, you are welcome to submit a patch that does not slow down the code compiled without fast-math and that still does the same thing (so no NaN check removed), if that makes it possible to compile it on the few compilers you've tested the fast-math mode with. (edit: and it needs to not slow down anything on any platform we support)

The reason I am asking is that people in this thread told me they can't trust my results based on -ffast-math.

Correct, and I repeat again that no amount of testing you can do can be trusted with fast-math. It doesn't matter if you do the same tests that have been done on openlibm (or the freebsd libm this is forked from).

certik commented 6 years ago

Correct, and I repeat again that no amount of testing you can do can be trusted with fast-math. It doesn't matter if you do the same tests that have been done on openlibm (or the freebsd libm this is forked from).

I think this is the crucial part that we need to clarify. Then we can wrap up this discussion.

Please tell me exactly what you did so that I can trust the maximum error result for exp2 from openlibm. I think you didn't do anything; you just took it from freebsd. Since the exp2 code is compiled with an IEEE-compliant compiler, whatever tests the freebsd guys did are still valid for openlibm. Ok. But what exactly did the freebsd guys do to obtain the maximum error bound for exp2?

Did they use some math, or just "smoke testing", as @StefanKarpinski calls it? ;)

simonbyrne commented 6 years ago

@simonbyrne thanks. Yes, it's a lot of work either way. What I meant is --- what work you actually did with openlibm to guarantee the accuracy? I think you did neither a) nor b).

I personally haven't, but the openlibm code is a modernised version of the msun code, which was written largely by Kahan's students. You'll see that throughout the code there are large blocks of comments which detail the error bounds, albeit in a slightly informal manner, e.g. here.

As to the matter of trust: well, that is ultimately up to you, but I do trust the openlibm code, as it was written by top-notch numerical experts, has withstood 2 decades of widespread use, and the parts that I have read and analysed are clearly intelligently written and very well thought through.