JuliaLang / julia


The gfortran benchmarks should use the "-march=native -ffast-math -funroll-loops" options #24568

Closed. certik closed this issue 6 years ago.

certik commented 6 years ago

Specifically, when I compare

gfortran -O3 perf.f90 && ./a.out
gfortran -O3 -march=native -ffast-math -funroll-loops perf.f90 && ./a.out

on my machine, then I get 2x speedup on the iteration_pi_sum benchmark. I haven't tested the C code, but a similar speedup might be possible.

Note: if you use a recent gfortran compiler (e.g. I just tested this with 7.2.0), you can just use:

gfortran -Ofast perf.f90

And it will turn on the proper options for you, and produce optimal results.

yuyichao commented 6 years ago
julia> f2(x) = (redux = 0x1.8p52 / 256; x - ((x + redux) - redux))
f2 (generic function with 1 method)

julia> f3(x) = x - round(x*256)/256
f3 (generic function with 1 method)

julia> x = reinterpret(Float32, UInt32(2074916431))
1.7935454f36

julia> f2(x)
0.0

julia> f3(x)
-Inf32
simonbyrne commented 6 years ago

So far it didn't find anything.

That looks like it should be equivalent, assuming your round breaks ties by rounding to the nearest even (which Julia does, but C for example does not).
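
For example, Julia's default round breaks ties to even, while the C library's round takes halfway cases away from zero:

julia> round(2.5), round(3.5)   # ties go to the nearest even integer
(2.0, 4.0)

(In C, round(2.5) would give 3.0.)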

The problem is that you've replaced a line which does 3 additions/subtractions with one which does a subtraction, 2 multiplications, and a round (since 256 is a power of 2, it is allowed to change the division to a multiplication without any problems). This will take longer on almost any processor.

simonbyrne commented 6 years ago

@yuyichao To be fair, the redux constant would have been chosen for Float64 values.

certik commented 6 years ago

@simonbyrne if we assume that they are equivalent (I don't know if @yuyichao found a counterexample or not), then you are arguing speed as the disadvantage. Look up my comments above (you can replace a power-of-two multiplication by a special function) and so on, especially about providing a faster implementation if that is needed. Either way, that is a whole different discussion from "it cannot be done without IEEE".

StefanKarpinski commented 6 years ago

Either way, that is a whole different discussion from "it cannot be done without IEEE".

So far we've provided numerous examples of things that cannot be done in -ffast-math mode:

The proposed solution is "have library functions that implement all of those as 'primitives'." However, there is an unbounded number of things you cannot do without IEEE guarantees, so that means you want to add an infinite number of new primitives. Moreover, if you put these operations behind function calls, then they will become much slower – since most of them are just a couple of instructions. If you had to implement Kahan summation using an add_error function, it would be much slower; and if the compiler is allowed to inline the add_error function, then it would presumably optimize away the inlined definition, breaking the implementation.
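
To make the Kahan point concrete, here is a minimal sketch in Julia (not Base's implementation): the compensation step computes a quantity that is algebraically zero, which is exactly the kind of expression a fast-math compiler is allowed to delete.

function kahan_sum(v::Vector{Float64})
    s = 0.0
    c = 0.0                 # running compensation for lost low-order bits
    for x in v
        y = x - c
        t = s + y
        c = (t - s) - y     # algebraically zero; in floating point it recovers the rounding error
        s = t
    end
    return s
end

Hiding the `(t - s) - y` step behind an opaque add_error-style function call keeps the compiler from cancelling it, but then you pay a function call on every iteration.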

On the flip side, there is nothing -ffast-math can optimize that you can't also do in IEEE mode by manually applying the same simplifying rules and other compiler annotations.

simonbyrne commented 6 years ago

I missed that comment. ldexp is almost certainly going to be slower than a multiplication on any processor from the last 10 years.

yuyichao commented 6 years ago

If this counts as "implementable without IEEE" then there's basically nothing that's not implementable. The simplest proof is that there exist software floating-point libraries that are IEEE compliant, so you don't need any knowledge from the compiler at all to get IEEE behavior from scratch, and then you can build everything from there. That's not much different from saying that a slower implementation can be opaque enough to defeat compiler optimizations, so that the outcome becomes predictable simply because the compiler can't do anything anymore.

Providing a faster implementation

So far I've only seen an asm implementation being mentioned. To start, as Stefan has already pointed out, that's basically not an argument. The point of having C/Julia/whatever language is to provide a higher abstraction over those lower-level operations; saying that to add a feature you have to stop using the language is not really the same thing anymore.

Also, with modern compilers, a few lines of inline asm are rarely good for performance anymore. Compilers are smart enough at this low-level stuff, and while you can sometimes still beat them with careful instruction scheduling and such when writing a whole function in asm, adding a few lines of inline asm will generally only hurt performance, since they usually restrict the compiler from freely choosing the form of the instruction that fits the surrounding code better. It's usually a wash at best.

certik commented 6 years ago

@simonbyrne ok, let's use a multiplication then. It's much clearer anyway.

@StefanKarpinski, yes, what you just wrote is correct. There are things you cannot do with -ffast-math. In the Fortran school, I argue you don't need any of those. I just talked with a few colleagues about this, and one thing the Fortran school holds is that -ffast-math will not change the result of your program (beyond perhaps the 1e-15 to 1e-14 level) as long as your algorithm is well conditioned. In other words, if your algorithm depends on which direction you sum up an array (Kahan summation), then it's badly conditioned, and yes, then things can break. But the point is that we deal with numerical discretizations that introduce an error, and the goal is to keep the errors coming from floating point below the discretization level. They can't be allowed to creep up to, say, the 1e-6 level, but as long as they stay around 1e-14 or so, we don't care whether IEEE would improve them from 1e-14 to 1e-15. That is the key point. If our algorithm is badly conditioned, or the input or some right-hand side or boundary condition is so badly conditioned that it matters how you sum it, then we are already screwed and have much bigger problems than the order of summation.

Regarding your point about subroutines --- if your algorithm truly depends on add_error and it is performance critical so that a function call is a problem, then you have to write the whole algorithm in the IEEE mode. But let's not talk abstractly, give me an example. That's why I asked about the exp2 example, whether it is such an example. I.e. do you believe that if I write exp2 without IEEE (whether in C or Fortran, whatever ends up faster) and using -ffast-math, that it will be slower than the current exp2 implementation in openlibm?

On the flip side, there is nothing -ffast-math can optimize that you can't also do in IEEE mode by manually applying the same simplifying rules and other compiler annotations.

If you believe this, then you must conclude that it is not possible to write exp2 with -ffast-math to be faster than your IEEE version, otherwise you can trivially improve the IEEE version to be as fast.

StefanKarpinski commented 6 years ago

There is a different hypothetical universe where your position would make a lot of sense: one in which hardware did not implement IEEE floating-point arithmetic and instead had to emulate it somehow (in the extreme, as @yuyichao suggested, with integers). In that case, letting the hardware "do its thing" would be much faster (and possibly more accurate).

But that's not the world we live in. In this reality, hardware implements IEEE floating-point arithmetic, so there's nothing the hardware can even compute that a compiler in IEEE mode cannot express. Are there still cases where a programmer has not algebraically simplified their expressions as much as they could? Certainly. In those cases -ffast-math might indeed speed up that code. But I'd rather assume that the programmer knows what they're doing and take their code at its word.

certik commented 6 years ago

@StefanKarpinski the SIMD in the pi example is one such case, the one that started this thread. You said you are against Fortran allowing the use of -ffast-math. But if what you just wrote is true, why would you be against it? You literally just said you can keep not using -ffast-math and still be able to improve the code to match the performance of, say, Fortran with -ffast-math. So let's allow Fortran its -ffast-math, and then let's improve Julia to keep IEEE, with no -ffast-math, but deliver the same performance. As you just said, you would rather do that. ;)

simonbyrne commented 6 years ago

do you believe that if I write exp2 without IEEE (whether in C or Fortran, whatever ends up faster) and using -ffast-math, that it will be slower than the current exp2 implementation in openlibm?

I would be skeptical that you would get a speedup and retain the same accuracy. Did you benchmark your changes?

StefanKarpinski commented 6 years ago

I just talked with a few colleagues about this, and one thing the Fortran school holds is that -ffast-math will not change the result of your program (beyond perhaps the 1e-15 to 1e-14 level) as long as your algorithm is well conditioned.

There's no way to be sure that you will only get 1e-14 error. If you only happen to get small errors then you are lucky. There are multiple examples in this thread of just a couple of instructions where -ffast-math gives an arbitrarily wrong answer.

In other words, if your algorithm depends on which direction you sum up an array (Kahan summation), then it's badly conditioned, and yes, then things can break.

My example had nothing to do with the algorithm – the algorithm in question was just left-to-right summation of a set of numbers. The data, however, was such that the order matters. Kahan summation gives a completely accurate answer when adding those same numbers. So your position is that left-to-right summation is not an acceptable algorithm? (But Kahan summation, which cannot be implemented in your school of thought, is an acceptable algorithm because it's far less sensitive to input data order. That's a bit ironic.)

If you believe this, then you must conclude that it is not possible to write exp2 with -ffast-math to be faster than your IEEE version, otherwise you can trivially improve the IEEE version to be as fast.

Yes, that is the case. -ffast-math isn't magic – any optimization it does can also be done in IEEE mode by changing the code.

@StefanKarpinski the SIMD in the pi example is one such case, the one that started this thread. You said you are against Fortran allowing the use of -ffast-math. But if what you just wrote is true, why would you be against it? You literally just said you can keep not using -ffast-math and still be able to improve the code to match the performance of, say, Fortran with -ffast-math.

A targeted annotation like @simd in Julia is much more limited and therefore much more acceptable, although it is still technically computing a different result (which is why we don't use it in the Julia code – to keep the cross language comparison fair). Rather than blindly giving the compiler license to do any number of incorrect simplifications, the @simd annotation tells the compiler that it may rearrange the dependencies between iterations of a loop in order to vectorize it. If you know that you don't care about that change in meaning then this is a fine thing to do. Does Fortran support limited vectorization annotations like this or is it all or nothing?
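
For reference, the annotation looks like this in Julia (a generic reduction loop for illustration, not the benchmark's actual code):

function sumsq_simd(v::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(v)   # allows reassociating this reduction across iterations
        @inbounds s += v[i] * v[i]
    end
    return s
end

Everything outside the annotated loop keeps its ordinary IEEE meaning.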

certik commented 6 years ago

@simonbyrne I haven't, since it's a lot of work to do it properly, and I only want to invest the work if it could move this discussion forward.

One issue we had above was that icc can't be trusted to compile openlibm, and I have, hopefully convincingly by now, shown that indeed icc can be trusted, but one has to rewrite parts of the code not to depend on IEEE.

So now we moved on, and now we are discussing the performance of my changes and the accuracy of my changes.

@StefanKarpinski, @vtjnash, @simonbyrne So if I invest the work into this, are you willing to admit that you can give up IEEE, and retain accuracy and keep or improve the speed (if that's what the results show)? If I am unable to do that, I will happily admit that for some things you need an IEEE compiler. Ultimately the answer will be somewhere in between, but it'd be nice to see the numbers for what we are talking about, which can't be done unless we have IEEE and non-IEEE versions of exp2.

certik commented 6 years ago

Does Fortran support limited vectorization annotations like this or is it all or nothing?

As far as I know it is all or nothing --- I think some compilers allow annotations which might enable it, but I have never used them.

certik commented 6 years ago

So your position is that left-to-right summation is not an acceptable algorithm?

So far I think it's not an acceptable algorithm. However, I admit this was a bit surprising to me when I realized that today. But that's where the logical conclusion of the Fortran school of thought seems to lead. I am not 100% sure if I didn't make a mistake in reasoning here. If you see a hole, let me know.

yuyichao commented 6 years ago

One issue we had above was that icc can't be trusted to compile openlibm, and I have, hopefully convincingly by now, shown that indeed icc can be trusted, but one has to rewrite parts of the code not to depend on IEEE.

No. As I've shown above, fast-math is by construction unpredictable. What you are demonstrating is that you can write code that is complex enough that the compiler currently can't reason about it. There's no guarantee that it'll work in the future.

yuyichao commented 6 years ago

What you've also shown is that you can write (suboptimal) code that gives the right result both according to the IEEE spec and according to math. Sadly, fast-math is neither of those.

StefanKarpinski commented 6 years ago

Turning -ffast-math or equivalent on in various languages would be an interesting additional data point, but we're not going to switch the main benchmark implementations. If you're interested in experimenting with exp2 you're welcome to explore it, of course, but I figured I should make that clear before you spend any time on it.

certik commented 6 years ago

@StefanKarpinski I do not insist you change your benchmarks. I do ask, however, that if a non-IEEE version of exp2 that I am personally happy with, and that holds up under scrutiny, proves to be as fast or faster (i.e. not slower) and works robustly with -ffast-math (i.e. today and in the future), you will admit that -ffast-math is a legitimate tool and that the Fortran school of thought is a legitimate way of thinking about numerical computing. That of course does not mean it is the only way of thinking. IEEE thinking is the other one.

yuyichao commented 6 years ago

works robustly with -ffast-math

AFAICT there's no such thing

certik commented 6 years ago

AFAICT there's no such thing

@StefanKarpinski, @yuyichao well, if there is no evidence that would convince you, then of course you save me lots of time. :) It's up to you to decide what evidence you want to see, and I can then decide whether I can deliver it or not.

yuyichao commented 6 years ago

I can be convinced by a fixed set of guarantees given by a fast-math mode. AFAICT there are none, and the transformations that the compiler can do are implementation-defined and constantly increasing.

mbauman commented 6 years ago

PRs for valid optimizations are always welcome. The key is that any re-arrangement the compiler does with -ffast-math can also be explicitly opted into on a piecemeal basis by the programmer in a "precise" mode. So you don't even need to go whole-hog with -ffast-math. You can just incrementally start applying those optimizations.

That's the power of deterministically following one set of rules.
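
As a sketch of what that looks like (Julia, hypothetical function name): splitting a reduction into several accumulators by hand is the same reassociation a fast-math compiler would apply to vectorize it, except that written this way the result is fully specified.

function sum4(v::Vector{Float64})
    s1 = s2 = s3 = s4 = 0.0
    n = length(v) - length(v) % 4
    for i in 1:4:n                      # four independent partial sums
        @inbounds begin
            s1 += v[i];   s2 += v[i+1]
            s3 += v[i+2]; s4 += v[i+3]
        end
    end
    s = (s1 + s2) + (s3 + s4)
    for i in n+1:length(v)              # remainder, in order
        @inbounds s += v[i]
    end
    return s
end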

simonbyrne commented 6 years ago

I remain skeptical: openlibm, while quite old and probably not optimal (Intel and Apple math libraries are often faster), is still very efficiently-written code and has very well-established accuracy bounds. But I would be excited to see what you can achieve.

certik commented 6 years ago

@yuyichao what kind of guarantees do you want? Speed, accuracy, etc.? I can guarantee the accuracy with -ffast-math, since the exp2 algorithm seems well conditioned, and so the errors potentially caused by -ffast-math will stay in the last few significant digits; they will not creep up to the 1e-4 level, as they did with exp2 originally, which was caused by code that is incorrect in the Fortran school of thought (the code is valid in the IEEE school of thought). That is my claim.

yuyichao commented 6 years ago

what kind of guarantees do you want

Proof that the compiler will not (in previous or future versions) do any transformations that give deviations larger than what you claim, on all input values including corner cases. Note that since fast-math is allowed to do non-instruction-local transformations, this proof has to apply to the whole function.

simonbyrne commented 6 years ago

I would want comparable accuracy to openlibm, i.e. to within < 1 ulp (i.e. rounded to one of the two closest floating point numbers). If you were willing to ignore the last few significant digits, then of course you could go much faster, with or without fast-math.
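
For concreteness, the accuracy claim can be spot-checked against a higher-precision reference; a sketch (ulps and exp2_candidate are hypothetical names, with BigFloat standing in for the "true" value):

# Error of a candidate exp2 at x, in ulps of the (approximately) correctly rounded result.
function ulps(exp2_candidate, x::Float64)
    ref = Float64(exp2(big(x)))                 # round the BigFloat reference to Float64
    return abs(exp2_candidate(x) - ref) / eps(ref)
end

# e.g. maximum(x -> ulps(exp2, x), 20 .* rand(10^5) .- 10)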

yuyichao commented 6 years ago

And I'll repeat what I said in the openlibm PR: changes making the code more compatible with icc without any negative performance impact (and not just hacks with compiler version checks) are certainly acceptable, but that's still a totally different issue from claiming the code is fast-math compatible. For that, there has to be a precise enough definition of fast-math mode, precise enough to give guarantees about possible outputs.

StefanKarpinski commented 6 years ago

So your position is that left-to-right summation is not an acceptable algorithm?

So far I think it's not an acceptable algorithm. However, I admit this was a bit surprising to me when I realized that today. But that's where the logical conclusion of the Fortran school of thought leads.

That is surprising, isn't it? So let's do a different and simpler challenge before reimplementing exp2:

In other words, try to find a summation algorithm whose results are always between (1-1e-14)*s and (1+1e-14)*s where s is the true sum, regardless of how a -ffast-math compiler decides to implement it. To keep it simple, let's say it only needs to handle data where the true sum of the absolute values is less than the max finite double (≈ 1.8e308).
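
Written as a check, with s being the true sum (computed, say, in higher precision):

acceptable(result, s) = abs(result - s) <= 1e-14 * abs(s)   # i.e. result lies between (1-1e-14)*s and (1+1e-14)*s for positive s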

certik commented 6 years ago

@simonbyrne -ffast-math can make mistakes in the polynomial evaluation for exp2. But that polynomial evaluation does not need to get all the significant digits right, because it gets added to a much larger (in magnitude) number that is loaded exactly from a table, at least that's my understanding. So overall, there might be no changes at all; in fact that is precisely what I've seen when I ran the whole test suite. So to guarantee accuracy, one would have to figure out what accuracy you need the polynomial evaluation to be at. And then what possible accuracy you might lose by rewriting it by the compiler to a different order of execution.
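
A toy illustration of that structure (the numbers below are made-up stand-ins, not openlibm's actual table entries or coefficients): the final step adds a small polynomial correction to a much larger table value, so even a full ulp of error in the correction barely moves the result.

tab  = 1.0905077326652577    # stand-in for a table entry of order 1
poly = 3.1e-3                # stand-in for the small polynomial correction p(r)

# A one-ulp error in the correction term, measured in ulps of the final result:
eps(tab * poly) / eps(tab + tab * poly)   # ≈ 0.002, far below one ulp of the result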

certik commented 6 years ago

@StefanKarpinski the acceptable summation algorithm, by my definition, is: any summation you want, as long as it does not depend on the order. So left-to-right is acceptable only if right-to-left or any other order also works. So it depends on the data. You can't feed it your data from sumsto above.

In the case of the exp2 function, for example, it must work for any argument in the defined range of values for which the function is supported.

StefanKarpinski commented 6 years ago

Are you saying that in your philosophy there are no acceptable general-purpose summation algorithms?

simonbyrne commented 6 years ago

So to guarantee accuracy, one would have to figure out what accuracy you need the polynomial evaluation to be at. And then what possible accuracy you might lose by rewriting it by the compiler to a different order of execution.

It can be done, though it's fairly tedious. See chapter 5 of Nick Higham's book.

yuyichao commented 6 years ago

what possible accuracy you might lose by rewriting it by the compiler to a different order of execution.

And then you need to prove this is the only thing the compiler can do.

certik commented 6 years ago

@StefanKarpinski the summation code that we use is typically distributed over MPI, and it must work for any number of processors, so it obviously gets things out of order. And it must always give you the same answer to machine accuracy. So if you feed it your sumsto data, you'll get garbage out. But if you feed it our simulation data (the summation is used during the simulation), then it will work as expected. I am not quite sure I understand your question. My answer is that the sum of an array cannot depend on the order you sum it in, otherwise our code will not work.

simonbyrne commented 6 years ago

I should also say: the obvious way to speed up computation of polynomials is to use fmas wherever possible on supported architectures. Although not considered an IEEE-compatible transformation, it actually increases accuracy (as you avoid more intermediate rounding steps).

Unfortunately, doing this is somewhat difficult in C. Some compilers provide a fp-contract option, but you typically don't want to combine all multiply-adds (e.g. it can cause problems with complex multiplication).

Julia provides a muladd function which automatically uses the optimal approach: this is part of the reason we're moving the math functions to pure Julia code.
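
For illustration, this is what muladd looks like in a polynomial kernel (the coefficients here are made up, not the openlibm ones):

# Horner evaluation: each muladd may lower to a single fma instruction where the
# hardware has one, and falls back to a separate multiply and add elsewhere.
horner4(x, c0, c1, c2, c3) = muladd(x, muladd(x, muladd(x, c3, c2), c1), c0)

# horner4(0.1, 1.0, 0.5, 0.25, 0.125) computes c0 + c1*x + c2*x^2 + c3*x^3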

certik commented 6 years ago

@simonbyrne, @yuyichao for exp2 you are really pushing the accuracy, so I don't want to promise something that cannot be done. But for my other numerical codes, I can guarantee an accuracy, say 1e-8, with or without -ffast-math. I can certainly guarantee a certain accuracy for the polynomial part. It might be slightly worse than with IEEE, but it will be good. The question is whether that is enough to deliver the 1-ulp total accuracy. It seems it is, based on my limited experiments.

I think the advantage of IEEE is that you don't need such a proof. All you have to do is write the code, run it for all double precision numbers, note the maximum error, and you know that it won't change in the future. Am I right?

While without IEEE, you actually have to numerically show that it will give the accuracy. Not just run it once, since a different compiler might get slightly different errors, and then instead of 1ulp, you might get 2ulp.

I admit I don't have experience keeping things under 1 ulp. For me, 1e-14 for such code was enough in the past. At the 1e-14 level or so, -ffast-math doesn't seem to matter for accuracy, based on my past experience. So it's a bit of a gamble on my part, I'll admit that.

For the pi example that started this thread, however, I don't think I care at all if the last digit or two are wrong. I just want the performance. At least that's how I interpreted the benchmark. I think @StefanKarpinski wants all the digits to stay predictably the same (but if you introduce SIMD, the last digit must change, must it not? Though perhaps again in a way that is predictable in the future; perhaps that's the point). And that's perhaps the difference here, and why we need both IEEE and Fortran-school benchmarks.

yuyichao commented 6 years ago

While without IEEE, you actually have to numerically show that it will give the accuracy. Not just run it once, since a different compiler might get slightly different errors, and then instead of 1ulp, you might get 2ulp.

More than that: as I've said, fast-math itself is basically ill defined, and its definition changes as compilers learn more math.

yuyichao commented 6 years ago

Some compilers provide a fp-contract option

GCC actually has that on by default. It might make gcc not standard-compliant (by default) anymore, but to a qualitatively lesser degree, since the transformation allowed by it is well defined and in practice few algorithms seem to be affected negatively.

StefanKarpinski commented 6 years ago

Distributed is irrelevant. The data in my summation example is admittedly weird, but those values were just chosen for a party trick. Here is some perfectly reasonable data:

julia> v = sort!(rand(2^20));

julia> foldl(+, v) - foldr(+, v)
4.7497451305389404e-8

Even with a very modest amount of uniform random data, the left-to-right and right-to-left summations give answers that differ by ≈ 5e-8. So by your definition these are therefore both unacceptable algorithms and cannot be used.

Unfortunately, in your philosophy these are the only algorithms. This follows from the transformations -ffast-math is allowed to do: any summation algorithm that just adds all the values in some order is equivalent to left-to-right summation since the compiler could choose to add them in that particular order. Fancier algorithms like Kahan summation (which might save you in IEEE mode) are also equivalent to left-to-right summation since the error computation is always algebraically zero and can therefore be removed by -ffast-math rules. So left-to-right summation is the only summation algorithm in your philosophy and it is an unacceptable one.

The conclusion is that in your philosophy, there are no acceptable algorithms for something as basic as summing a set of numbers, let alone more complicated problems. That seems like a fundamental problem with the philosophy, which might need reconsideration.

JeffBezanson commented 6 years ago

basically ill defined, and its definition changes as compilers learn more math

This. To me the issue is "what does the program mean?" Mathematically, ((x+y)-x)-y "means" zero but there are an unlimited number of equivalences of that kind, and there is no limit to what you might have to know to understand a program in that way.
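
For instance, in Julia (values chosen so the cancellation is visible):

julia> x, y = 1e16, 1.0;

julia> ((x + y) - x) - y    # "means" zero mathematically, but not in Float64
-1.0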

Ultimately, IMO, making operations less-well-specified cannot help performance. The machine still has to do something in particular, and if that thing is (1) desired for performance, and (2) gives an acceptable answer, you might as well just write it in your program and be able to count on it (rather than risk it going away or changing in a future compiler version).

yuyichao commented 6 years ago

I'll also add that "anything fast-math can do can be done manually" is not strictly true in every environment. AFAICT it fails in a few cases:

  1. The source language doesn't have an easy way to express the code the compiler can generate.

    This includes SIMD and fma, but languages are catching up, so this is much less of an issue and can be dealt with using more precise annotations instead (like @simd or muladd in Julia, or fp-contract in C; LLVM 6.0 will also improve on this for fast-math flags).

  2. Using different code that gives similar results in order to exploit hardware features.

    This was likely a major concern before IEEE was widely adopted by hardware vendors, but since every piece of hardware we actually care about these days implements IEEE-compliant fp ops, and since they tend to have similar cost models (certainly not always the case, but basically no one has figured out a way to do any operation in hardware significantly faster than the others), there aren't many instructions to pick from that can actually produce different results faster. (Or in other words, people aren't adding new fp instructions that produce different results than existing ones.) The only exception I've seen in this case is actually fma (again), so fp-contract or muladd will handle it.

These are mainly problems for C, if one doesn't want to rely on compiler-specific features. It's not as much of an issue on modern hardware and in Julia, since we basically have features to handle all of those. Even in C, these limitations can usually be overcome using compiler-specific flags that apply them in a more limited scope than fast-math.

certik commented 6 years ago

Ultimately, IMO, making operations less-well-specified cannot help performance.

If that were true, then I think it logically follows that -ffast-math cannot help performance, because the only thing it is doing is making operations less-well-specified (i.e., breaking IEEE) and then taking advantage of that. So I think your statement might be true in theory, but not in practice.

certik commented 6 years ago

Anyway, I am disappointed that none of you except @simonbyrne stepped up to the challenge. So for @simonbyrne, here are the results:

https://github.com/JuliaLang/openlibm/pull/171

The accuracy did not get worse with -ffast-math (in fact it got better, but that might not be statistically significant), the original code is now 1.5x slower, and I didn't do much: I just turned on -ffast-math and made the code IEEE-independent.

Anyway, you are all welcome to poke through my work, scrutinize it, reproduce it, etc.

But it proved everything I claimed in this thread: that IEEE compliance costs you significant performance in practice, that the accuracy did not get worse in practice, and that the code can be made IEEE-independent in practice. So this is an example of the Fortran school of thought: you get significantly better performance while keeping accuracy, and the code does not depend on IEEE conventions.

But perhaps I made a mistake somewhere, go ahead and check my work. I also haven't tried the Intel compiler, just gcc.

StefanKarpinski commented 6 years ago

So you are completely ok with the fact that it's impossible to write what you deem an acceptable algorithm for summation in this school of thought? That seems like a fairly fundamental problem with a philosophy of numerical programming.

certik commented 6 years ago

You seem to misunderstand what I wrote. Any summation algorithm that does not depend on the order of operations is acceptable, as I wrote above.

On Thu, Nov 16, 2017, at 06:19 AM, Stefan Karpinski wrote:

You are completely ok with the fact that it's impossible to write what you deem an acceptable algorithm for summation in this school of thought? That seems like a fairly fundamental problem with a philosophy of numerical programming. Without a way to add up a bunch of numbers in a way that's acceptable by its own standards, I can't take this approach seriously.

StefanKarpinski commented 6 years ago

Any summation algorithm that does not depend on the order of operations is acceptable, as I wrote above.

There are no such algorithms – at least none that can be expressed in -ffast-math mode. The point is that this "philosophy" is fundamentally broken. All it amounts to is "let the compiler do what it wants, run it on some inputs, see if it looks like it's working and call it a day." In other words, there is no actual philosophy. If that works for you and gives you the performance you want, great – carry on. But don't be deluded that this is a coherent, reliable school of thought in numerical computing. It's a holdover from a dark time when numerical computing was inherently non-portable, hardware and compilers did whatever they felt like, and there was nothing better you could even hope for.

Feel free to continue operating in that world for yourself, but this is no way to write libraries that other people rely on for their computations to be correct. You may have sped up some old crufty exp2 code and run some smoke tests to make sure the new version is not completely broken, but unfortunately, without IEEE compliance, that's not good enough – you need a proof that none of the optimizations -ffast-math allows can possibly lead to incorrect results. Otherwise it might be giving wrong answers for some values right now – Float64 is too big to check exhaustively. Or it might not. But it could start doing so at any time in the future when we upgrade compilers.

certik commented 6 years ago

@StefanKarpinski I missed your comment about summation https://github.com/JuliaLang/julia/issues/24568#issuecomment-344768416 above, I just read it now, my apologies. There is nothing wrong with that example: you sum over a million random, well conditioned numbers, and you get a relative accuracy of 2.95e-14, which is almost 14 significant digits, and that is perfectly fine given that double precision floating point can only represent ~15 or so digits. The mistake you made is looking at the absolute difference, which was only 1e-8, but that's because the sum is around 524231.03169143666. So you have 6 significant digits before the decimal point, plus the 8 digits after the point, for a total of 6+8=14. So there is no loss of accuracy, and it does not matter if you sum it from the left or right. Which is what I am saying --- you need to have a well conditioned input (algorithm), and then it does not matter how you sum it, whether left to right, or right to left, or distributed. By my definition, the algorithms would be unacceptable if they differed at the 1e-8 level of relative accuracy, i.e. losing 7 or so significant digits in the process. That would be unacceptable. But in this case, it only lost 1 or 2 digits, so there is no problem.

According to Wikipedia, Julia uses pairwise summation, which is an acceptable algorithm, because you can change the order of operations in the individual for loops. On your example, both sum and sum_kbn give exactly the same answer (524231.03169144236), and it also differs at the 1e-14 relative-error level from the left and right summations.
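
For reference, a minimal sketch of pairwise summation (not Base's actual sum code): the recursion fixes the overall tree shape, and the order inside each small base-case block is the part that can be shuffled without affecting the error bound much.

function pairwise_sum(v, lo=1, hi=length(v), blocksize=128)
    if hi - lo + 1 <= blocksize
        s = zero(eltype(v))
        for i in lo:hi              # plain loop over a small block
            s += v[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1       # split, sum the halves, combine
        return pairwise_sum(v, lo, mid, blocksize) +
               pairwise_sum(v, mid + 1, hi, blocksize)
    end
end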

Anyway, I am sorry I missed your comment, and as a result you concluded that I am a crank. I wish you gave me more benefit of the doubt.

However look into section 2.7 of the BLAS standard:

With a few exceptions that are explicitly described below, no particular computational order is mandated by the function specifications. In other words, any algorithm that produces results “close enough” to the usual algorithms presented in a standard book on matrix computations [13, 4, 14] is acceptable.

So BLAS does not mandate a particular computational order, precisely so that things can be vectorized or parallelized.

Is BLAS not a coherent, reliable school of thought in numerical computing? Come on.

Here is a list of claims that you and others have made in this thread:

So I took the time and showed, on actual code, that in practice all these claims are inconsistent with the particular example of exp2. I didn't choose exp2 because it's "some old crufty code", as you claim. I chose it because it was in the library of your choice, and because it was an example subroutine that failed with icc. I expected your reaction to be to start poking at what I did, as @simonbyrne did (thank you!), and to see whether this was just this particular example or whether it applies more widely (as I claim, since that has been my experience). Instead, you decided to dismiss what I did as "some old crufty code", and then changed the subject of our discussion from the particular claims you made about -ffast-math (see the bullet point list above) to a discussion about whether -ffast-math should be allowed at all, and added some ad hominem accusations about some kind of world that I am supposed to be operating in as a lonely crank. All I am asking is that before you write about me in such a way, you first understand what I am saying. Especially since you made a pretty trivial mistake in your summation comment --- and that is ok, I have made mistakes too, in this thread and elsewhere --- and from that you concluded that I am a crank.

yuyichao commented 6 years ago

So I took the time and showed, on actual code, that in practice all these claims are inconsistent with the particular example of exp2.

Those claims are not accurate, and the accurate versions are not proven wrong by what you did to exp2.

-ffast-math reduces accuracy arbitrarily (no bound)

This is still correct. What you've shown is that you can write code in fast-math mode, on some implementations, that gives you the result you want. That says absolutely nothing about what fast-math can do in general. I'll repeat that you have to do that starting by giving a precise enough definition of fast-math. Just testing precision for a particular function will never be accepted as a proof.

libraries like openlibm cannot be safely compiled with icc (with default options)

"safely" is a vague term. If you use it to mean that you can use a particular version of icc to compile it and give reasonable result, I have no doubt about it. But safely should mean reliable in this context and it means that you have to show that no past or future version of the compiler can ever produce wrong results. Again, testing one implementation is not enough, no matter how many input values you use.

if you put these (i.e., IEEE specific) operations behind function calls, then they will become much slower => it will be slower than the original

I notice you use LTO there, so that's not relevant anymore.

IEEE does not incur a performance hit for checking inf/nan

I don't see why it's related at all for what you've shown.

yuyichao commented 6 years ago

Just testing precision for a particular function will never be accepted as a proof.

I'll add that this is only true for fast-math. Testing the precision of a function following IEEE semantics will always be accepted as a proof, since the output is predictable and not arbitrary.