JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.54k stars 5.47k forks source link

NaN vs wild (or, what's a DomainError, really?) #5234

Open jiahao opened 10 years ago

jiahao commented 10 years ago

(Context: this issue has come up recently in #4967 and elsewhere)

We currently don't treat the results of indeterminate computations consistently between real and complex arithmetic.

Example 1: cosine

julia> cos(Inf)
ERROR: DomainError
 in cos at math.jl:277

julia> cos(complex(Inf,0))
NaN - 0.0im

Example 2: inverse

inverse of 0

julia> inv(0.0)
Inf

julia> inv(complex(0.0, 0.0))
NaN + NaN*im

This illustrates a problem with infinity and signed zero. Unlike real infinities, of which there are only two FloatingPoint values, you can have 13 possible representations of complex infinities which convey different information about the phase (argument) of the complex infinity:

complex(+Inf, +0.0) #phase is exactly 0 or approaches 0 from the first quadrant
complex(+Inf, +Inf) #phase is in (0, pi/2), i.e. the first quadrant excluding the edges
complex(+0.0, +Inf) #phase is exactly pi/2 or approaches pi/2 from the first quadrant
complex(-0.0, +Inf) #phase approaches pi/2 from the second quadrant
complex(-Inf, +Inf) #phase is in (pi/2, pi), i.e. the second quadrant excluding the edges
complex(-Inf, +0.0) #phase is exactly pi or approaches pi from the second quadrant
complex(-Inf, -0.0) #phase approaches pi from the third quadrant
complex(-Inf, -Inf) #phase is in (pi, 3pi/2), i.e. the third quadrant excluding the edges
complex(-0.0, -Inf) #phase approaches 3pi/2 from the third quadrant
complex(+0.0, -Inf) #phase is exactly 3pi/2 or approaches 3pi/2 from the fourth quadrant
complex(+Inf, -Inf) #phase is in (3pi/2, 2pi), i.e. the fourth quadrant excluding the edges
complex(+Inf, -0.0) #phase approaches 2pi from the fourth quadrant
complex(NaN, NaN) #phase cannot be determined to lie in exactly one of the above regions
#                  and hence the infinity has no valid non-NaN representation in floating point

The result is correct if we work with unsigned zeros. However, after accounting for the signed zeros in complex(+0.0, +0.0), this result should be complex(+Inf, +Inf). A DivideError might also be a reasonable alternative here.

inverse of infinities

The inverse mapping is also problematic:

julia> inv(Inf)
0.0

julia> inv(complex(Inf,0))
0.0 - 0.0im

julia> inv(complex(Inf,Inf))
NaN + NaN*im

inverse of indeterminates

julia> inv(NaN)
NaN

julia> inv(complex(0,NaN))
NaN + NaN*im

julia> inv(complex(NaN, 0))
NaN + NaN*im

julia> inv(complex(NaN, NaN))
NaN + NaN*im

Is it meaningful to distinguish between these three possible complex NaNs? It seems silly in this example, but consider also:

julia> complex(0, NaN) + complex(0, NaN)
complex(0.0,NaN)

julia> complex(0, NaN) * complex(0, NaN)
NaN + NaN*im

etc.

Example 3: roots (nonintegral powers)

julia> sqrt(-1)
ERROR: DomainError

julia> sqrt(complex(-1))
0.0 + 1.0im

julia> (-1)^(-1/2)
NaN

julia> complex(-1)^(-1/2)
6.123233995736766e-17 - 1.0im

Whereas sqrt(-1) returns the notorious DomainError, the inverse square root (as computed by x->x^-1/2 does not, but returns a NaN instead. This use of NaN is sanctioned by IEEE 754 in the specific case of a real operation with no real output.

tl;dr: if a floating-point computation returns an indeterminate value, when should it return a NaN (or any of its complex variants), and when should it throw an error like DivideError or DomainError? By my reading of IEEE 754, both of these behaviors are allowed (throwing an error would correspond to a signaling NaN which is trapped by the error handler). Each of these examples in isolation show valid behavior; however, we should be consistent.

StefanKarpinski commented 10 years ago

I think we should throw an error whenever the condition was explicitly checked for anyway.

jiahao commented 10 years ago

To summarize the prevailing thinking with @StefanKarpinski @loladiro @JeffBezanson:

Operations whose results would produce NaNs should throw Exceptions if the cost of testing a NaN would not dominate the cost of the operation.

By this rule of thumb, elementary arithmetic on real floating-point numbers would be allowed to return NaNs if necessary, but more complicated operations like complex division should throw a DivideError.

ivarne commented 10 years ago

I don't understand. What operations should be allowed to return NaN, (unless the argument is NaN)? Only /(a::FloatingPoint, b::FloatingPoint)?

jiahao commented 10 years ago

I think we only have a clear idea that real elementary arithmetic (+, -, x, /) should be allowed to produce and propagate NaNs, since the cost of testing for a NaN or inputs that would produce NaNs would be comparable (or more expensive) than the actual operation itself, but not much else.

ivarne commented 10 years ago

So all other functions (also those who does not return NaN for any real input) must check for NaN on input and throw DomainError?

JeffBezanson commented 10 years ago

No, NaN input can just propagate. We mostly don't want to produce NaNs from non-NaNs.

StefanKarpinski commented 10 years ago

The poster child is complex division: rather than generating a complex NaN when dividing by zero, just raise an error. Since complex division is so, um, complex, checking for a zero divisor is not exactly going to affect it's performance in any meaningful way. Raising and error is simpler and alerts the user to the problem sooner.

nalimilan commented 10 years ago

Why not, but if that's the only operation which would be affected, is it worth introducing an inconsistency? It would be good to have a simple rule for which functions return NaN, and which throw an error (I mean, an user-understandable rule, not something based on computing costs).

JeffBezanson commented 10 years ago

My guess at the moment is that the rule is: only real floating-point +, -, *, / can produce NaNs from non-NaNs.

JeffBezanson commented 10 years ago

I would also add complex floating-point +, -, *, / to that list. The 1/complex(0.0,0.0) case is a divide-by-zero error, so you can still get NaNs from complex / in other ways.

simonbyrne commented 9 years ago

As outlined in this discussion, a more systematic solution would be to take advantage of floating-point exception masking. This would allow users to turn exceptions on and off as required (e.g. have exceptions on for development/debugging, and off for production code), reduce the overhead (by fewer range checks) and generally make functions more vectorisation friendly.

elextr commented 9 years ago

As noted in the discussion linked above, it becomes necessary to turn off hardware floating point exceptions in some cases:

  1. calling libraries that take the valid design decision to allow NaNs to appear and then propagate and to test for them at the end
  2. calling libraries that are not expecting exceptions
  3. calling libraries that are written in non-exception aware languages (C, Fortran etc)
  4. preventing exceptions escaping Julia code that is called from languages that are not exception aware

So it becomes rather difficult to decide when to allow floating point (and by extension complex) exceptions and when to disallow them, and the one piece of Julia code may be used in many situations.

simonbyrne commented 9 years ago

How about the following proposal:

I feel that this should address the above concerns of @elextr, as well as address the needs of @eschnett and @mlubin mentioned on the mailing list discussion, and generally make us much closer to getting the proverbial Kahan tick of approval.

Possible issues:

  1. Users will not be able to reliably catch InvalidException (unless, of course, they explicitly disable masking beforehand).
  2. There doesn't appear to be a set way to trigger an InvalidException. Just using 0.0/0.0 doesn't work, as LLVM changes it to a constant NaN. The openlibm trick of using return (x-x)/0.0, where x is the argument to the function, seems to work though: this should always raise an invalid exception unless x is already a NaN (which is the usual desired behaviour).
  3. LLVM will sometimes rearrange operations involving MXCSR access incorrectly (see LLVM bug #6393). This shouldn't be a big issue for the above, as most changes will occur either globally (invoked by the user in the REPL), or around ccalls (for which LLVM should keep the correct order). But it does mean that you won't be able to enable it reliably for short snippets of Julia code.
    • Given the lack of activity on this issue, it doesn't seem like something LLVM are rushing to address. Perhaps this could be a good GSOC project for a student interested in compilers?
elextr commented 9 years ago

@simonbyrne looks a good basic proposal but for the notes below.

In a Julia session, the invalid exception is unmasked (i.e. 0.0/0.0 throws InvalidException) by default, but all others are masked (1.0/0.0 returns an Inf). We provide a command line argument (--mask-invalid?) to disable this behaviour.

Thats ok, but it forces all ccalls to mask it off. So a default of all (or most?) exceptions on is not going to be any worse and that will provide the benefits to pure Julia code. Otherwise all off.

We can provide wrappers (ccallmask?) to enable masking for the duration of the ccall.

If at least one exception is going to be enabled by default (as above) then it is probably better that the default ccall turns them all off and restores the mask and ccallmask is used for the exceptional circumstances where it is known safe to leave some on. Turning all exceptions off is going to be needed for nearly all ccalls and doing it manually is a subtlety that shouldn't be required of newcomers to Julia. Note this will also need to be applied to all ccalls inside Julia, and its libraries and packages where it can't be proven that the code ccalled is exception safe or cannot generate exceptions.

Julia code embedded in other languages, or called via cfunction should use the floating point masks of the caller.

Functions and libraries should not change floating point masks without resetting them to their original setting.

I would just add "and must catch all exceptions" to both of these.

simonbyrne commented 9 years ago

I would disagree that ccall should mask by default, as it does add to the overhead of calling external functions.

elextr commented 9 years ago

Thinking about it some more (and at this point I am talking about Julia code only, the issues with foreign code remain as above) if the FPE masks can be changed:

So to avoid failures all Julia FP code is going to have to do one of:

None of these is particularly attractive, the first is very noisy and the other two are exceptionally error prone. At this point I admit I don't have a "nice" solution.

StefanKarpinski commented 9 years ago

I fear that this would work about as well as dynamically scoped rounding modes – i.e. not well at all :-\

simonbyrne commented 9 years ago

@elextr I don't think this should be an issue: the way I see it, the whole point of raising floating point exceptions is so that you can get rid of whatever is causing them, you don't actually want to try/catch them (which is why I don't think my issue 1 above is such a problem). So a function should not expect an exception to be raised (because the function will simply stop anyway if one is raised), and should always accept NaN as an argument.

Basically, masks should only be changed for two reasons:

I would go so far as to say that packages which modify masks in any other way shouldn't be accepted in METADATA unless they can provide a good justification for why they need to do it.

simonbyrne commented 9 years ago

@StefanKarpinski I guess the flip side is that I feel that DomainError is basically trying to be an InvalidException, except that it's slow, inconsistent and inflexible.

eschnett commented 9 years ago

@simonbyrne It is often very difficult to ensure ahead of time that no nans appear in a calculation. Thus in my code, I often rather accept that the result is a nan, and fix up the problem later. These calculations happen in a tight inner loop, and handling exceptions in this loop is a non-starter, as it would prevent vectorization.

I appreciate that many people don't like having to deal with nans explicitly, because in most cases, nans result from errors e.g. in the input data. However, I quite like this part of the IEEE standard, and would like a way to use nans in Julia.

I'm not saying this should be the default, or that all library functions should be nan-safe in this way -- but there needs to be a way to use nans for efficiency in an HPC environment. I'd be happy to use a macro similar to @inbounds for this. I'd also be happy to re-write Julia's standard math functions (sqrt, exp, log) myself. However, there needs to be a reasonable way for experts to do so.

mlubin commented 9 years ago

Maybe the solution is to introduce alternate versions of the built-in math functions which return NaN instead of throwing an exception?

simonbyrne commented 9 years ago

@mlubin The problem is then the functions that call the built-in ones, and the ones that call those...

I do think the IEEE exception functionality can be made to work in a sensible fashion, and that it provides our best hope for ensuring consistency and interoperability with other programs.

mlubin commented 9 years ago

@simonbyrne, any library/package functions that call log should be expecting it to throw an exception, changing that behavior with a flag seems crazy.

eschnett commented 9 years ago

Here is another case where nans are useful:

In vectorized code, branches are expensive. It is often cheaper to evaluate both branches of an if statement, in particular if one of them is very cheap.

Take as example the expression x>0 ? log(x) : C with some constant C. The result is always well-defined. However, it may be much faster to evaluate ifelse(x>0, log(x), C) instead. A nan may be generated, but will always be ignored.

mlubin commented 9 years ago

@eschnett, good point. I've also come across cases where it can be convenient to compute a log even when it may not be used later.

elextr commented 9 years ago

@simonbyrne I agree that generating exceptions is a nice way to do a class of problems. But that should not make alternative strategies impossible, or so expensive or complicated that they are effectively impossible.

the way I see it, the whole point of raising floating point exceptions is so that you can get rid of whatever is causing them, you don't actually want to try/catch them (which is why I don't think my issue 1 above is such a problem).

NaNs can be caused by data, not only by coding errors. Real world data can be noisy and that can cause any of the conditions that generate exceptions. It is often not acceptable to throw an exception and stop processing due to one messy data point.

I would go so far as to say that packages which modify masks in any other way shouldn't be accepted in METADATA unless they can provide a good justification for why they need to do it.

Which makes the callback situation untenable, https://groups.google.com/d/msg/julia-dev/EaO1VSOXx2o/ruSZr0jhpigJ won't be able to use anything in METADATA then.

simonbyrne commented 9 years ago

@elextr My whole reason for being so keen on the floating point exceptions is that they make different strategies feasible: e.g. "0/0 is an error, I should fix it", as well as "0/0 is just a weird edge case, I can ignore those cases". The main thing is that the strategy is consistent throughout the program.

With regards to the callback situation: the way I envision it is that you would use a masked ccall, so any callbacks made during this time would return NaNs instead of throwing exceptions (though of course, I could be wrong on how that would work). If you wanted to do the same thing in a Julia library instead of a C one, then perhaps that could be regarded as a good justification (as long as the library then returns the masks to their original settings).

elextr commented 9 years ago

My whole reason for being so keen on the floating point exceptions is that they make different strategies feasible: e.g. "0/0 is an error, I should fix it", as well as "0/0 is just a weird edge case, I can ignore those cases". The main thing is that the strategy is consistent throughout the program.

Agree its fine if the whole program is consistent.

What I belatedly came to realise, is that being consistent throughout the program becomes impossible once you use external libraries that may be written to the opposite strategy to yours.

As I said previously, this leads to the the situation where all functions need to set and restore the FPE mask to suit their strategy, in case they are called from code with a differing strategy. Or all code needs to know what strategy the functions they call use.

And that applies to callbacks as well. I agree that callbacks from C will probably be called with exceptions off, since the calling C code will be running with them off but callbacks from (for example) solvers written in Julia may be called with exceptions on. But whichever it is, then they call other libraries, including base, and then the paragraph above applies.

mlubin commented 9 years ago

What I belatedly came to realise, is that being consistent throughout the program becomes impossible once you use external libraries that may be written to the opposite strategy to yours.

Agreed here. The logical alternative is to provide versions of the basic functions that return NaNs. This at least allows everyone to be consistent according to the situation. I'm guessing that typically one would want a DomainError to be thrown for user friendliness, but in specialized cases, e.g., vectorization, interacting with C code, etc., switching log to log_nan (horrible name) makes more sense.

staticfloat commented 9 years ago

log�()? ;)

elextr commented 9 years ago

@mlubin that requires that the code of every library that you want to use (including the C ones) is changed to use log_nan since thats what the current behaviour of log is.

It would be better if the new behaviour has the new name that generates exceptions, eg ex_log as an example of an equally bad name :)

mlubin commented 9 years ago

Following the open-source philosophy, I've created a package which solves my problem: https://github.com/mlubin/NaNMath.jl.

simonbyrne commented 9 years ago

I had a bit of time on Friday to try out my idea: https://github.com/simonbyrne/julia/tree/fpe when combined with my mxcsr script gives some interesting results.

On Linux it seems to work fairly well, the only problem being that after throwing one error it seems re-enables the mask. Hopefully there is a sensible way to fix this. One idea would be to use the "official" feenableexcept interface, but I haven't had a chance to look at it yet.

On OS X, my experience was more frustrating: it does throw exceptions at the correct times, but it doesn't seem to return the correct exception code that indicates the type of exception (Invalid, Inexact, DivByZero, etc). Unfortunately, it seems that OS X doesn't officially support throwing floating point exceptions at all (e.g. see here), so we might have a hard time getting Apple to fix it.

If anyone has used this sort of stuff in C or C++, I'd appreciate any help or suggestions that I could try.

eschnett commented 9 years ago

@simonbyrne This example http://www-personal.umich.edu/~williams/archive/computation/fe-handling-example.c can be found by chasing the link you provided. It seems to provide an implementation of the necessary functionality for OS X, both for PPC (!) and Intel Macs.

The respective function calls translate to a few machine instructions, so we don't need any Apple support for this.

simonbyrne commented 9 years ago

Thanks @eschnett. It looks pretty similar to what I was doing, but I'll try it out. I seem to recall that the problem was the signals returned by siginfo were completely arbitrary.

StefanKarpinski commented 7 years ago

Bump.

oscardssmith commented 7 years ago

What would happen if functions had a nan mode in them, so sqrt would be sqrt(x,nan=false) with a default value? To me this seems to let libraries do what they want, let users do what they want, and not mess with code generation.