RFC: Uncatchable FatalException. Separating bugs from recoverable errors.

RFC: Uncatchable FatalException. Separating bugs from recoverable errors. #15514

Open samoconnor opened 8 years ago

samoconnor commented 8 years ago

The Midori Error Model has two types of error handling:

abandonment for programming bugs (stop the process immediately), and
a try/catch mechanism for handling recoverable errors (this has similarities with the "chain of custody" exception handling idea: https://github.com/JuliaLang/julia/issues/7026#issuecomment-181608252 ).

Combining the ideas from #7026 with Midori's throws method annotation and auditable call sites seems like a promising way forward for Julia's try/catch mechanism.

This RFC focuses on "abandonment" for handling bugs. The eventual refinement of the try/catch mechanism can be dealt with separately. The proposal here is to add a simple fail-fast bug handling mechanism that can co-exist with whatever the final try/catch design turns out to be.

The proposed approach is to:

add an abstract FatalException type,
insert if isa(ex, FatalException) rethrow(ex) at the top of every catch block (i.e. make FatalException uncatchable).

A selection of exception types could be made <: FatalException, e.g. perhaps ArgumentError, AssertionError, StackOverflowError, OutOfMemoryError, UndefVarError (Joe Duffy's blog post has a list of the error types that were treated as fatal in Midori).

This would immediately make things a little safer in all the places where existing Julia catch blocks currently catch more than was intended.
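A minimal sketch of the effect of the proposed check, written against current Julia syntax rather than the 2016 syntax used elsewhere in this thread. The names `InvariantViolation`, `ResourceBusyError` and `with_fallback` are assumptions made up for illustration; the proposal itself would insert the check into every catch block automatically instead of relying on a helper.

```julia
# Hypothetical marker type for bugs; not part of Base.
abstract type FatalException <: Exception end

# Illustrative error types (names are assumptions for this sketch).
struct InvariantViolation <: FatalException
    msg::String
end

struct ResourceBusyError <: Exception
    resource::String
end

# Run `f`; recover from ordinary exceptions with `fallback`, but let any
# FatalException propagate, as the proposed implicit check would.
function with_fallback(f, fallback)
    try
        return f()
    catch ex
        ex isa FatalException && rethrow()
        return fallback
    end
end

busy() = throw(ResourceBusyError("db"))
buggy() = throw(InvariantViolation("index out of sync"))

with_fallback(busy, :retry_later)    # recoverable: returns the fallback
# with_fallback(buggy, :retry_later) # would rethrow InvariantViolation
```

With the language-level version, the `ex isa FatalException && rethrow()` line would not appear in user code at all.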

The downside would be that the REPL would crash hard at the first FatalException. The Midori answer to this would be that the REPL should start a separate process to execute the code that the user types in.

Another approach would be to have a special catchfatal keyword for use in the REPL. There are a few other cases where catchfatal is also needed. e.g. test frameworks; a server that needs to log FatalExceptions before exiting; and the remote side of remotecall_fetch that needs to catch and serialise the FatalException.

Low-level library exception types should probably be made <: FatalException by default. e.g. UVError see https://github.com/JuliaLang/julia/issues/14972. In most cases the UVError should be translated into something meaningful like HostNotFoundError (which would not be fatal), but any raw UVError occurrences that slip between the cracks should be fatal so that they are noticed and fixed.
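The translation step described above might look roughly like this. It is a sketch only: `RawLibError` stands in for a low-level error such as UVError, and `raw_lookup`, `lookup` and the numeric code `NOT_FOUND` are invented for the example.

```julia
# Stand-in for a raw low-level error such as UVError (hypothetical type).
struct RawLibError <: Exception
    code::Int
end

# Meaningful, catchable error type.
struct HostNotFoundError <: Exception
    host::String
end

const NOT_FOUND = 1   # made-up error code for this sketch

# Fake low-level call that fails with a raw error code.
function raw_lookup(host::String)
    host == "localhost" && return "127.0.0.1"
    throw(RawLibError(host == "" ? 99 : NOT_FOUND))
end

# Translate the one anticipated raw error; anything else propagates
# unchanged so that it gets noticed and fixed.
function lookup(host::String)
    try
        return raw_lookup(host)
    catch ex
        if ex isa RawLibError && ex.code == NOT_FOUND
            throw(HostNotFoundError(host))
        end
        rethrow()
    end
end
```

Under the proposal, the untranslated `RawLibError` would be a FatalException, so callers could only ever catch `HostNotFoundError`.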

Joe Duffy says:

Abandonment, and the degree to which we used it, was in my opinion our biggest and most successful bet with the Error Model. We found bugs early and often, where they are easiest to diagnose and fix. Abandonment-based errors outnumbered recoverable errors by a ratio approaching 10:1, making checked exceptions rare and tolerable to the developer.

nalimilan commented 8 years ago

I don't think we need a special FatalException type to implement this abandonment model. Any uncaught exception already triggers "abandonment" (i.e. process exit). What we need is to prevent the caller from catching unanticipated exceptions. This is what #7026 is about.

If the only thing FatalException would change is that people would need to use catchfatal instead of catch, then it's not very useful. Better use a more general mechanism like catch PosDefException which would only allow catching this particular type of exception (maybe under more conditions, as discussed at #7026). That way, any exception will be fatal if no procedure has been written to handle it, not only FatalExceptions.

samoconnor commented 8 years ago

@nalimilan, as I see it, the benefit of the Midori model arises from having a class of errors which are not catchable at all. "catchfatal" above is mentioned only as a way to discuss the issue of "uncatchable" errors for the REPL etc.; it is not intended that normal user code would ever use catchfatal.

Perhaps a better interface would be to have a non-exported @catchfatal for the REPL etc...

The key thing is that from the ordinary user's point of view there is a class of errors that cannot be caught.

Consider a practical example: If AssertionError was made uncatchable, and a user has some existing code that relies on catching an AssertionError, then surely they are violating Stefan's rule that "catching exceptions in Julia is not considered a valid way to do control flow" (discussed in #14972). If the use case is legitimate, the error can be changed to have a specific catchable type, e.g. ResourceBusyError. The #14972 discussion suggests that catching unpredictable errors like AuthTokenExpired, or DNSUnavailable is an exception to the rule.

You could see this as a bargain struck between the tool designer who is trying to make the tool safe and efficient and the tool user who just wants to get stuff done: "catch is not the preferred way to do flow control. There are some cases where you may need to catch non-deterministic errors. If you want to do this you must define specific unambiguous catchable error types. You cannot catch non-specific errors."

JeffBezanson commented 8 years ago

I do like the idea of something like this. The idea of catching, say, an UndefVarError is crazy enough to warrant some special treatment at the language level.

I agree with @samoconnor that catching exceptions by type is not quite enough. It's too difficult to know exactly which types of exceptions to catch. Ideally, the default behavior for catch x should be "catch everything that is reasonable to catch". Joe Duffy argues convincingly that there is a good definition of "reasonable" here.

There is a strong connection here to error checks that can be optionally disabled, like bounds checks. If all disable-able checks are handled with abandonment, you can be sure correct programs will work with --check-bounds=no.

samoconnor commented 8 years ago

@JeffBezanson can you tell me if I'm on the right track for a quick-and-dirty proof-of-concept implementation of uncatchable...

I'm thinking that the condition if isa(e, FatalException) rethrow(e) could be prepended to the catch block in (emit `(enter ,catch)) on this line: https://github.com/JuliaLang/julia/blob/6b5a05eb1a029aef93f77b60bb1d745c7d6e1d8d/src/julia-syntax.scm#L3073

So, instead of this:

expand(:(try error("foo") catch e println(e) end))
:($(Expr(:thunk, AST(:($(Expr(:lambda, Any[], Any[Any[Any[:e,:Any,18]],Any[],1], :(begin
        $(Expr(:enter, 0)) # none, line 1:
        GenSym(0) = (Main.error)("foo")
        $(Expr(:leave, 1))
        return GenSym(0)
        0:
        $(Expr(:leave, 1))
        e = $(Expr(:the_exception)) # none, line 1:
        return (Main.println)(e)
    end))))))))

... you would get this:

:($(Expr(:thunk, AST(:($(Expr(:lambda, Any[], Any[Any[Any[:e,:Any,18]],Any[],1], :(begin
        $(Expr(:enter, 0)) # none, line 1:
        GenSym(0) = (Main.error)("foo")
        $(Expr(:leave, 1))
        return GenSym(0)
        0:
        $(Expr(:leave, 1))
        e = $(Expr(:the_exception)) # none, line 1:
        unless (Main.isa)(e,Main.FatalException) goto 2 # none, line 1:
        (Main.rethrow)(e)
        2:  # none, line 1:
        return (Main.println)(e)
    end))))))))

A quick hack like this would allow experimenting with making e.g. UndefVarError and AssertionError uncatchable, i.e. does this stop any code from working?

StefanKarpinski commented 8 years ago

I agree that uncatchable exceptions make sense – I suspect they should terminate the current task. That means you can still write processes that can carry on so long as some other task is still running.

tkelman commented 8 years ago

Having read that midori article several times now, I'm not convinced why we would want this. Uncatchable exceptions are a bit like private methods, in that they're a statement by the library writer that leverages them that he knows better than all his users, and is just forcing them to go looking for workarounds. For writing an operating system, that is a safe bet and a reasonable design decision. For the things people use Julia for, not so much. Even UndefVarError could have uses in exploratory automated generation of code. Things that seem like "obvious bugs" in isolation can be recoverable exception situations that trigger backtracking or mode switching in real algorithms (e.g. "restoration mode" in many optimization solvers).

StefanKarpinski commented 8 years ago

It could potentially make throwing uncatchable exceptions much more efficient – given the current cost of adding a single error return to a function call, that would be good. I also just don't really believe that people catch these sorts of things correctly in general. At least with a bugs-kill-the-task approach, you know exactly what you're catching – a failed task – and there's a reasonable possibility that the task that's catching the failure is in working order.
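A rough illustration of the bugs-kill-the-task idea using what current Julia already provides (this assumes Julia 1.3 or later, where `fetch` on a failed task throws `TaskFailedException`). The helper name `run_isolated` is made up for this sketch; it is not a proposed API.

```julia
# Run `f` in its own task. The supervising code never inspects which
# exception occurred inside; it only observes that the task failed.
function run_isolated(f)
    t = Task(f)
    schedule(t)
    try
        return (:ok, fetch(t))
    catch ex
        # Anything other than "the child task failed" is not ours to handle.
        ex isa TaskFailedException || rethrow()
        return (:failed, nothing)
    end
end

run_isolated(() -> 1 + 1)          # the task finished normally
run_isolated(() -> error("bug"))   # the task died; the caller carries on
```

The point of the sketch is that the supervising task is in a known-good state when it observes the failure, unlike a catch block running in the middle of the code that broke.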

samoconnor commented 8 years ago

My only concern with having just bugs-kill-the-task is that abandonment-on-bug behaviour could be lost in a situation like using pmap (where the user code runs on a separate task without them ever writing @async). But I'm sure the detail of that could be worked out.

johansigfrids commented 8 years ago

As I read Joe Duffy one of the things that made abandonment feasible on Midori was very light-weight processes, so if you wanted robustness in face of abandonment you could spin up a new process easily and do the processing there. I imagine a web server on Midori would do one process per request, so that even if one request did something crazy it would only kill that process, not the entire web server.

Rust does something similar with its panics, except there a panic only tears down the current thread, not the entire process. This does not isolate the potential fallout of a logic bug as well, but I suppose it is a better performance trade-off when running on systems with more heavy-weight processes.

Personally, having recently had to debug a subtle logic bug that had been feeding junk data into a database in production for six months I've developed a real preference for systems where bugs show up early and in spectacular fashion.

nalimilan commented 8 years ago

The problem I have with this is that it's a relatively complex system already, and yet it does nothing to prevent people from catching e.g. a DomainError with a catch that was only designed to catch an InexactError.

So I'm suggesting that (like in #7026) one would never be able to catch an exception type which wasn't explicitly mentioned after catch. With such a rule, if somebody writes catch UndefVarError, that's his/her problem, just like overriding +(::Int, ::Int) or modifying private fields is dangerous. I'd say we should design the general "chain of custody" system before deciding whether fatal exceptions are needed/useful.
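The rule of only catching what is explicitly named can be approximated in current Julia with a small helper. This is a sketch under the assumption that a library function is acceptable in place of new syntax; `catching` is a hypothetical name.

```julia
# Run `body`; hand the exception to `handler` only if it is of type `T`.
# Every other exception propagates untouched.
function catching(handler, body, ::Type{T}) where {T<:Exception}
    try
        return body()
    catch ex
        ex isa T || rethrow()
        return handler(ex)
    end
end

# A handler written for DomainError sees only DomainError.
catching(() -> sqrt(-1.0), DomainError) do ex
    NaN
end

# An InexactError raised under the same handler is not swallowed:
# catching(ex -> NaN, () -> Int(1.5), DomainError)  # rethrows InexactError
```

Because `T` may be a `Union` of exception types, a handler that genuinely expects several conditions can still name them all in one place.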

@StefanKarpinski Why should throwing a fatal error be fast? Are you proposing they would become a standard way to do control flow? Else, I don't really see the point. I also think @tkelman is right that sometimes it's useful to be able to catch even the most fatal exceptions for debugging or to temporarily work around an ugly bug in library code.

StefanKarpinski commented 8 years ago

@johansigfrids: Julia's tasks are lightweight enough to be used the way you describe. In fact, if you want to write a server that handles multiple requests concurrently (which you do), then you need to spawn a task for each of them anyway. Tearing down a thread won't really make any sense since our threading model will have tasks as the unit of work, and those will be mapped onto threads by the work scheduler. In other words, threads belong to the system, tasks belong to the program.

@nalimilan: What I was thinking (vaguely) in terms of performance is that we currently have to worry about unwinding the stack, figuring out where to unwind to, constructing an error object, etc. The problem isn't so much the time it takes to do this but the optimizations that being prepared to do it prevents. The task abandonment on bugs approach would make errors terminate the current task, which amounts to just putting the task in the "error" state and returning to the scheduler, all the hard work would be done on the handling side – whatever task is waiting on this one would have the entire task and its stack to figure out what happened. But the open questions are: can we avoid causing a GC frame in the caller of a method that errors, and can we avoid having errors prevent inlining of otherwise simple methods? I don't know, but if all you have to do is terminate the thread and call the scheduler, it does seem plausible that this could be easier.

I wholeheartedly agree that catching the wrong error is way too easy right now. This would mitigate that problem by making a whole class of errors that you shouldn't be catching at all just bypass any catch block. Joe Duffy's main point about separating out bugs from I/O exceptions and the like, is that it makes exceptions that are catchable far less common – otherwise it's impossible to write any code anywhere that isn't riddled with catchable exceptions. By distinguishing errors from exceptions and making only the latter catchable, the number of places where you have to worry about true exception handling is reduced to a manageable level.

The key issue is that the set of catchable exceptions a function can throw are really part of its signature: they are also ways for the function to return, and if you want to write a correct program you need to handle them. The "chain of custody" proposal makes this explicit by requiring you to annotate call sites with throws FooException for any exceptions that you expect. This is much like how you have to write r = f(...) to explicitly get a return value back. @JeffBezanson's main objection – which is entirely fair given how many things we currently treat as catchable exceptions, i.e. everything – is that we'd have to put this sort of throws annotation all over the place. But if most of the exceptions we currently throw are simply programmer errors and therefore uncatchable, then they cease to be part of a function's API and won't force us to put throws all over the place.

In Midori, the compiler forces you to handle catchable exceptions – programs won't compile, let alone run unless you handle all catchable exceptions. In Julia, we won't do that – unless you opt into it by running some kind of static code analysis tool on your program. But what we can do is convert an unexpected exception into a task-terminating error – because failing to handle an exception is a programmer error. This gives Julia libraries flexibility to evolve their APIs and introduce exceptions where they didn't previously exist: in Midori, this would cause a compile-time error, but in Julia programs would continue to work, raising errors if unexpected exceptions occur; if you do run static analysis tools on your code beforehand to detect unhandled exceptions, then you would get a warning about any new unhandled exceptions when you upgrade dependencies, and get a chance to handle them – but your code will still run.

StefanKarpinski commented 8 years ago

One way to think of catchable exceptions in the chain of custody model – and a possible way to implement them if we can reduce the class of exceptions sufficiently – is that they are literally part of the function signature and that the throws stuff is just syntactic sugar for adding keyword arguments for each exception-throwing site. In other words when you write this:

function bar(a, b)
    # before
    throw BarException()
    # after
end

function foo1(x)
    # before
    bar(2x, y) throws BarException
    # after
end

function foo2(x)
    # before
    bar(2x, y)
    # after
end

It is really a shorthand for writing this:

function bar(a, b; handleBarException=error)
    # before
    handleBarException(BarException())
    # after
end

function foo1(x; handleBarException=error)
    # stuff
    bar(2x, y, handleBarException=handleBarException)
    # more
end

function foo2(x)
    # stuff
    bar(2x, y)
    # more
end

Obviously for stuff like array indexing, we can't afford this kind of implementation, but if we reduced catchable exceptions to things like I/O and other non-bug conditions, then it might be a perfectly reasonable implementation. An interesting aspect of this implementation approach is that recovering from exceptions is trivial – just provide a handler that returns a value. I'm not sure if we'd want to do that or not.
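A runnable variant of the handler-passing idea, for experimenting with the recovery behaviour mentioned above. It is only a sketch: the condition `a > b`, the argument `10` and the default handler `throw` (used here instead of `error` so that the exception object itself is thrown) are assumptions added to make the example executable.

```julia
struct BarException <: Exception end

# The "exception" is just a call to whatever handler the caller supplied.
function bar(a, b; handleBarException = throw)
    a > b && return handleBarException(BarException())
    return a + b
end

# foo1 declares that it expects BarException by forwarding a handler.
function foo1(x; handleBarException = throw)
    return bar(2x, 10; handleBarException = handleBarException)
end

foo1(1)                                  # no exceptional condition
foo1(10; handleBarException = _ -> -1)   # handler "recovers" with a value
# foo1(10)                               # default handler throws BarException
```

Recovery falls out of the design for free, which is exactly the open question raised above: a handler that returns a value turns the throw site into an ordinary return.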

samoconnor commented 8 years ago

@StefanKarpinski it is interesting that your CoC model does almost the same thing as Midori, but opposite, i.e. Midori has no type annotation at the call site (just a try annotation) and has a throws Type annotation on the method definition.

I like the exception type annotation on the method signature because it is nice self-documentation.

I worry that the burden of annotating the type at the call site might discourage creation of fine-grained exception types (because call sites would end up with a growing list of throws Union{AuthExpired, PermissionDenied, Throttled, NetworkTimeout} at each call site). In the Midori model, just the unhandled types are annotated in the throws signature of the enclosing method. This encourages handling exceptions (to avoid long method signatures). The simple Midori try annotation at the call site retains some of the benefit of the CoC callsite annotation in that it draws attention to the possible exception in a code review.

A counter argument would be that: if it is poor form for an API method to return more than one or two exception types; and if catchable exceptions are outnumbered by hard errors 10:1; then throws type is not too verbose; and doesn't have to be used much anyway. Also, if you're writing some kind of low-level driver code that has to handle a bunch of different exceptions and propagate them up the stack a bit, you could always define expected_errors = Union{FooError, BarError, ...} and do f(x) throws expected_errors.

I'm interested to know what you think about this type-at-definition vs type-at-call-site tradeoff.

For those who haven't studied the Midori blog post, the example above would look like this "the Midori way":

function bar(a, b) throws BarException
    # before
    throw BarException()
    # after
end

function foo1(x)
    # before
    bar(2x, y)          <- error missing "try"
    # after
end

function foo2(x)        <- error missing "throws"
    # before
    try bar(2x, y)      <- "try" means rethrow whatever bar() throws. 
    # after
end

function foo3(x) throws BarException
    # before
    try bar(2x, y) 
    # after
end

See Easily Auditable Callsites.

The try at callsite way also has these syntactic sugar variations:

Alternate value on exception:

i = try foo(x, y) else 7

Exception as value (kind of like Nullable)...

type Result{T}
    value::T
    exception
end

result = try foo(x, y) else catch
if is_failure(result)
    log(result)
    throw(result.exception)
end
println(result.value)

Exception as value propagation (like a general form of NaN)...

x = try foo() else catch
y = try bar() else catch
z = x + y

See Syntactic Sugar.
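The first two variations above can be emulated with ordinary functions in current Julia. This is a sketch, not Midori's semantics: `attempt` and `try_else` are hypothetical helper names, and the `Result` layout is adapted so that a failed result has no value.

```julia
# Exception-as-value container: exactly one of the two fields is meaningful.
struct Result{T}
    value::Union{T, Nothing}
    exception::Any            # `nothing` on success
end

is_failure(r::Result) = r.exception !== nothing

# Emulates `result = try foo(x, y) else catch`.
function attempt(f, ::Type{T}) where {T}
    try
        return Result{T}(f(), nothing)
    catch ex
        return Result{T}(nothing, ex)
    end
end

# Emulates `i = try foo(x, y) else 7`.
function try_else(f, default)
    try
        return f()
    catch
        return default
    end
end

i = try_else(() -> parse(Int, "not a number"), 7)
r = attempt(() -> parse(Int, "42"), Int)
is_failure(r) || println(r.value)
```

Note that both helpers catch everything, so they have the same over-catching problem this issue is about; combined with an uncatchable class of errors they would only ever see recoverable exceptions.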

samoconnor commented 8 years ago

Why should throwing a fatal error be fast?

@nalimilan what Jeff said above hints at the reason:

If all disable-able checks are handled with abandonment, you can be sure correct programs will work with --check-bounds=no.

e.g. If you assume that @precondition causes abandonment, then:

In "normal mode" the compiler can optimise the body of a function under the assumption that the domain is constrained by the precondition (but it must still allow for graceful task termination).
In "safe release mode", the precondition check could cause a hard process exit; this provides more opportunity for optimisation, unrolling, inlining etc.
In "full release mode" (for fully tested code) the precondition check can be removed completely.

Other performance opportunities include:

"hybrid release mode" where preconditions are only checked for inter-module calls.
Parallel contract checking, where one thread runs the code as if preconditions never fail while a helper thread evaluates a backlog of preconditions. (The __builtin_expect branch-prediction hint does this down at the instruction level http://llvm.org/docs/LangRef.html#llvm-expect-intrinsic ).
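To make the modes above concrete, here is a sketch of a `@precondition` macro whose checks can be switched off. Everything in it is an assumption for illustration: the macro does not exist in Base, and the runtime flag `CHECK_PRECONDITIONS` merely imitates what a real implementation would decide at compile time.

```julia
# Runtime switch standing in for a compile-time "release mode" option.
const CHECK_PRECONDITIONS = Ref(true)

struct PreconditionViolation <: Exception
    msg::String
end

# Throw if the condition is false and checking is enabled.
macro precondition(cond)
    msg = string(cond)
    return quote
        if CHECK_PRECONDITIONS[] && !($(esc(cond)))
            throw(PreconditionViolation($msg))
        end
    end
end

function checked_sqrt(x)
    @precondition x >= 0
    return sqrt(x)
end

checked_sqrt(4.0)               # precondition holds in every mode
CHECK_PRECONDITIONS[] = false   # imitate "full release mode"
checked_sqrt(4.0)               # a correct program behaves identically
CHECK_PRECONDITIONS[] = true
```

The guarantee only holds if no code can catch `PreconditionViolation`: a program that relied on catching it would change behaviour when the check is removed.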

Keno commented 8 years ago

How would generic pass-through functions like open work if all exceptions have to be explicitly annotated? Have a generic supertype of all exceptions?

nalimilan commented 8 years ago

I guess my proposal boils down to this: functions have to be annotated via throws if they raise catchable exceptions; all other exceptions that may happen during their execution are fatal, i.e. lead to abandonment. So whether an exception is fatal wouldn't depend on its type, but on whether it was part of a function's signature or not.

Then the details of how abandonment happens can vary depending on compilation options: in release mode, the program would just abort. But in debugging mode or at the REPL, the exception would still be catchable using e.g. catchall, which would be quite convenient (you don't want the REPL to crash just because you triggered an UndefVarError).

This doesn't mean we shouldn't have guidelines about which exceptions should be considered fatal by function writers. For example, we would advise not to add throws ArgumentError to a function's signature: in general, this exception should be considered as a programmer error, not to be caught by the caller. On the contrary, errors due to connectivity issues, which cannot be anticipated by the caller, must be part of a function signature. But connectivity issues would also be considered fatal errors in other cases, for example if they happen deep in a call tree in a function which doesn't have any code path to handle them (which is a programmer error).

tkelman commented 8 years ago

This needs to be possible to work around even in release mode. If library A throws something that it considers fatal, and library B which wraps it does not handle that case, I can guarantee there will be cases where you need to call library B in a way that these "fatal" errors from inside A are recoverable. Classifying exceptions and degrees of fatalness is such a subjective thing, one decision will not be appropriate for all use cases. I'd hate to have to copy all of library B wholesale and need to modify its exception handling annotations to be able to do this. Or have to introduce Tasks for every single computation when so many computational use cases have so far been able to ignore the existence of the Task programming model entirely.

StefanKarpinski commented 8 years ago

@samoconnor: I appreciate the easily auditable call site thing, but it requires an amount of static analysis that we don't – and generally can't – do in Julia. When you see bar(2x, y) in a Julia function, which method of bar is being called? Since Julia is dynamic and highly polymorphic, we can't know this statically. We could, I suppose, make it a runtime error to call a method that throws an error without the try, but that's a very different situation than Midori's compile-time refusal to allow you to call bar without the try. At that point, since you don't get a compile-time warning anyway, it seems a bit strange to go halfway and get a runtime warning about an error that may or may not happen. It's not very Julian.

One way around this would be to make the throws property a function-level one instead of a method-level one. But you still have the potential for situations like this:

function foo0(x, y)
    bar(x, y) # parse-time error?
end

function bar throws BarException end # some syntax for declaring this

function foo1(x, y)
    bar(x, y) # parse-time error
end

function foo2(x, y)
    b = rand(Bool) ? bar : +
    b(x, y) # error?
end

function foo3(b, x, y)
    b(x, y) # could be bar, depending on how it's called
end

If the throws property is a function-level thing, then we can make it an error to call bar without the try as shown in the body of foo1. But what if bar is called before the declaration of foo1? We generally allow this sort of thing in Julia. Is that still a parse-time error? I guess the error occurs when the declaration of bar occurs? That's weird. It's also unclear that we can enforce this given the loose way that Julia lets you dynamically assign functions and call them. Consider foo2 – is that an error or not? Similarly, foo3 could be an error or not depending on what argument is passed to it. In a static language, whether a function throws or not would be encoded in its type and that would be tracked and checked at compile time. We don't and generally can't do that sort of thing in Julia.
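For comparison, a function-level throws property can be modelled at run time with a trait-style function. This is purely a sketch with invented names (`declared_throws`, `checked_call`); it shows the halfway runtime check discussed above rather than anything enforceable at parse time.

```julia
struct BarException <: Exception end

# Hypothetical registry: the catchable exception type a function declares.
# `Union{}` (the empty type) means "declares none".
declared_throws(f) = Union{}

bar(x, y) = x > y ? throw(BarException()) : x + y
declared_throws(::typeof(bar)) = BarException

# Refuse to call a function unless the caller acknowledges what it may throw.
function checked_call(f, args...; expects = Union{})
    declared_throws(f) <: expects ||
        error("call to $(f) must acknowledge $(declared_throws(f))")
    return f(args...)
end

# Like foo3 above: whether this is an error depends on what `b` is.
foo3(b, x, y) = checked_call(b, x, y)

foo3(+, 1, 2)                                  # fine: + declares nothing
checked_call(bar, 1, 2; expects = BarException) # fine: acknowledged
# foo3(bar, 1, 2)                              # runtime error, even though
#                                              # bar would not have thrown
```

The last line is the awkward case: the check fires on a call that would have succeeded, which is the runtime warning about an error that may or may not happen.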

The only way that we can in general, without completely changing the dynamic nature of the language, do this sort of thing is in an opt-in static mode where we do that sort of checking and separate the program into code that we know is ok, that we don't know about, and that we know is wrong.

Midori is trying to do something different than what my chain of custody proposal is aiming at: Midori's approach ensures that you cannot call any function without handling all possible exceptions (counting explicitly ignoring them as "handling"); the chain of custody proposal ensures that if an exception occurs that you didn't expect, a fatal error is raised, rather than the exception being caught by code expecting a different condition. Both approaches have in common that they make sure that when you catch an exception, it's actually the one you expected to catch, which can easily not be the case in Julia currently.

eschnett commented 8 years ago

Would it be possible to map the exception throwing and handling mechanism onto function arguments? I'm not proposing to use that as actual syntax, but because Julia's semantics for function calls and argument type matching are very well defined, defining a new mechanism in terms of this would avoid the need to explicitly handle exception declarations ("throws") in Julia's run-time system.

For example, a function that might throw 3 different exceptions might be represented internally as a function that takes 3 keyword arguments with particular reserved names and types. In this way, a mismatch would be detected, and the method selection mechanism would ensure that a function returning exception E can't be called from a site that doesn't handle exception E.

If keyword arguments don't work for this, then maybe a single argument with a parameterized type Exception{Union{... all handled exceptions ...}} could be added to the call.
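The single-argument variant can be tried out directly with a parametric marker type and a lower-bounded type parameter. This is a sketch only; `Handles` and `risky` are hypothetical names, and nothing here is a proposed implementation.

```julia
struct BarException <: Exception end

# Marker argument: `E` is the set of exception types the call site handles.
struct Handles{E} end

# Only callable from sites whose handled set includes BarException.
function risky(x, ::Handles{E}) where {E >: BarException}
    x < 0 && throw(BarException())
    return sqrt(x)
end

risky(4.0, Handles{BarException}())                        # exact match
risky(4.0, Handles{Union{BarException, ArgumentError}}())  # superset is fine
# risky(4.0, Handles{ArgumentError}())  # MethodError: BarException not handled
```

The mismatch is detected by method selection, as suggested, but still only when the call actually executes.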

StefanKarpinski commented 8 years ago

@nalimilan: your proposal is pretty similar in spirit to mine. The main differences, afaict, are whether you annotate the method signature with throws or put throws annotation inside the method body, and your proposal of catchall. In my original chain-of-custody proposal, I left our current try/catch form as what you're calling catchall (has the advantage of being backwards compatible) and didn't introduce any notion of uncatchable errors.

I think some of the reason @JeffBezanson didn't like my proposal may have had to do with him not understanding and me not conveying that the number of functions with throws annotations would be relatively limited: only functions whose official API includes throwing errors that we intend for people to be able to catch would need to do this. If we consider, e.g. out-of-bounds indexing to be a programmer error, that implies that the getindex API doesn't need to include throwing BoundsError since if you've caused a bounds error, you've made a mistake and you shouldn't be able to catch it. The throws annotation would only go on I/O APIs and the like, where exceptions can occur without programmer error.

StefanKarpinski commented 8 years ago

@eschnett: that's essentially what I proposed above with my handleBarException transformation.

nalimilan commented 8 years ago

@StefanKarpinski You're right, I was mostly adapting your plan to this issue's proposal regarding fatal errors. Maybe after this discussion distinguishing exceptions that are supposed to be caught (and therefore mentioned in the annotation) from others, Jeff will be more convinced...

samoconnor commented 8 years ago

From the original issue description:

This RFC focuses on "abandonment" for handling bugs. The eventual refinement of the try/catch mechanism can be dealt with separately. The proposal here is to add a simple fail-fast bug handling mechanism that can co-exist with whatever the final try/catch design turns out to be.

There is much discussion above about refinement of the try/catch mechanism. Obviously more discussion is needed before a conclusion is reached.

Putting that aside, and returning to the issue of "abandonment" for handling bugs, my question to @JeffBezanson and @StefanKarpinski is: would you support a minimal PR that adds an uncatchable exception type?

samoconnor commented 8 years ago

See #15906. This implements a very simple fatal error mechanism.

The intention of this WIP PR is:

to experiment with the isfatal function interface as a way of determining which errors are fatal
to find out if any code is broken by making a few selected error types uncatchable
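For readers who want a feel for a function-based interface, here is a sketch of how an `isfatal` predicate could classify errors. It is not the code from #15906; the method set and the `guarded` helper are assumptions for illustration only.

```julia
# Trait-style predicate: fatal-ness is decided per exception type by
# adding methods, without changing the exception type hierarchy.
isfatal(::Any) = false
isfatal(::UndefVarError) = true
isfatal(::AssertionError) = true

# Stand-in for the check that would run at the top of every catch block.
function guarded(f, fallback)
    try
        return f()
    catch ex
        isfatal(ex) && rethrow()
        return fallback
    end
end

guarded(() -> error("disk full"), 0)                 # recoverable: fallback
# guarded(() -> throw(AssertionError("bug")), 0)     # rethrows AssertionError
```

Compared with an abstract FatalException supertype, a predicate lets existing error types be marked fatal without changing their declared supertypes.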

If the general idea is accepted, there is lots of scope for the performance optimisations suggested by Jeff and Stefan to be added later. (e.g. disabling some checks in release mode, immediate task termination etc)