Keno opened this issue 1 year ago
This was originally a Slack thread, so I missed some prior discussion, but cancellation was also extensively discussed in https://github.com/JuliaLang/julia/issues/33248.
@gbaraldi raised the question of what to do on allocation failure. Copying my response here:
I think we need to separate allocation failure into two separate things: (1) failure of large allocations and (2) failure of small allocations.
I think it's fine to have an explicit, synchronous exception for the former. I think the latter just needs to freeze the task, potentially with some notification to an OOM monitor. We just do way too much explicit allocation to have this turn into an exception. The optimizer should also have the liberty to turn the former into the latter (if it can prove the allocation is small, or if it knows the allocation can be elided entirely regardless of size).
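For reference, the first category already behaves this way today: a large explicit allocation that cannot be satisfied throws a synchronous `OutOfMemoryError` (the exact failure mode can depend on the OS and its overcommit settings):

```julia
julia> zeros(UInt8, 10^15)   # ~1 PB explicit allocation; fails synchronously
ERROR: OutOfMemoryError()
```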
@gbaraldi also points out that I should link #49541, although that proposal is mostly orthogonal, because it's about how to figure out what to cancel on ^C, but not necessarily how to cancel it.
> - A return to the REPL on a new task/thread.
Does it mean that Ctrl-C would not actually interrupt the process, but only "hide" the fact that it keeps running? If the answer is yes, then that seems like a huge footgun, since the "interrupted" function would still keep running in the background, potentially overwriting some mutable state (vectors, matrices) while the user is trying to save/inspect it.
> or at least save any in-progress work they may have
>
> There are very few things more frustrating than losing your workspace state

It would be even more frustrating to realise that the "interrupted" process has not actually been interrupted and keeps changing the workspace state in the background.
> If the answer is yes, then that seems like a huge footgun, since the "interrupted" function would still keep running in the background, potentially overwriting some mutable state (vectors, matrices) while the user is trying to save/inspect it.
Yes, but this is the "emergency fallback mode", and would have to be appropriately messaged to the user. I'm thinking red flashing REPL prompt or something with a big red warning message.
> It would be even more frustrating to realise that the "interrupted" process has not actually been interrupted and keeps changing the workspace state in the background.
I think it would be fine to suspend the hung thread by default, with an option to resume it explicitly using some command. Again, 99% of users are not expected to ever hit this state. It's supposed to improve on the current situation, where you get crashes, segfaults, and arbitrary memory writes on ^C.
> Yes, but this is the "emergency fallback mode", and would have to be appropriately messaged to the user. I'm thinking red flashing REPL prompt or something with a big red warning message.
Perhaps the "emergency fallback mode" should use a different combination instead of Ctrl-C. SIGINT has a clear semantic of interrupting the process, and it doesn't necessarily imply that the user should regain control at any cost. It might be counterintuitive to let the user "interrupt" the process and then inform them that it hasn't actually been interrupted with flashy messages. Instead of abruptly exiting after multiple Ctrl-C presses, Julia could display a message suggesting a different keyboard combination for entering the "emergency fallback mode."
> I think it would be fine to suspend the hung thread by default, with an option to resume it
It sounds like a good idea, but it's not what I would expect from the intended function of Ctrl-C. There are specific signals (`SIGSTOP` and `SIGCONT`) designed for explicit stopping and resuming.
This is mostly about the behavior in the REPL. In that instance, Julia is taking on job control responsibilities, and there's no reason to require that it match POSIX job control semantics. That said, I think the idea of separating cancellation and suspension is reasonable. We'd have to play with it and see what people like best.
> avoid StackOverflowErrors entirely
Nathaniel J. Smith of "go statement considered harmful" fame has written about Ctrl-C and cancellation in his Trio structured-concurrency library:
note that this actually can cause segfaults currently:
```julia
julia> using Random

julia> function h(rng)
           try
               while true
                   rand(rng) < 0 && throw(DomainError("bad"))
               end
           catch e
               print(e)
           end
       end
h (generic function with 9 methods)

julia> h(Xoshiro())
^C^C^C^C^CWARNING: Force throwing a SIGINT
DomainError(#undef,
[519449] signal 11 (128): Segmentation fault
in expression starting at REPL[33]:1
typekeyvalue_hash at /home/oscardssmith/julia/src/jltypes.c:1828 [inlined]
lookup_typevalue at /home/oscardssmith/julia/src/jltypes.c:1136
jl_inst_arg_tuple_type at /home/oscardssmith/julia/src/jltypes.c:2479
arg_type_tuple at /home/oscardssmith/julia/src/gf.c:2388 [inlined]
jl_lookup_generic_ at /home/oscardssmith/julia/src/gf.c:3432 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3490
_show_default at ./show.jl:504
show_default at ./show.jl:487 [inlined]
show at ./show.jl:482 [inlined]
print at ./strings/io.jl:35
unknown function (ip: 0x7f81fc8e2b76) at (unknown file)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3306 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3494
print at ./coreio.jl:3
h at ./REPL[32]:7
```
The TL;DR here is that type inference proves that `e` will be a `DomainError`, and so the code segfaults when it tries to load it.
Following up on recent discussions prompted by our improvements to the modeling of exception propagation, I've been thinking again about the semantics of asynchronous exceptions (in particular StackOverflowError and InterruptException). This is of course not a new discussion; see e.g. #4037, #7026, #15514. However, because of our recent improvements to exception flow modeling, I think this issue has gained new urgency.
## Current situation
Before jumping into some discussion about potential enhancements, let me summarize some relevant history and our current state. Let me know if I forgot anything and I'll edit it in.
### InterruptException
By default, we defer interrupt exceptions to the next GC safepoint. This helps avoid corruption caused by unwinding over state in C that isn't interrupt-safe. It helps a bit, but of course, if you are actually stuck in such a region, your experience will be something like the following:
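A reconstruction of the kind of session meant (not the original transcript): blocked inside C code, no safepoint is reached, so repeated ^C eventually forces the throw:

```julia
julia> ccall(:sleep, Cuint, (Cuint,), 60)   # blocked in C, no GC safepoint reached
^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
```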
Which isn't any better than before (because we're falling back to what we used to do) and arguably worse (because you had to press ^C many times).
### StackOverflowError
StackOverflowError just exposes the OS notion of stack overflow (i.e. if something touches the guard page, the OS sends us a SEGV, which we turn into an appropriate Julia error). We are slightly better here than we used to be, since we now at least stack-probe large allocations (https://github.com/JuliaLang/julia/pull/40068).
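For reference, the mechanism is easy to observe with unbounded recursion; the guard-page SEGV is translated into a catchable Julia exception:

```julia
julia> f() = f()   # unbounded recursion grows the stack until it hits the guard page
f (generic function with 1 method)

julia> f()
ERROR: StackOverflowError:
```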
Nevertheless, this is still not particularly well-defined semantically. For example, is the following actually semantically sound?
https://github.com/JuliaLang/julia/blob/187e8c2222878c68b2afc9295ab8dc61773bd7f2/base/strings/io.jl#L32-L40
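The linked code is, approximately (quoted here as a sketch of that revision, not verbatim):

```julia
# base/strings/io.jl (approximate contents of the linked lines)
function print(io::IO, x)
    lock(io)
    try
        show(io, x)
    finally
        unlock(io)
    end
    return nothing
end
```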
I think the answer is probably "no", because setting up the exception frame touches the stack, so we could be generating a StackOverflowError after the lock, but before we enter the try/finally region. Additionally, if we are close enough to the stack limit to cause a stack overflow, there's no guarantee that we won't immediately hit that same stack overflow again trying to run the unlock code.
### Recent try/finally elision
On master, inference has the capability to reason about the type of exceptions and whether or not catch blocks run. As a result, we can end up eliding try/finally blocks when everything inside the try block is proven nothrow, which can produce the following behavior:
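A minimal reconstruction of the behavior meant (the empty loop is provably nothrow, so the surrounding exception frame can be elided):

```julia
julia> function g()
           try
               while true end          # proven nothrow, so the try/finally is elided
           finally
               println("finally ran")  # never printed when the forced interrupt unwinds
           end
       end;

julia> g()
^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
```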
i.e. the finally block never ran.
## Some thoughts on how to move forward
I don't think I really have answers, but here are some scattered thoughts:
### A possible design for cancellation
I think the general consensus among language and API designers is that arbitrary cancellation is unworkable as an interface. Instead, one should favor explicit cancellation requests and cancellation checks. In that vein, we could consider having an explicit `@cancel_check` macro that expands to a cancellation check at that point.
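A minimal sketch of what such a macro could look like (`cancellation_requested` is the runtime query discussed below; the `CancellationRequested` exception type and the exact expansion are placeholders, not an existing API):

```julia
# Hypothetical sketch, not an existing Julia API.
struct CancellationRequested <: Exception end      # placeholder exception type

macro cancel_check()
    quote
        # cancellation_requested() is assumed to be a cheap runtime query,
        # e.g. a task-local flag set by the interrupt machinery.
        if cancellation_requested()
            throw(CancellationRequested())
        end
    end
end
```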
For more complex use cases, `cancellation_requested` could be called directly and additional cleanup performed (e.g. requesting the cancellation of any synchronously launched I/O operations or something). As an additional optimization, we can take advantage of our (recently much improved) `:effect_free` modeling to add the ability to reset (by longjmp) to the previous `cancellation_requested` check if there have been no intervening side effects. This extension could then also be used by external C libraries to register their own cancellation mechanism, in effect giving us back some variant of scoped asynchronous cancellation, but only when semantically unobservable or explicitly opted into.

That of course leaves the question of what would happen if there is no cancellation point set. My preference here would be to wait a reasonable amount of time (a few seconds or so, bypassable by a second press of ^C) and, if no cancellation point is reached in time, fall back to a return to the REPL on a new task/thread.
This way, we never throw any unsafe asynchronous exceptions that could corrupt the process state, but give the user back a REPL that they can use to either investigate the problem, or at least save any in-progress work they may have. There are very few things more frustrating than losing your workspace state because the ^C you pressed happened to corrupt and crash your process.
One final note here is to ask what should happen while we're in inference or LLVM. Since they are not modeled, we are not semantically supposed to throw any InterruptExceptions there. With the design above, the answer would be that, on entry, we stop inferring/compiling things and instead proceed in the interpreter, in the hope of hitting the next cancellation point as quickly as possible. If cancellation becomes active while we are compiling, we would try to bail out as soon as feasible.
## My recommendation
Having written all this down, I think my preference would be a combination of the above cancellation proposal with some mechanism to avoid StackOverflowErrors entirely. To start with, I think we could enable some sort of segmented task stack support, but treat triggering it as an error to be thrown at the next cancellation point. I think we should also investigate whether we can more fully model a function's stack size requirements, since we tend to be more aggressively devirtualized. If we can, then we could consider using a segmented stack mechanism more widely, but I think even if there is some performance penalty, getting rid of the possibility of asynchronous exceptions is well worth it.