Keno opened this issue 1 year ago
This was originally a Slack thread, so I missed some prior discussion, but cancellation was also extensively discussed in https://github.com/JuliaLang/julia/issues/33248.
@gbaraldi raised the question of what to do on allocation failure. Copying my response here:
I think we need to separate allocation failure into two separate things: (1) failure of large allocations and (2) failure of small allocations.
I think it's fine to have an explicit, synchronous exception for the former. I think the latter just needs to freeze the task, potentially with some notification to an OOM monitor. We just do way too much explicit allocation to have this turn into an exception. The optimizer should also have the liberty to turn the former into the latter (if it can prove the allocation is small, or if it knows the allocation can be elided entirely regardless of size).
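For reference, the first category already behaves this way today: a large explicit allocation that cannot be satisfied throws a synchronous `OutOfMemoryError` (the exact failure mode can depend on the OS and its overcommit settings):

```julia
julia> zeros(UInt8, 10^15)   # ~1 PB explicit allocation; fails synchronously
ERROR: OutOfMemoryError()
```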
@gbaraldi also points out that I should link #49541, although that proposal is mostly orthogonal, because it's about how to figure out what to cancel on ^C, but not necessarily how to cancel it.
> - A return to the REPL on a new task/thread.
Does it mean that Ctrl-C would not actually interrupt the process, but only "hide" the fact that it keeps running? If the answer is yes, then that seems like a huge footgun, since the "interrupted" function would still keep running in the background, potentially overwriting some mutable state (vectors, matrices) while the user is trying to save/inspect it.
> or at least save any in-progress work they may have
>
> There are very few things more frustrating than losing your workspace state

It would be even more frustrating to realise that the "interrupted" process has not actually been interrupted and keeps changing the workspace state in the background.
> If the answer is yes, then that seems like a huge footgun, since the "interrupted" function would still keep running in the background, potentially overwriting some mutable state (vectors, matrices) while the user is trying to save/inspect it.
Yes, but this is the "emergency fallback mode", and would have to be appropriately messaged to the user. I'm thinking red flashing REPL prompt or something with a big red warning message.
> It would be even more frustrating to realise that the "interrupted" process has not actually been interrupted and keeps changing the workspace state in the background.
I think it would be fine to suspend the hung thread by default, with an option to resume it explicitly using some command. Again, 99% of users are not expected to ever hit this state. It's supposed to improve on the current situation, where you get crashes, segfaults, and arbitrary memory writes on ^C.
> Yes, but this is the "emergency fallback mode", and would have to be appropriately messaged to the user. I'm thinking red flashing REPL prompt or something with a big red warning message.
Perhaps the "emergency fallback mode" should use a different combination instead of Ctrl-C. SIGINT has a clear semantic of interrupting the process, and it doesn't necessarily imply that the user should regain control at any cost. It might be counterintuitive to let the user "interrupt" the process and then inform them that it hasn't actually been interrupted with flashy messages. Instead of abruptly exiting after multiple Ctrl-C presses, Julia could display a message suggesting a different keyboard combination for entering the "emergency fallback mode."
> I think it would be fine to suspend the hung thread by default, with an option to resume it
It sounds like a good idea, but it's not what I would expect from the intended function of Ctrl-C. There are specific signals (`SIGSTOP` and `SIGCONT`) designed for explicit stopping and resuming.
This is mostly about the behavior in the REPL. In that instance, Julia is taking on job control responsibilities, and there's no reason to require that it match POSIX job control semantics. That said, I think the idea of separating cancellation and suspension is reasonable. We'd have to play with it and see what people like best.
> avoid StackOverflowErrors entirely
Nathaniel J. Smith of "go statement considered harmful" fame has written about Ctrl-C and cancellation in his Trio structured-concurrency library:
note that this actually can cause segfaults currently:
```julia
julia> using Random

julia> function h(rng)
           try
               while true
                   rand(rng) < 0 && throw(DomainError("bad"))
               end
           catch e
               print(e)
           end
       end
h (generic function with 9 methods)

julia> h(Xoshiro())
^C^C^C^C^CWARNING: Force throwing a SIGINT
DomainError(#undef,
[519449] signal 11 (128): Segmentation fault
in expression starting at REPL[33]:1
typekeyvalue_hash at /home/oscardssmith/julia/src/jltypes.c:1828 [inlined]
lookup_typevalue at /home/oscardssmith/julia/src/jltypes.c:1136
jl_inst_arg_tuple_type at /home/oscardssmith/julia/src/jltypes.c:2479
arg_type_tuple at /home/oscardssmith/julia/src/gf.c:2388 [inlined]
jl_lookup_generic_ at /home/oscardssmith/julia/src/gf.c:3432 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3490
_show_default at ./show.jl:504
show_default at ./show.jl:487 [inlined]
show at ./show.jl:482 [inlined]
print at ./strings/io.jl:35
unknown function (ip: 0x7f81fc8e2b76) at (unknown file)
_jl_invoke at /home/oscardssmith/julia/src/gf.c:3306 [inlined]
ijl_apply_generic at /home/oscardssmith/julia/src/gf.c:3494
print at ./coreio.jl:3
h at ./REPL[32]:7
```
The TL;DR here is that type inference proves that `e` will be a `DomainError`, and so the code segfaults when it tries to load it.
Following up on recent discussions prompted by our improvements to the modeling of exception propagation, I've been thinking again about the semantics of asynchronous exceptions (in particular StackOverflowError and InterruptException). This is of course not a new discussion; see e.g. #4037, #7026, #15514. However, because of our recent improvements to exception flow modeling, I think this issue has gained new urgency.
## Current situation
Before jumping into some discussion about potential enhancements, let me summarize some relevant history and our current state. Let me know if I forgot anything and I'll edit it in.
### InterruptException
By default, we defer interrupt exceptions to the next GC safepoint. This helps avoid corruption caused by unwinding over state in C that isn't interrupt-safe. It helps a bit, but of course, if you are actually stuck in such a region, your experience will be something like the following:
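A reconstruction of the kind of session meant (not the original transcript): blocked inside C code, no safepoint is reached, so repeated ^C eventually forces the throw:

```julia
julia> ccall(:sleep, Cuint, (Cuint,), 60)   # blocked in C, no GC safepoint reached
^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
```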
Which isn't any better than before (because we're falling back to what we used to do) and arguably worse (because you had to press ^C many times).
### StackOverflowError
StackOverflowError just exposes the OS notion of stack overflow (i.e. if something touches the guard page, the OS sends us a SEGV, which we turn into an appropriate Julia error). We are slightly better here than we used to be, since we now at least stack-probe large allocations (https://github.com/JuliaLang/julia/pull/40068).
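For reference, the mechanism is easy to observe with unbounded recursion; the guard-page SEGV is translated into a catchable Julia exception:

```julia
julia> f() = f()   # unbounded recursion grows the stack until it hits the guard page
f (generic function with 1 method)

julia> f()
ERROR: StackOverflowError:
```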
Nevertheless, this is still not particularly well-defined semantically. For example, is the following actually semantically sound?
https://github.com/JuliaLang/julia/blob/187e8c2222878c68b2afc9295ab8dc61773bd7f2/base/strings/io.jl#L32-L40
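The linked code is, approximately (quoted here as a sketch of that revision, not verbatim):

```julia
# base/strings/io.jl (approximate contents of the linked lines)
function print(io::IO, x)
    lock(io)
    try
        show(io, x)
    finally
        unlock(io)
    end
    return nothing
end
```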
I think the answer is probably "no", because setting up the exception frame touches the stack, so we could be generating a StackOverflowError after the lock, but before we enter the try/finally region. Additionally, if we are close enough to the stack limit to cause a stack overflow, there's no guarantee that we won't immediately hit that same stack overflow again trying to run the unlock code.
### Recent try/finally elision
On master, inference has the capability to reason about the type of exceptions and whether or not catch blocks run. As a result, we can end up eliding try/finally blocks when everything inside the try block is proven nothrow, which can produce the following behavior:
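A minimal reconstruction of the behavior meant (the empty loop is provably nothrow, so the surrounding exception frame can be elided):

```julia
julia> function g()
           try
               while true end          # proven nothrow, so the try/finally is elided
           finally
               println("finally ran")  # never printed when the forced interrupt unwinds
           end
       end;

julia> g()
^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
```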
i.e. the finally block never ran.
## Some thoughts on how to move forward
I don't think I really have answers, but here are some scattered thoughts:
### A possible design for cancellation
I think the general consensus among language and API designers is that arbitrary cancellation is unworkable as an interface. Instead, one should favor explicit cancellation requests and cancellation checks. In that vein, we could consider having an explicit `@cancel_check` macro that expands to a cancellation check at that point.
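A minimal sketch of what such a macro could look like (`cancellation_requested` is the runtime query discussed below; the `CancellationRequested` exception type and the exact expansion are placeholders, not an existing API):

```julia
# Hypothetical sketch, not an existing Julia API.
struct CancellationRequested <: Exception end      # placeholder exception type

macro cancel_check()
    quote
        # cancellation_requested() is assumed to be a cheap runtime query,
        # e.g. a task-local flag set by the interrupt machinery.
        if cancellation_requested()
            throw(CancellationRequested())
        end
    end
end
```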
For more complex use cases, `cancellation_requested` could be called directly and additional cleanup performed (e.g. requesting the cancellation of any synchronously launched I/O operations or something). As an additional optimization, we can take advantage of our (recently much improved) `:effect_free` modeling to add the ability to reset (by longjmp) to the previous `cancellation_requested` check if there have been no intervening side effects. This extension could then also be used by external C libraries to register their own cancellation mechanism, in effect giving us back some variant of scoped asynchronous cancellation, but only when semantically unobservable or explicitly opted into.

That of course leaves the question of what would happen if there is no cancellation point set. My preference here would be to wait a reasonable amount of time (a few seconds or so, bypassable by a second press of ^C) and, if no cancellation point is reached in time, fall back to a return to the REPL on a new task/thread.
This way, we never throw any unsafe asynchronous exceptions that could corrupt the process state, but give the user back a REPL that they can use to either investigate the problem, or at least save any in-progress work they may have. There are very few things more frustrating than losing your workspace state because the ^C you pressed happened to corrupt and crash your process.
One final note here is to ask what should happen while we're in inference or LLVM. Since they are not modeled, we are not semantically supposed to throw any InterruptExceptions there. With the design above, the answer would be that, on entry, we stop inferring/compiling things and instead proceed in the interpreter, in the hope of hitting the next cancellation point as quickly as possible. If cancellation becomes active while we are compiling, we would try to bail out as soon as feasible.
## My recommendation
Having written all this down, I think my preference would be a combination of the above cancellation proposal with some mechanism to avoid StackOverflowErrors entirely. To start with, I think we could enable some sort of segmented task stack support, but treat triggering it as an error to be thrown at the next cancellation point. I think we should also investigate whether we can more fully model a function's stack size requirements, since we tend to be more aggressively devirtualized. If we can, then we could consider using a segmented stack mechanism more widely, but I think even if there is some performance penalty, getting rid of the possibility of asynchronous exceptions is well worth it.