dabeaz / curio

Good Curio!
Other
4.01k stars 240 forks source link

Errors handling task cancellation should propagate to the task group #294

Closed Zaharid closed 4 years ago

Zaharid commented 5 years ago

The following code

import curio

async def to_cancel():
    try:
        await curio.sleep(10)
    except curio.TaskCancelled:
        1/0

async def main():
    async with curio.TaskGroup() as tg:
        await tg.spawn(to_cancel)
        await tg.cancel_remaining()

curio.run(main)
print("Success...?")

will print the ZeroDivisionError in the exception handler, but otherwise it will appear to exit successfully.

$ python xx.py
Task Crash: Task(id=3, name='to_cancel', state='TERMINATED')
Traceback (most recent call last):
  File "xx.py", line 5, in to_cancel
    await curio.sleep(10)
  File "/home/zah/anaconda3/lib/python3.7/site-packages/curio/task.py", line 600, in sleep
    return await _sleep(seconds, False)
  File "/home/zah/anaconda3/lib/python3.7/site-packages/curio/traps.py", line 82, in _sleep
    return (yield (_trap_sleep, clock, absolute))
curio.errors.TaskCancelled: TaskCancelled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zah/anaconda3/lib/python3.7/site-packages/curio/kernel.py", line 747, in _run_coro
    trap = current._throw(current.next_exc)
  File "/home/zah/anaconda3/lib/python3.7/site-packages/curio/task.py", line 166, in _task_runner
    self.next_value = await coro
  File "xx.py", line 7, in to_cancel
    1/0
ZeroDivisionError: division by zero
Success...?

$ echo $status
0

I think the exception should be propagated in the implicit tg.join at the exit of the context manager, and from there all the way to curio.run.

dabeaz commented 5 years ago

This behavior is not so much a feature of TaskGroup, but of tasks more generally. If you want the final result of a task, you normally have to explicitly call Task.join() or access the value of Task.result. Cancellation is kind of strange though. If you're cancelling a task, it usually means that you're walking away from it and disregarding any final result in might produce. This currently includes any kind of abnormal termination via exception or programming error (such errors are currently ignored). In some sense, I suppose it's similar to Python's behavior of ignoring an exception if it gets raised inside the del() method of a class.

I'm not entirely sure how I would expect Curio to behave here. If an abnormal exception occurs as part of cancelling a task, does that propagate out of the cancel call? For something like cancel_remaining() which cancels a whole group of tasks, does an error in a single task cause it to prematurely stop or does it still guarantee cancellation of all tasks? If all tasks end abnormally, how does that get reported?

Admittedly, I'm not sure any of this provides an actual answer.

Zaharid commented 5 years ago

Personally I see the behaviour of del more as a limitation of the implementation/interface than as a desirable feature. Same goes for thread interfaces that disregard errors in spawned tasks. I believe that curio should guarantee that if there is an unhandled exception somewhere, it will eventually get bubbled out to the caller of run, so that a program with unhandled errors cannot exit successfully.

A possibility that occurs to me is that Kernel.run checks for unhandled exceptions before it is about to return and raises something if there are some (maybe some multierror thing), instead of the current behaviour of logging errors when they occur.

I'm not entirely sure how I would expect Curio to behave here. If an abnormal exception occurs as part of cancelling a task, does that propagate out of the cancel call? For something like cancel_remaining() which cancels a whole group of tasks, does an error in a single task cause it to prematurely stop or does it still guarantee cancellation of all tasks? If all tasks end abnormally, how does that get reported?

I would certainly expect an error in task.cancel to propagate out. For a TaskGroup, I would propose:

diff --git a/curio/task.py b/curio/task.py
index ae4cda8..6d32945 100644
--- a/curio/task.py
+++ b/curio/task.py
@@ -461,11 +461,16 @@ class TaskGroup(object):
         '''
         self._closed = True
         running = list(self._running)
+        cancel_failed = []
         for task in running:
             await task.cancel(blocking=False)
         for task in running:
             await task.wait()
             self._task_discard(task)
+            if task.next_exc and not isinstance(task.next_exc, TaskCancelled):
+                cancel_failed.append(task)
+        if cancel_failed:
+            raise TaskGroupError(cancel_failed)

which seems to pass the tests currently. Ideally with a better representation for TaskGroupError that shows where the actual problem(s) occurred by default.

njsmith commented 5 years ago

I believe that curio should guarantee that if there is an unhandled exception somewhere, it will eventually get bubbled out to the caller of run

I don't want to be rude so I'll be brief, and then butt out of the thread. But if this is an important guarantee for you, then you might want to be aware of Trio, where exactly what you wrote is a marquee feature.

Zaharid commented 5 years ago

I believe that curio should guarantee that if there is an unhandled exception somewhere, it will eventually get bubbled out to the caller of run

I don't want to be rude so I'll be brief, and then butt out of the thread. But if this is an important guarantee for you, then you might want to be aware of Trio, where exactly what you wrote is a marquee feature.

Thanks, I had already read that post with interest. In fact I like it so much that I think curio should do something similar; I see no reason why it can't, since AFAICT the main incompatibility between curio and trio is related to the precise meaning of await (cancel point vs interacting with the kernel), which is not so directly related with this issue.

I'd say that other than maybe needing to have some implicit scope for curio.spawn (which could be having run check for exceptions) there is no fundamental reason why curio cannot have structured concurrency.

dabeaz commented 5 years ago

There are a lot of things Curio could have.

I'm going to sleep on this for a bit. There are a few very specific ways to check on the outcome of a task in Curio. Cancellation is not one of them. I'm not sure that I want to attach some kind of new semantics to cancellation to do it either. Cancellation is cancellation. It's not cancellation plus some kind of extra checking thing. I'd consider a patch to have run() return with an exception is tasks are exiting with abnormal exceptions. That's a slightly different thing though.

As for structured concurrency, what is missing from TaskGroup specifically?

Zaharid commented 5 years ago

I'm going to sleep on this for a bit. There are a few very specific ways to check on the outcome of a task in Curio. Cancellation is not one of them. I'm not sure that I want to attach some kind of new semantics to cancellation to do it either. Cancellation is cancellation. It's not cancellation plus some kind of extra checking thing. I'd consider a patch to have run() return with an exception is tasks are exiting with abnormal exceptions. That's a slightly different thing though.

Could then the abnormal exception be raised in join / scope exit rather than cancel, as suggested originally?

As for structured concurrency, what is missing from TaskGroup specifically?

I'd argue this issue is one instance, in that it is possible to have code in abnormal state (failed cancellation) outside the TaskGroup scope and have it appear normal without error conditions.

njsmith commented 5 years ago

AFAICT the main incompatibility between curio and trio is related to the precise meaning of await (cancel point vs interacting with the kernel),

Historically yeah, that was the disagreement where I decided curio wasn't going to work for me, but since then the structured concurrency stuff has emerged as probably a larger differentiator.

I'd say that other than maybe needing to have some implicit scope for curio.spawn (which could be having run check for exceptions) there is no fundamental reason why curio cannot have structured concurrency.

IMO the key feature of structured concurrency is that when you call a function, you know that any tasks it spawns will be encapsulated within the function call. Implicit scopes break that. Now one could argue (and Dave has argued) that the flexibility afforded by curio.spawn is worth it and people can use it carefully, etc.; I don't want to barge in and get into that argument all over again :-). But the existence of curio.spawn is a major differentiator between Curio and Trio's approaches.

It sounds like this thread has also found some more subtle differences between how Curio taskgroups and Trio nurseries handle exceptions. For whatever it's worth, the way Trio handles these things is: the nursery always preserves all exceptions from child tasks, except for cancellation exceptions that were triggered by the nursery itself, and re-raises them together after all the nursery's children have exited.

dabeaz commented 5 years ago

I'd argue this issue is one instance, in that it is possible to have code in abnormal state (failed cancellation) outside the TaskGroup scope and have it appear normal without error conditions.

Curio TaskGroups already report exceptions that occur in the group. However, that's not what you're doing here. You're making an explicit call to cancel_remaining() which is doing exactly that--cancelling all remaining tasks and moving on. Yes, you've got a bug in your cancellation code and that's unfortunate. However, Curio wasn't silent about it. Maybe it could do more than log it. Maybe not. I'll look into it.

dabeaz commented 5 years ago

Looking at this, it's much easier to just have Task.cancel() and TaskGroup.cancel_remaining() re-raise any non-cancellation related exception that might have occurred in a task as a result of performing the cancel operation. I've pushed a patch. I'm still going to think about it and might take a different approach later. Let me know if it fixes the problem for you.

Zaharid commented 5 years ago

I'd say this is an improvement in that it handles well some common cases. I can see a couple of possibilities where exceptions are not bubbled: One is calling task.cancel(blocking=False) and the other in a TaskGroup if there is more than one error.

I don't think that all exceptions in cancel handlers are necessarily bugs. One example is to try to write work done so far to a file, which might fail in itself.

imrn commented 5 years ago

I think an async framework should not interfere with any of the exceptions. Even its internal exceptions should not be handled by itself.

dabeaz commented 5 years ago

I have reverted the change I made and plan to do nothing about this issue.

In pondering this, I got to thinking.... when was the last time I ever wanted a buggy signal handler in a process to abort a shell script that issued it a "kill" command? Or for a buggy signal handler to halt the entire operating system with a panic? Never.

Or when have I even written code where I wanted to check on the final status of a process after I explicitly killed it? Also, never.

Cancellation means cancellation. Tasks stop. That is the only guarantee Curio provides here. I don't much care how a Task stops whether it is by graceful shutdown or whether it gets run over by a bus after stepping into a crosswalk. If you actually care about the final state, then go look at the body afterwards.

There's often a tendency to make code try to cover every possible contingency in the name of safety or convenience. However, what often ends up happening instead is that you get this giant hairball of semantic edge cases that no-one can quite understand. Meanwhile, as your plane is quickly diving out of the sky towards the ground, you find that you don't have the time to go read the 2000 word description of the rationale for the hairball.

Curio cancel. Task no more, terminated. Check status elsewhere.

Hey, a Haiku! It fits in a tweet even. I digress.

Honestly, my main concern here is Curio being silent--it is not. If a Task dies in some unusual and/or spectacular way, it definitely gets logged. You could probably add some additional instrumentation around that using Curio's activation feature (https://curio.readthedocs.io/en/latest/reference.html#module-curio.activation). However, I'm not inclined to add any extra semantics to cancel() or TaskGroup for it.

imrn commented 5 years ago

There's often a tendency to make code try to cover every possible contingency in the name of safety or convenience. However, what often ends up happening instead is that you get this giant hairball of semantic edge cases that no-one can quite understand. Meanwhile, as your plane is quickly diving out of the sky towards the ground, you find that you don't have the time to go read the 2000 word description of the rationale for the hairball.

Definitely. Even a safety system may turn out to be the worst thing to mask an unsound design.

dabeaz commented 5 years ago

"Safety" wasn't my main motivation with Curio, but making it "understandable" is a high priority. That can mean many things I suppose, but I definitely think it should also include the actual implementation of Curio itself. If I were building a system where safety was a goal, I'd probably want to know how things work. There's also a pretty good chance I'd also take some time to audit code in critical parts of libraries that I was using. Things are a lot easier to understand if different concepts aren't mixed together.

I sometimes worry about coming across as needlessly "stubborn" on Curio changes/patches, but the connection/encapsulation of parts is something I think about a lot in this project. I don't always have time to work on Curio, but when I do have time to think about it, I'm mostly thinking about ways that I might remove features from it--not make it more complicated.