Kernel stuck in infinite loop after exception in trap

GeeTransit commented 5 years ago

Sorry for writing this using `we`. It makes this look like everyone encounters this problem when only I do :sweat_smile:

Problem:

If we run this code:

import curio

async def failing():
    # This can be anything that causes an exception inside the kernel.
    await curio.traps._kernel_trap(
        "roses are red, brackets mean list,"
        "this trap that i made doesnt exist."
    )

curio.run(failing)

It seems to never return or raise an exception. It just stops there and no traceback, no nothing :(

Simple Debug:

If we run this however:

import curio

async def failing():
    # This can be anything that causes an exception inside the kernel.
    await curio.traps._kernel_trap(
        "roses are red, brackets mean list, "
        "this trap that i made doesnt exist."
    )

And then go into a interactive session:

>>> kernel = curio.Kernel()
>>> kernel.run(failing)
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    kernel.run(failing)
  File "C:\Users\username\AppData\Local\Programs\Python\Python37-32\
lib\site-packages\curio\kernel.py", line 152, in run
    task = self._runner.send(coro)
  File "C:\Users\username\AppData\Local\Programs\Python\Python37-32\
lib\site-packages\curio\kernel.py", line 794, in _run_coro
    traps[trap[0]](*trap[1:])
KeyError: "roses are red, brackets mean list, this trap that i made doesnt exist."
>>> kernel.run(shutdown=True)  # This hangs forever...

Digging Deep:

Note: For this section, `task` is the implicit task that was created from `failing`.

I did some debugging and found that because task is running when the exception happened, task.cancel_func is still None. The code for _trap_cancel_task (in the kernel) checks task.allow_cancel and task.cancel_func to determine if the task can be cancelled.

The exception gets put into task.cancel_pending if it cannot be cancelled right now. When handling our failed task, it finds that task.allow_cancel is True, but task.cancel_func is still None as the kernel exits after the trap call fails (at traps[trap[0]](*trap[1:])).

In the kernel, task.cancel_func acts as a flag where None means that the task is ready / running and anything else means it's suspended. The kernel still thinks that the task is all fine and running along when it's actually stuck in a loophole. The task isn't on the ready queue because it was just running a few moments ago. Because you can't manually reschedule a task, nor can you cancel it, the task stays in an awkward place where you can't get it out without changing some internal state or putting the task back on the kernel's ready queue.

Possible Solutions:

We leave it as is. The user can just hope that someone doesn't hit Ctrl+C during the wrong time.
Apply another try-except around the task call that sets active.cancel_func to a dummy lambda so that _shutdown_tasks (called from Kernel.run(shutdown=True)) can still cancel the task.
Give some extra ability for _shutdown_tasks to forcibly cancel any disobedient tasks.
Cancel and reschedule the task manually after the trap call fails (but still raise the error). task.report_crash could be set to False as the failure was already raised. (This would effectively revert back to the behaviour that Curio 0.9 had but ignores the task's actual final exception.
Call task.coro.close() and hope it doesn't catch GeneratorExit.
Revert back to Curio 0.9's trap calling where exceptions are passed onto the task. (nooo there must be a way D:)

dabeaz commented 5 years ago

I'm not that concerned about the case where someone is passing bad data to a raw kernel-level trap (users should rarely, if ever, be using traps directly). There are no kernel-level functions that should be raising exceptions either (except in the case of a bug in the kernel itself). That said, will look into it further.

GeeTransit commented 5 years ago

It could be stuck there by pressing Ctrl+C and raising a KeyboardInterrupt too.

dabeaz commented 5 years ago

Modified the kernel to simply "drop" the current task entirely if there's an in-kernel exception. That might not sound clean, but any kind of exception occurring in-kernel isn't clean anyways. I don't want to give the impression that there's any sane/graceful way of handling it.

GeeTransit commented 5 years ago

Sounds good too :)

njsmith commented 5 years ago

Modified the kernel to simply "drop" the current task entirely if there's an in-kernel exception

Something we learned the hard way: this can cause havok if the task is supposed to be supervising children (https://github.com/python-trio/trio/issues/552#issuecomment-457879973). Curio's supervision model is different and you might not care anyway because this is such an edge case, but it caught me by surprise so I figured you might want a heads up.

dabeaz commented 5 years ago

Please don't post Trio links here. I don't post Curio stuff all over your issue tracker.

njsmith commented 5 years ago

Sorry about that, and thanks for letting me know it bothers you. (For the record: you're welcome to post curio-related links on the trio issue tracker if they're relevant. I do :-).)

dabeaz commented 5 years ago

Tweaked the modification slightly to not allow any further use of the kernel after a fatal crash. This basically makes the shutdown process a no-op.

dabeaz commented 5 years ago

Re: Trio links. I don't mean to be overly disagreeable or grumpy, but jumping in here to post links to a separate project feels a bit like hovering. Thanks for listening though.

dabeaz / curio