frequenz-floss / frequenz-sdk-python

Frequenz Python Software Development Kit (SDK)
https://frequenz-floss.github.io/frequenz-sdk-python/
MIT License
13 stars 17 forks source link

Most tasks swallow exceptions #826

Open llucax opened 9 months ago

llucax commented 9 months ago

What happened?

When there is an unhandled exception inside a task spawn by the SDK, the error is silently swallowed, leading to obscure, hard-to-debug bugs.

What did you expect instead?

Unhandled exceptions are at least logged or the whole Python process crashes.

Affected part(s)

Core components (data structures, etc.) (part:core)

Extra information

I discovered this while working at #806. I added a sanity check which failed inside a task that was sending messages to a channel, so no messages was sent, and other task waiting for messages just got stuck, leaving no clues about where the problem might be. This makes the problem really hard to debug.

This issue is actually a problem across all areas of the SDK and projects in general.

A way to cope with it in the SDK would be to extend BackgroundService to provide a create_task() method that automatically adds the task to the task list and then also adds a done callback where we can either log the unhandled exception, or just raise a SystemExit exception to exit the program. We could even give the user the option to decide how to handle unhandled exceptions by either passing a callback or letting them override the default callback as a method of the instance. This callback should also remove the task from the tasks list, something users need to do manually at the moment.

Related issues

The solution to this issue needs to have in mind the following related issues:

llucax commented 5 months ago

Maybe we can add a SDK specific create_task() that catches and logs unhandled exceptions, then use that function also in BackgroundService.

llucax commented 5 months ago

Oh, yeah, of course: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.set_exception_handler

llucax commented 5 months ago

It is not trivial to override the default handler, because it will store the exception in the task object, and maybe users will inspect this task object in the future and report the exception properly in the future. Because of this, we might want to use a custom exception handler that only prints a debug message about unhandled exceptions, otherwise we might spam the logs with spurious messages about tasks stopping due to unhandled exceptions when exception are really properly handled, just not inside the task function itself, adding more noise to the logs and making things even harder to debug.

According to ChatGPT:

In Python's asyncio library, when a task raises an exception that is not caught within the task itself, the default behavior is somewhat subtle and important to understand:

  1. Exception Capturing: If an exception occurs within an asyncio Task and is not caught, the exception is captured and stored in the Task object. This captured exception will not be raised until the task's result is explicitly retrieved (e.g., using await task, task.result(), or task.exception()).

  2. Logging of Unretrieved Exceptions: If a Task object is destroyed (garbage collected) while still holding an exception that was neither retrieved nor consumed, asyncio logs a warning through the Python logging module. This warning contains the exception's traceback and indicates that the exception was never retrieved. The warning is logged to the asyncio logger with a level of ERROR.

  3. Event Loop Does Not Stop: Unlike some other frameworks that might halt an event loop on unhandled exceptions, asyncio's default behavior is to continue running the event loop even if a task fails with an exception.

So if this is correct and unless we are keeping dead task object around, we should actually get a log error about swallowed exceptions. We got unreported swallowed exceptions in tests, but maybe it is a pytest issue that they are not reported (or were kept alive in tests), as pytest replaces the event loop with a testing one AFAIK.