Open · rafa-be opened this issue 1 week ago
The problem is due to the scheduler waiting on two blocking queues: the task queue and the worker queue.

@sharpener6 As I said, I'd like to make the worker queue non-blocking while implementing task tagging (#32). This will eliminate the bug and simplify task scheduling.

This requires a small behavior change: when no worker is connected to the scheduler, the client will receive a `TaskStatus.NoWorker` error, whereas the task is currently queued until a worker connects.
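A minimal sketch of that proposed behavior, assuming a non-blocking worker pool; `WorkerPool`, `try_acquire()` and the `Success` status are illustrative names, not Scaler's actual API:

```python
import enum
from typing import Optional


class TaskStatus(enum.Enum):
    Success = "success"
    NoWorker = "no_worker"  # proposed status for "no worker connected"


class WorkerPool:
    def __init__(self) -> None:
        self._idle_workers: list[str] = []

    def register(self, worker_id: str) -> None:
        self._idle_workers.append(worker_id)

    def try_acquire(self) -> Optional[str]:
        # Non-blocking: returns a worker immediately, or None if none is connected.
        return self._idle_workers.pop() if self._idle_workers else None


def on_task_new(pool: WorkerPool, task_id: bytes) -> TaskStatus:
    worker = pool.try_acquire()
    if worker is None:
        # Today the task would be queued until a worker connects; with a
        # non-blocking pool the client receives an immediate error instead.
        return TaskStatus.NoWorker
    # ... route task_id to `worker` here ...
    return TaskStatus.Success
```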
When a client immediately cancels a task it just submitted, the scheduler will sometimes return a `TaskStatus.NotFound` message and log an error.

## How to reproduce

The bug occurs systematically when the scheduler has no worker registered. It can also occur with registered workers, but with a lower probability.
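A minimal reproduction sketch, assuming a scheduler is already running at the address below with no worker attached; the address, the `noop` function, and the exact client calls are illustrative, not taken from the original report:

```python
from scaler import Client


def noop(x: int) -> int:
    return x


# Assumes a scheduler is listening on this address with no worker registered.
with Client(address="tcp://127.0.0.1:2345") as client:
    future = client.submit(noop, 1)
    # Cancel immediately, before the scheduler can assign the task; with no
    # worker connected this reliably hits the race described under "Cause".
    future.cancel()
```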
## Cause

The bug is in the implementation of the scheduler's task manager routine:
https://github.com/Citi/scaler/blob/f4b4daaa9aac218f710cb03ea7fd09fdd67660d7/scaler/scheduler/task_manager.py#L55-L67
The scheduler maintains two collections for tasks, `_unassigned` and `_running`. As `assign_task_to_worker()` might block (if there is no worker) or yield to another async task, there is a window during which the task being scheduled has already been removed from `_unassigned` but not yet added to `_running`. A `TaskCancel` message received in that window raises a task-not-found error, as the task is in neither collection.
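A self-contained illustration of the race, simplified and not the actual `task_manager.py` code; the class layout and the `demo()` driver only mirror the names used above:

```python
import asyncio


class TaskManagerSketch:
    def __init__(self) -> None:
        self._unassigned: dict[bytes, object] = {}
        self._running: dict[bytes, tuple[str, object]] = {}
        self._worker_available = asyncio.Event()  # never set here: no worker

    async def assign_task_to_worker(self, task: object) -> str:
        # Suspends indefinitely when no worker is connected.
        await self._worker_available.wait()
        return "worker-0"

    async def on_task_new(self, task_id: bytes, task: object) -> None:
        self._unassigned[task_id] = task
        pending = self._unassigned.pop(task_id)
        # --- race window: the task is in neither collection while we await ---
        worker = await self.assign_task_to_worker(pending)
        self._running[task_id] = (worker, pending)

    def on_task_cancel(self, task_id: bytes) -> str:
        if task_id in self._unassigned or task_id in self._running:
            return "cancelled"
        return "TaskStatus.NotFound"  # what the client observes in the window


async def demo() -> None:
    manager = TaskManagerSketch()
    submit = asyncio.create_task(manager.on_task_new(b"task-1", object()))
    await asyncio.sleep(0)  # let on_task_new run up to the blocked await
    print(manager.on_task_cancel(b"task-1"))  # -> TaskStatus.NotFound
    submit.cancel()


asyncio.run(demo())
```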
## Possible fix

Adding the task to `_running` before assigning it to a worker does not work, as it breaks task cancellation: the scheduler does not know to which worker to route the `TaskCancel` message.

Instead, the scheduler could, in this order (sketched after the list):

1. acquire a worker;
2. add the task to `_running`;
3. send the `Task` object to the acquired worker.

This requires some refactoring of the scheduler.
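A minimal sketch of that ordering, assuming the task stays in `_unassigned` while waiting for a worker; the worker queue and the `_send_task_to_worker()` helper are hypothetical names, not an actual patch:

```python
import asyncio


class TaskManagerSketch:
    def __init__(self) -> None:
        self._unassigned: dict[bytes, object] = {}
        self._running: dict[bytes, tuple[str, object]] = {}
        self._idle_workers: asyncio.Queue[str] = asyncio.Queue()

    async def on_task_new(self, task_id: bytes, task: object) -> None:
        self._unassigned[task_id] = task
        # 1. Acquire a worker; the task stays in _unassigned while this waits,
        #    so a TaskCancel arriving now still finds it.
        worker = await self._idle_workers.get()
        # 2. Move the task to _running; no await between the pop and the add,
        #    so there is no window in which the task is in neither collection.
        self._running[task_id] = (worker, self._unassigned.pop(task_id))
        # 3. Send the Task object to the acquired worker.
        await self._send_task_to_worker(worker, task)

    async def _send_task_to_worker(self, worker: str, task: object) -> None:
        ...  # placeholder for the actual routing
```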