Citi / scaler

Efficient, lightweight and reliable distributed computation engine
Apache License 2.0

Scheduler incorrectly returns a task not found error #45

Open rafa-be opened 1 week ago

rafa-be commented 1 week ago

When a client immediately cancels a task it has just submitted, the scheduler sometimes returns a TaskStatus.NotFound message and logs an error message such as:

[EROR]2024-11-20 22:47:58+0100: cannot find task_id=1ad3d9caa78911ef99be965046454255 in task workers

How to reproduce

The bug always occurs when the scheduler has no registered worker. It can also occur with registered workers, but with a lower probability.

import math
from scaler import Client
with Client("tcp://127.0.0.1:60338") as client:
    client.submit(math.sqrt, 16).cancel()

Cause

The bug is in the implementation of the scheduler's task manager routine:

https://github.com/Citi/scaler/blob/f4b4daaa9aac218f710cb03ea7fd09fdd67660d7/scaler/scheduler/task_manager.py#L55-L67

The scheduler maintains two collections for tasks, _unassigned and _running.

Because assign_task_to_worker() may block (when there is no worker) or yield to another async task, there is a window in which the task being scheduled has been removed from _unassigned but not yet added to _running. A TaskCancel message received in that window raises a task not found error, because the task is in neither collection.
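
The window described above can be reproduced with a small asyncio model. This is only a sketch: RacyTaskManager, its methods, and the string statuses are hypothetical stand-ins, not Scaler's actual code, and the worker queue is deliberately left empty to mimic a scheduler with no registered worker.

```python
import asyncio
import contextlib


class RacyTaskManager:
    """Hypothetical stand-in for the task manager; not Scaler's actual code."""

    def __init__(self):
        self._queue = asyncio.Queue()    # order in which tasks are scheduled
        self._unassigned = set()         # ids of tasks not yet picked up
        self._running = set()            # ids of tasks handed to a worker
        self._workers = asyncio.Queue()  # available workers (stays empty here)

    def on_task(self, task_id):
        self._unassigned.add(task_id)
        self._queue.put_nowait(task_id)

    async def routine(self):
        task_id = await self._queue.get()
        self._unassigned.discard(task_id)  # the task leaves _unassigned...
        await self._workers.get()          # ...then this blocks: no worker
        self._running.add(task_id)         # not reached until a worker arrives

    def on_task_cancel(self, task_id):
        if task_id in self._unassigned or task_id in self._running:
            return "Canceled"
        return "NotFound"


async def demo():
    manager = RacyTaskManager()
    routine = asyncio.create_task(manager.routine())
    manager.on_task("t1")
    await asyncio.sleep(0.01)              # let the routine dequeue the task
    status = manager.on_task_cancel("t1")  # arrives inside the window
    routine.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await routine
    return status
```

In this model, asyncio.run(demo()) returns "NotFound": the cancel arrives while the routine is suspended between the two collections.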

Possible fix

Adding the task to _running before assigning it to a worker does not work, as it breaks task cancellation: the scheduler does not know to which worker to route the TaskCancel message.

Instead, the scheduler could perform the following steps, in this order:

  1. acquire an available worker;
  2. wait on the task queue for the next task;
  3. immediately add the task to _running;
  4. send the Task object to the acquired worker.

This requires some refactoring of the scheduler.
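
The ordering proposed above could look roughly like the sketch below. All names (ReorderedTaskManager, sent, the string statuses) are hypothetical, and skipping already-canceled tasks inside the loop is an assumption of this sketch rather than something the scheduler is known to do. The point is that no await separates removing the task from _unassigned and adding it to _running.

```python
import asyncio


class ReorderedTaskManager:
    """Hypothetical sketch of the proposed ordering; not Scaler's actual code."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self._unassigned = set()
        self._running = {}               # task id -> worker it was sent to
        self._workers = asyncio.Queue()
        self.sent = []                   # stands in for sending Task messages

    def on_task(self, task_id):
        self._unassigned.add(task_id)
        self._queue.put_nowait(task_id)

    async def routine(self):
        worker = await self._workers.get()         # 1. acquire a worker
        while True:
            task_id = await self._queue.get()      # 2. wait for the next task
            if task_id in self._unassigned:
                break                              # skip tasks canceled meanwhile
        self._unassigned.discard(task_id)
        self._running[task_id] = worker            # 3. no await since step 2
        self.sent.append((worker, task_id))        # 4. send the task

    def on_task_cancel(self, task_id):
        if task_id in self._unassigned:
            self._unassigned.discard(task_id)
            return "Canceled"
        if task_id in self._running:
            return f"CancelSentTo:{self._running[task_id]}"
        return "NotFound"


async def demo():
    manager = ReorderedTaskManager()
    routine = asyncio.create_task(manager.routine())
    manager.on_task("t1")
    await asyncio.sleep(0.01)                  # routine is waiting for a worker
    first = manager.on_task_cancel("t1")       # t1 is still in _unassigned
    manager._workers.put_nowait("w1")
    manager.on_task("t2")
    await routine                              # assigns t2 to w1
    second = manager.on_task_cancel("t2")      # routed to the owning worker
    return first, second
```

With this ordering, a cancel that arrives while no worker is available finds the task in _unassigned, and a cancel that arrives after assignment knows which worker to notify.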

rafa-be commented 4 days ago

The problem is due to the scheduler waiting on two blocking queues: the task queue and the worker queue.

@sharpener6 As I said, I'd like to make the worker queue non-blocking while implementing task tagging (#32). This will remove this bug, and will simplify task scheduling.

This requires a small behavior change: when no worker is connected to the scheduler, the client will receive a TaskStatus.NoWorker error, whereas currently the task is queued until a worker connects.
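
A minimal sketch of that behavior, assuming a non-blocking worker collection. The TaskStatus enum below is a local, hypothetical subset defined only for this example (NoWorker is the status proposed above), and NonBlockingAssigner is an invented name, not part of Scaler.

```python
import enum
from collections import deque


class TaskStatus(enum.Enum):
    """Hypothetical subset of statuses, defined locally for this sketch."""
    Running = "running"
    NoWorker = "no_worker"


class NonBlockingAssigner:
    """Assigns a task immediately or rejects it; never waits for a worker."""

    def __init__(self):
        self._workers = deque()   # available workers, checked without blocking
        self._running = {}        # task id -> worker it was sent to

    def add_worker(self, worker):
        self._workers.append(worker)

    def on_task(self, task_id):
        if not self._workers:
            # Proposed behavior change: fail fast instead of queueing the task.
            return TaskStatus.NoWorker
        worker = self._workers.popleft()
        self._running[task_id] = worker   # the task is always in a known state
        return TaskStatus.Running


assigner = NonBlockingAssigner()
status_without_worker = assigner.on_task("t1")
assigner.add_worker("w1")
status_with_worker = assigner.on_task("t2")
```

Because on_task never suspends, the task cannot be observed in an intermediate state: it is either rejected or recorded in _running before any other message is handled.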