cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

vine: scheduling inefficiency b/c task resource pre-allocation usually fails #3995

Open JinZhou5042 opened 1 week ago

JinZhou5042 commented 1 week ago

In check_worker_against_task, we check whether a task is able to run on a given worker. This involves two essential steps (a minimal sketch of both follows the list):

  1. Estimate the resources to allocate to the task, in vine_manager_choose_resources_for_task. By default we use a proportional technique, which chooses the maximum proportional cpu/memory/disk/gpu allocation for the task.
  2. Check whether the chosen resources fit on the actual worker, in check_worker_have_enough_resources. This compares the chosen cpu/memory/disk/gpu against the worker's current resource usage.
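Here is that sketch, with illustrative names and a simplified resource model; gpus are omitted and this is not the actual cctools code, just the shape of the logic:

#include <stdint.h>

struct resources {
    int64_t cores;
    int64_t memory; /* MB */
    int64_t disk;   /* MB */
};

/* Step 1: proportional choice. Take the largest share of the worker implied
 * by any single requested resource and apply that same share to all of them. */
static struct resources choose_proportionally(struct resources request,
                                              struct resources worker_total)
{
    double p = (double) request.cores / worker_total.cores;
    double p_mem  = (double) request.memory / worker_total.memory;
    double p_disk = (double) request.disk / worker_total.disk;
    if (p_mem > p)  p = p_mem;
    if (p_disk > p) p = p_disk;

    struct resources chosen = {
        .cores  = (int64_t) (p * worker_total.cores),
        .memory = (int64_t) (p * worker_total.memory),
        .disk   = (int64_t) (p * worker_total.disk),
    };
    return chosen;
}

/* Step 2: fit check. The chosen allocation must fit in what the worker
 * currently has free; otherwise the task is considered against another worker. */
static int fits_on_worker(struct resources chosen, struct resources worker_free)
{
    return chosen.cores  <= worker_free.cores
        && chosen.memory <= worker_free.memory
        && chosen.disk   <= worker_free.disk;
}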

However, the chosen resources are usually larger than what is available, so the second step fails repeatedly until some tasks complete on the worker and release resources. Because the first step tends to choose a larger share of resources than is available, this retrying appears to dominate the scheduling latency. I observed this by printing debug messages in the terminal.

To make this concrete, I ran a small version of DV5 twice: once with the current version (without any change) and once with a test version that selects a much smaller share of resources in vine_manager_choose_resources_for_task. The comparison below shows their differences:

In total there are 8801 tasks; 99% of them finish within 10s. Factory configuration:

{
    "manager-name": "jzhou24-hgg2",
    "max-workers": 32,
    "min-workers": 16,
    "workers-per-cycle": 16,
    "condor-requirements":"((has_vast))",
    "cores": 4,
    "memory": 10*1024,
    "disk": 50*1024,
    "timeout": 36000
}
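For reference, each worker provisioned by this factory offers 4 cores, 10*1024 = 10240 MB of memory, and 50*1024 = 51200 MB of disk. A quick back-of-the-envelope of what the proportional technique implies for a one-core task on such a worker (assuming nothing is cached or already running on it):

#include <stdio.h>

int main(void)
{
    int cores     = 4;
    int memory_mb = 10 * 1024; /* 10240 */
    int disk_mb   = 50 * 1024; /* 51200 */

    /* A one-core task receives a 1/4 proportional share of memory and disk. */
    printf("memory per core: %d MB\n", memory_mb / cores); /* 2560 */
    printf("disk per core:   %d MB\n", disk_mb / cores);   /* 12800 */
    return 0;
}

These match the auto memory/disk figures discussed in the comments below (assuming no files are cached on the worker).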

That said, I think there is room to reduce the scheduling latency by combining a set of techniques.

JinZhou5042 commented 1 week ago

The check fails mostly because too much disk is allocated.

btovar commented 2 days ago

@JinZhou5042 What do you mean? Without cached files, and with each task using one core, auto memory should be 2560 and disk 12800. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because there are files already cached at the worker, so the disk available for the proportion is less?

JinZhou5042 commented 2 days ago

I logged the resource allocation function from a run comprising ~8k tasks. The first figure shows that the manager made 458k task resource allocation attempts; 96.4% of them failed because the task was allocated more resources than the worker had available, so it could not be scheduled on that particular worker.

[figure: resource allocation attempts, successful vs. failed]

The second figure breaks the unsuccessful attempts down into three types of failure. Surprisingly, every single failure involves a disk overallocation, meaning that the scheduler might allocate disk too aggressively. The number of core overallocations equals the number of memory overallocations, probably indicating ordinary contention: the worker is busy with other tasks and has no extra free resources.

[figure: breakdown of failure types: disk, core, and memory overallocation]

The log is attached here.

One typical failure looks like this:

task_id: 861

allocated_cores: 1
available_cores: 1

allocated_memory: 2560
available_memory: 2560

allocated_disk: 12683
available_disk: 12459

The worker has 1 core, 2560 MB of memory, and 12459 MB of disk available; however, the scheduler allocates 1 core, 2560 MB of memory, and 12683 MB of disk. The disk is overallocated, so the task has to be considered against another worker.

My hypothesis is that task outputs are not properly considered while allocating disk space.
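One direction that might help, sketched here purely as an illustration and not as the current cctools behavior: when the task declares no explicit disk requirement, cap the proportionally chosen disk at whatever the worker currently has free, so that the fit check cannot fail on disk alone.

#include <stdint.h>

/* Illustrative only: cap an automatic (proportional) disk choice at the
 * worker's currently free disk. An explicit user request is left untouched
 * so the fit check can still reject workers that are genuinely too small. */
static int64_t cap_auto_disk(int64_t proportional_disk,
                             int64_t worker_free_disk,
                             int64_t requested_disk /* <= 0 if unspecified */)
{
    if (requested_disk > 0) {
        return proportional_disk;
    }
    return proportional_disk < worker_free_disk ? proportional_disk : worker_free_disk;
}

Whether this is safe depends on how task outputs and cache growth are accounted for, which is exactly the open question above.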

JinZhou5042 commented 2 days ago

Two possible improvements here: