cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

vine: scheduling inefficiency b/c task resource pre-allocation usually fails #3995

Open JinZhou5042 opened 1 week ago

JinZhou5042 commented 1 week ago

In check_worker_against_task, we check whether a task is able to run on a given worker. This involves two essential steps (a minimal sketch of both follows the list):

  1. Estimate the resources to allocate to the task, in vine_manager_choose_resources_for_task. By default we use a proportional technique, which chooses the maximum proportional cpu/memory/disk/gpu allocation for the task.
  2. Check whether the chosen resources fit on the actual worker, in check_worker_have_enough_resources. This compares the chosen cpu/memory/disk/gpu against the worker's current resource usage.
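Here is that sketch, with illustrative names and a simplified resource model; gpus are omitted and this is not the actual cctools code, just the shape of the logic:

#include <stdint.h>

struct resources {
    int64_t cores;
    int64_t memory; /* MB */
    int64_t disk;   /* MB */
};

/* Step 1: proportional choice. Take the largest share of the worker implied
 * by any single requested resource and apply that same share to all of them. */
static struct resources choose_proportionally(struct resources request,
                                              struct resources worker_total)
{
    double p = (double) request.cores / worker_total.cores;
    double p_mem  = (double) request.memory / worker_total.memory;
    double p_disk = (double) request.disk / worker_total.disk;
    if (p_mem > p)  p = p_mem;
    if (p_disk > p) p = p_disk;

    struct resources chosen = {
        .cores  = (int64_t) (p * worker_total.cores),
        .memory = (int64_t) (p * worker_total.memory),
        .disk   = (int64_t) (p * worker_total.disk),
    };
    return chosen;
}

/* Step 2: fit check. The chosen allocation must fit in what the worker
 * currently has free; otherwise the task is considered against another worker. */
static int fits_on_worker(struct resources chosen, struct resources worker_free)
{
    return chosen.cores  <= worker_free.cores
        && chosen.memory <= worker_free.memory
        && chosen.disk   <= worker_free.disk;
}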

However, the chosen resources are usually larger than what is available, so the second step fails repeatedly until some tasks complete on the worker and release resources. Because the first step tends to choose a larger share of resources than is available, this retrying appears to dominate the scheduling latency. I observed this by printing debug messages in the terminal.

To make this concrete, I ran a small version of DV5 twice: once with the current version (without any change) and once with a test version that selects a much smaller share of resources in vine_manager_choose_resources_for_task. The comparison below shows their differences:

In total there are 8801 tasks; 99% of them finish within 10s. Factory configuration:

{
    "manager-name": "jzhou24-hgg2",
    "max-workers": 32,
    "min-workers": 16,
    "workers-per-cycle": 16,
    "condor-requirements":"((has_vast))",
    "cores": 4,
    "memory": 10*1024,
    "disk": 50*1024,
    "timeout": 36000
}
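For reference, each worker provisioned by this factory offers 4 cores, 10*1024 = 10240 MB of memory, and 50*1024 = 51200 MB of disk. A quick back-of-the-envelope of what the proportional technique implies for a one-core task on such a worker (assuming nothing is cached or already running on it):

#include <stdio.h>

int main(void)
{
    int cores     = 4;
    int memory_mb = 10 * 1024; /* 10240 */
    int disk_mb   = 50 * 1024; /* 51200 */

    /* A one-core task receives a 1/4 proportional share of memory and disk. */
    printf("memory per core: %d MB\n", memory_mb / cores); /* 2560 */
    printf("disk per core:   %d MB\n", disk_mb / cores);   /* 12800 */
    return 0;
}

These match the auto memory/disk figures discussed in the comments below (assuming no files are cached on the worker).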

That said, I think there is room to reduce the scheduling latency by combining a set of techniques.

JinZhou5042 commented 1 week ago

The check fails mostly because too much disk is allocated.

btovar commented 2 days ago

@JinZhou5042 What do you mean? Without cached files, and with each task using one core, auto memory should be 2560 and disk 12800. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because there are files already cached at the worker, so the disk available for the proportion is less?

JinZhou5042 commented 2 days ago

I logged the resource allocation function from a run comprising ~8k tasks. The first figure shows that the manager made 458k task resource allocation attempts; 96.4% of them failed because the task was allocated more resources than the worker had available, so it could not be scheduled on that particular worker.

[figure: resource allocation attempts, successful vs. failed]

The second figure breaks the unsuccessful attempts down into three types of failure. Surprisingly, every single failure involves a disk overallocation, meaning that the scheduler might allocate disk too aggressively. The number of core overallocations equals the number of memory overallocations, probably indicating ordinary contention: the worker is busy with other tasks and has no extra free resources.

[figure: breakdown of failure types: disk, core, and memory overallocation]

The log is attached here.

One typical failure looks like this:

task_id: 861

allocated_cores: 1
available_cores: 1

allocated_memory: 2560
available_memory: 2560

allocated_disk: 12683
available_disk: 12459

The worker has 1 core, 2560 MB of memory, and 12459 MB of disk available; however, the scheduler allocates 1 core, 2560 MB of memory, and 12683 MB of disk. The disk is overallocated, so the task has to be considered against another worker.

My hypothesis is that task outputs are not properly considered while allocating disk space.
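One direction that might help, sketched here purely as an illustration and not as the current cctools behavior: when the task declares no explicit disk requirement, cap the proportionally chosen disk at whatever the worker currently has free, so that the fit check cannot fail on disk alone.

#include <stdint.h>

/* Illustrative only: cap an automatic (proportional) disk choice at the
 * worker's currently free disk. An explicit user request is left untouched
 * so the fit check can still reject workers that are genuinely too small. */
static int64_t cap_auto_disk(int64_t proportional_disk,
                             int64_t worker_free_disk,
                             int64_t requested_disk /* <= 0 if unspecified */)
{
    if (requested_disk > 0) {
        return proportional_disk;
    }
    return proportional_disk < worker_free_disk ? proportional_disk : worker_free_disk;
}

Whether this is safe depends on how task outputs and cache growth are accounted for, which is exactly the open question above.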

JinZhou5042 commented 2 days ago

Two possible improvements here: