JinZhou5042 opened this issue 1 week ago
The check fails mostly because too much disk is allocated.
@JinZhou5042 What do you mean? Without cached files, and with each task using one core, the automatic allocation should be 2560 MB of memory and 12800 MB of disk. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because there are files already cached at the worker, so the disk available for the proportion is less?
I logged the resource allocation function during a run of ~8k tasks. The first figure shows that the manager made 458k task resource allocation attempts; 96.4% of them failed because the task was allocated more resources than the worker had available and therefore could not be scheduled on that worker.
The second figure breaks those unsuccessful attempts down into the three types of failure. Surprisingly, every single failure involves a disk overallocation, suggesting that the scheduler allocates disk too aggressively. The number of core overallocations equals the number of memory overallocations, which probably indicates ordinary contention: the worker is busy with other tasks and simply has no free resources left.
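The per-dimension classification behind those counts is roughly the following (a minimal sketch; the struct, field names, and tallying code are stand-ins, not the actual manager or logging code):

```c
#include <stdio.h>

/* Illustrative only: for every failed attempt, record which resource
 * dimensions were overallocated (chosen > available on that worker). */
struct resources { double cores, memory_mb, disk_mb; };

struct overalloc_counts { long cores, memory, disk; };

static void tally(struct overalloc_counts *c,
                  struct resources chosen, struct resources avail) {
    if (chosen.cores     > avail.cores)     c->cores++;
    if (chosen.memory_mb > avail.memory_mb) c->memory++;
    if (chosen.disk_mb   > avail.disk_mb)   c->disk++;
}

int main(void) {
    struct overalloc_counts counts = {0, 0, 0};

    /* Example: the failing attempt for task 861 quoted below. */
    tally(&counts, (struct resources){1, 2560, 12683},
                   (struct resources){1, 2560, 12459});

    printf("core=%ld memory=%ld disk=%ld\n",
           counts.cores, counts.memory, counts.disk); /* prints: core=0 memory=0 disk=1 */
    return 0;
}
```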
The log is attached here.
One typical failure looks like this:
```
task_id: 861
allocated_cores: 1
available_cores: 1
allocated_memory: 2560
available_memory: 2560
allocated_disk: 12683
available_disk: 12459
```
The worker has 1 core, 2560 MB of memory, and 12459 MB of disk available; however, the scheduler allocates 1 core, 2560 MB of memory, and 12683 MB of disk. With disk overallocated, the task has to be considered against another worker.
My hypothesis is that task outputs are not properly considered while allocating disk space.
Two possible improvements here:
In `check_worker_against_task`, we check whether a task is able to run on a worker. This includes two essential steps:

1. `vine_manager_choose_resources_for_task`. By default, we use a proportional technique, which chooses the max proportional cpu/memory/disk/gpu resources for the task.
2. `check_worker_have_enough_resources`. This compares the chosen cpu/memory/disk/gpu against the actual, current resource usage state on the worker.

However, the chosen resources are usually larger than the available resources, which results in the second step constantly failing until some tasks complete on the worker and release resources. Because the first step tends to choose a larger portion of resources than is available, this scheduling overhead seems to dominate the latency. I observed this by printing some debug messages in the terminal.
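For illustration, here is a minimal sketch of how the two steps can conflict, assuming the proportional step simply divides the worker's totals by the number of task slots. The real `vine_manager_choose_resources_for_task` is more involved, and the structs and worker numbers below are made up for the example.

```c
#include <stdio.h>

/* Illustrative resource bundle; not the actual taskvine structs. */
struct resources { double cores, memory_mb, disk_mb; };

/* Stand-in for step 1 (vine_manager_choose_resources_for_task):
 * take a proportional share of the worker's *total* resources. */
static struct resources choose_proportional(struct resources total, int slots) {
    struct resources chosen = {
        total.cores / slots,
        total.memory_mb / slots,
        total.disk_mb / slots,
    };
    return chosen;
}

/* Stand-in for step 2 (check_worker_have_enough_resources):
 * compare the chosen bundle against what the worker has *free* right now. */
static int have_enough(struct resources chosen, struct resources free_now) {
    return chosen.cores     <= free_now.cores
        && chosen.memory_mb <= free_now.memory_mb
        && chosen.disk_mb   <= free_now.disk_mb;
}

int main(void) {
    struct resources total    = { 16, 40960, 204800 }; /* hypothetical 16-slot worker */
    struct resources free_now = { 1, 2560, 12459 };    /* one idle slot, disk partly used */

    struct resources chosen = choose_proportional(total, 16);
    /* chosen.disk_mb = 12800 > 12459 MB free, so the check fails on disk
     * even though a core and its share of memory are sitting idle. */
    printf("fits: %s\n", have_enough(chosen, free_now) ? "yes" : "no");
    return 0;
}
```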
To be clearer, I ran a small version of DV5 with the current version (without any change) and with my test version (which selects a much smaller portion of resources in `vine_manager_choose_resources_for_task`), separately. The differences, with a sketch of the idea after this list:

- current version: resource allocation is expensive and tasks run very sparsely
- test version: resource allocation mostly succeeds and task concurrency looks good
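Here is a minimal sketch of one way to select a smaller portion: base the share on what the worker currently has free rather than on its totals, so the follow-up availability check can actually pass while other tasks are running. This is only an illustration of the idea, not the exact change used in the test run, and the names are the same stand-ins as in the earlier sketch.

```c
#include <stdio.h>

/* Illustrative resource bundle; not the actual taskvine structs. */
struct resources { double cores, memory_mb, disk_mb; };

/* One interpretation of "select a much smaller portion": divide the
 * worker's currently free resources, not its totals, among the open slots. */
static struct resources choose_from_free(struct resources free_now, int open_slots) {
    struct resources chosen = {
        free_now.cores     / open_slots,
        free_now.memory_mb / open_slots,
        free_now.disk_mb   / open_slots,
    };
    return chosen;
}

int main(void) {
    /* Worker state from the failing attempt above: one idle 1-core slot. */
    struct resources free_now = { 1, 2560, 12459 };
    struct resources chosen   = choose_from_free(free_now, 1);

    /* Now the chosen disk (12459 MB) cannot exceed what is free. */
    printf("chosen: %.0f core, %.0f MB memory, %.0f MB disk\n",
           chosen.cores, chosen.memory_mb, chosen.disk_mb);
    return 0;
}
```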
In total there are 8801 tasks, and 99% of them finish within 10 s. Factory configuration:
That said, I think there is room to reduce the scheduling latency by combining a set of techniques: