Open jmichel-otb opened 2 years ago
@jmichel-otb thank you for this issue. I confirm I witnessed the very same puzzling behavior of dask in this case. Allowing users to provide some intel to the cost function would be a great addition for the remote sensing community! Cheers
Thank you for the thorough report. This is unfortunately a known issue we sometimes refer to as "root task overproduction".
While the task ordering (i.e. the priorities) is perfect, the cluster takes many other variables into account when deciding what to schedule when. At the most fundamental level, the problem arises because we only submit tasks to workers once we know a task can actually be computed, i.e. all of its dependencies are in memory. In your case, all `A()`s can be computed immediately since they have no dependencies, so the workers build up a queue of many `A`s. Once a batch finishes and a `B` is ready, it is assigned to a worker; meanwhile, the worker has already started on a new batch of `A`s. `B` will not wait for this entire queue to be worked off, of course; it will cut in line, but it will not abort the currently executing `A`s. Therefore, you'll compute at least `threads_per_worker` too many `A`s before the first reduction job `B` starts.
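The queueing behaviour described above can be illustrated with a toy model (not dask's actual scheduler code, just a sketch assuming a priority queue of ready tasks where lower numbers run first, and an illustrative `THREADS = 2`):

```python
from heapq import heapify, heappop, heappush

THREADS = 2  # stands in for threads_per_worker; illustrative value

# All A tasks are ready from the start (no dependencies), so the worker
# queues them all up at priority 1.
queue = [(1, f"A{i}") for i in range(8)]
heapify(queue)

computed = []

def run_next():
    # Pop the highest-priority ready task and "execute" it to completion;
    # once started, a task cannot be aborted.
    _, name = heappop(queue)
    computed.append(name)
    return name

# B0 needs A0..A3. While those four run and finish, the freed threads
# immediately pick up the next A's from the queue:
for _ in range(4 + THREADS):
    run_next()

# Only now does B0 become ready; it cuts in line at priority 0...
heappush(queue, (0, "B0"))
assert run_next() == "B0"

# ...but at least THREADS extra A's were already computed before the
# first reduction got a chance to run.
extra_as = sum(1 for t in computed[:-1] if t.startswith("A")) - 4
assert extra_as >= THREADS
```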
There are in principle two ways to fix this problem, and many nuances of both flavours have been proposed over time.
The most promising fix so far is the second approach; see also https://github.com/dask/distributed/issues/3974. This is on our roadmap, but I do not have an ETA for you yet. However, we recently merged a pretty significant refactoring of our worker code which lays the groundwork for this: https://github.com/dask/distributed/pull/5046
What happened:
This is a (very) simplified version of a distributed scheduler failure that happens in CARS. Although the memory needed to compute the whole task graph at once is high and the memory resources are limited, there is a task order that guarantees the computation succeeds, but the scheduler does not find it.
Minimal Complete Verifiable Example:
First, we start a distributed cluster with 2 workers, each with a single thread and 400 MB of RAM.
The A() function generates numpy arrays of 90 MB.
The B() function sums all the arrays it receives and all the elements in them. It is only meant to represent a function that consumes all its input data and outputs a single value (reduce type).
We generate 20 delayed A() tasks and 5 B() tasks, each B() consuming 4 different A() tasks. The final B() call is for graph visualization purposes only.
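The setup described above can be sketched as follows (a reconstruction, not the exact code from the report; the array size and the use of `np.ones` are assumptions for illustration):

```python
import numpy as np
from dask import delayed

def A(i):
    # Each A produces a ~90 MB array (assumption: float64 ones).
    return np.ones(90 * 2**20 // 8)

def B(*xs):
    # Reduce-type task: consume all inputs, return a single scalar.
    return float(sum(np.sum(x) for x in xs))

# 20 A tasks, grouped 4 at a time into 5 B tasks, plus a final B
# that exists only so the graph renders as a single collection.
a_tasks = [delayed(A)(i) for i in range(20)]
b_tasks = [delayed(B)(*a_tasks[4 * j : 4 * j + 4]) for j in range(5)]
final = delayed(B)(*b_tasks)

# Reproducing the failure needs the cluster from the report, e.g.:
# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster(n_workers=2, threads_per_worker=1,
#                        memory_limit="400MB")
# client = Client(cluster)
# final.compute()  # workers exceed the memory limit and get killed
```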
The task graph looks fine. The order is logical: consume 4 A() tasks, then the depending B() task, before consuming the next 4 A()s. This order ensures maximal release of the memory held by A() results.
Although our cluster has very limited resources, the task ordering in the above graph would ensure the computation succeeds.
But this is not what happens: the scheduler tries to perform all A() tasks first, which cannot succeed because they represent 1800 MB of total memory while our cluster only has 800 MB. This is shown in the dashboard graph below:
As a result, the workers restart a few times after reaching the memory limit threshold, and after that the future is marked with an error status.
The exception shows that the worker has been killed:
What you expected to happen: Of course, the scheduler cannot be aware of the amount of memory A() and B() will generate. We could use bind to try to influence the task order, at the expense of parallelism, and this practically amounts to handling task ordering by hand ... There are several things that could be improved:
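For reference, the bind workaround mentioned above could look roughly like this, using `dask.graph_manipulation.bind` to chain each batch of A tasks behind the previous reduction (a sketch with tiny arrays just to show the wiring; the batching scheme is the one from the example above):

```python
import numpy as np
from dask import delayed
from dask.graph_manipulation import bind

def A(i):
    # Tiny arrays here, just to demonstrate the graph wiring.
    return np.ones(10)

def B(*xs):
    # Reduce-type task: sum everything into a single scalar.
    return float(sum(np.sum(x) for x in xs))

a_tasks = [delayed(A)(i) for i in range(20)]

# Chain each batch of A's behind the previous reduction: batch j+1 may
# only start once B_j is done. This caps peak memory, but serialises
# the batches, at the expense of parallelism.
b_tasks, prev = [], None
for j in range(5):
    batch = a_tasks[4 * j : 4 * j + 4]
    if prev is not None:
        batch = bind(batch, prev)  # clone the batch, depending on prev
    prev = delayed(B)(*batch)
    b_tasks.append(prev)

final = delayed(B)(*b_tasks)
```

Each A sums to 10, each B over 4 A's yields 40, and the final B sums the 5 intermediate results, so `final` computes to 200.0.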
Anything else we need to know?:
The CARS problem is actually more complex than that:
Environment: