fmaussion opened this issue 5 years ago
As long as we can say for sure that no task will ever access data from another glacier, that can be done relatively easily.
So the only way I really see to approach this is to make sure all tasks for the same glacier are sent to the same worker process. Otherwise it would require a complex inter-process dependency system that makes sure a task is only executed once its "parent task" is done, which is far from trivial.
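For illustration, here is a minimal sketch (not OGGM code, and not what the eventual solution has to look like) of the first option: each worker process owns a queue, and a glacier id is hashed onto a fixed queue, so all of that glacier's tasks run in the same process, in submission order. `run_pinned` and the `(task, gdir_id)` pairs are made-up placeholders.

```python
# Minimal sketch of pinning all tasks for one glacier to the same worker.
import multiprocessing as mp

def worker(queue):
    # Drain our own queue until the None sentinel arrives; tasks for a given
    # glacier always land here, so they run in this process, in order.
    for task, gdir_id in iter(queue.get, None):
        task(gdir_id)  # tasks must be picklable (module-level functions)

def run_pinned(task_list, n_workers=4):
    """task_list: iterable of (task, gdir_id) pairs (hypothetical format)."""
    queues = [mp.Queue() for _ in range(n_workers)]
    procs = [mp.Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    for task, gdir_id in task_list:
        # Same glacier id -> same queue -> same worker process.
        queues[hash(gdir_id) % n_workers].put((task, gdir_id))
    for q in queues:
        q.put(None)  # shut the workers down
    for p in procs:
        p.join()
```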
I can't immediately see anything in Dask that would be useful here, given that we don't intend to make OGGM itself use it for computation.
Thanks for looking into this! But it's OK if it doesn't work out.
There are three use cases:
So that's only one use case, not very useful.
The major reason I wanted to use dask is to have access to the shiny web interface :heart_eyes:. Here again, more of a "nice to have" than really necessary.
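For reference, getting to that dashboard only takes a local `Client`; a minimal sketch, assuming dask-distributed is installed (nothing OGGM-specific here):

```python
# Start a local dask-distributed client just to get the monitoring dashboard.
from dask.distributed import Client

client = Client()             # spins up a local cluster
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status
```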
OK, here is finally the right way to use dask in OGGM:
https://examples.dask.org/applications/embarrassingly-parallel.html
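In short, the pattern from that page maps one future to one glacier; a minimal sketch, where `process_glacier` is a hypothetical stand-in for an OGGM entity task:

```python
# The "embarrassingly parallel" pattern: one future per glacier,
# gathered at the end.
from dask.distributed import Client

client = Client()  # local cluster; a distributed scheduler works the same way

def process_glacier(gdir):
    # hypothetical stand-in for one independent per-glacier task
    return gdir

gdirs = list(range(100))                      # stand-in for glacier directories
futures = client.map(process_glacier, gdirs)  # submit one task per glacier
results = client.gather(futures)              # block until everything is done
```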
We will need this soon, either on the cloud or HPC, as dask-distributed will make our lives easier. cc @TimoRoth
Dask distributed (https://distributed.dask.org/en/latest/) also works with SLURM, for example.
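A sketch of what the SLURM side could look like with the separate dask-jobqueue package; all resource numbers below are placeholders:

```python
# Run dask workers as SLURM jobs via dask-jobqueue.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=24, memory='60GB',
                       queue='normal', walltime='02:00:00')
cluster.scale(jobs=4)     # submit 4 SLURM jobs, each running dask workers
client = Client(cluster)  # from here on, same API as a local cluster
```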
Currently, `execute_entity_task` takes a list of gdirs and applies one single task to them. The next task will start only once all previous gdirs have been worked through. This leads to suboptimal workload in the time between the end of the previous task and the start of the new one.

We deal with this on the cluster by sorting the glaciers by size before processing, so that the last glaciers to be worked on are run through quickly. This works quite efficiently in the case of many glaciers, but not so much when running on small sample sizes, i.e. 1 to 10 times the number of processors, which is frequent in test / development situations.
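For reference, the sorting workaround is a one-liner; a sketch, assuming the gdirs expose an area attribute like OGGM's `rgi_area_km2` (`gdirs` and `task` as in the text above):

```python
from oggm import workflow

# Biggest glaciers start first, so only small, fast ones remain at the
# tail end of the run.
gdirs = sorted(gdirs, key=lambda gd: gd.rgi_area_km2, reverse=True)
workflow.execute_entity_task(task, gdirs)
```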
The solution to this problem (and removing the need for sorting) is to add a `delayed` option to the scheduler. With `delayed=True`, no computation is done until a call to `workflow.compute()` is made. It is the user's choice when the compute must take place (i.e. after all the tasks have been defined). The scheduler then knows that it can start the next task while the previous one is still running on other glaciers. The only complexity is that the scheduler will have to track the tasks on each glacier to make sure not to start a task too early on a glacier which is still computing.

While we are at it, we could look into Dask (http://docs.dask.org) to see if it can offer us some nice scheduling tools (mostly I am interested in their workload monitoring tool, not so much the rest, which is overkill).