dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.57k stars 715 forks source link

Reduce activity when CPU is under load #4053

Open mrocklin opened 4 years ago

mrocklin commented 4 years ago

Oversubscription of threads is a common problem, especially when tasks call routines that are themselves multi-threaded. The most common culprit of this is BLAS/LAPACK calls.

A couple of years ago I think that the Joblib folks did something clever, they polled the process to see if it was oversubscribed, and if it was, it slowed down work. We already check CPU metrics regularly with psutil. The Worker.ensure_computing method could choose to hold off on submitting new tasks if it notices that

  1. The local process' CPU load is already above where it should be.
  2. There is at least one task already running (we need to ensure that we don't get into a situation where we block all forward progress)

This has come up several times, but most recently came up in a post here: https://coiled.io/blog/bomb-detection-with-dask-and-machine-learning/

https://github.com/dask/distributed/blob/67a9a5963b757835d185c7b202f6895069934f97/distributed/worker.py#L2465-L2499

jakirkham commented 3 years ago

Might also be worth looking at threadpoolctl.

Edit: This is where scikit-learn devs refactored out all this threading logic.

hammer commented 3 years ago

I think that the Joblib folks did something clever

I went looking for what they did and found CPU over-subscription by joblib.Parallel due to BLAS which is an elaboration of https://github.com/joblib/joblib/issues/834, and which https://github.com/joblib/joblib/pull/940 claims to fix.