Open larsoner opened 1 year ago
@larsoner Thank you for open this issue. Could you provide a minimal reproducible example that we can run and check? Is this problem still present in the most recent version of dask and distributed?
Could you provide a minimal reproducible example that we can run and check?
Yes one is in the top comment, I just collapsed it with a <details>
tag in my post
Is this problem still present in the most recent version of dask and distributed?
It looks like it is indeed fixed on 2022.10.0!
Maybe still worth leaving open as a DOC update to say what gets set by the threads_per_worker
somewhere in case this crops up again somehow/somewhere? If not, feel free to close.
Yes one is in the top comment, I just collapsed it with a
<details>
tag in my post
Whoops, I missed this.
Is this problem still present in the most recent version of dask and distributed? It looks like it is indeed fixed on 2022.10.0! Maybe still worth leaving open as a DOC update to say what gets set by the
threads_per_worker
somewhere in case this crops up again somehow/somewhere? If not, feel free to close.
I see we have documented this in LocalCluster. See https://distributed.dask.org/en/stable/api.html?highlight=LocalCluster#distributed.LocalCluster where it says:
threads_per_worker: int
Number of threads per each worker
Is this what you were looking for, or do you have anything else in mind?
Is this what you were looking for, or do you have anything else in mind?
Just more detail about how/what it actually sets so that the scope of the thread control can be inferred. For example, if it's documented (somewhere) what this parameter controls, in the future if someone sees CPU oversubscription, they can better determine if:
dask
but isn't (e.g., if dask
iterates over threadpoolctl
-detected endpoints and sets them all, but something threadpoolctl
-able is observed not to be limited)dask
, but maybe it should (e.g., if dask
uses some potentially incomplete inspection/setting method that could be extended to support something else)dask
, and shouldn't be (e.g., some non-OpenMP/custom threading code in a some C++ code called from Python)@jrbourbeau Do you know if we have more documentation about this? I was not able to find it but I might be missing something.
Thanks @larsoner. I think updating the threads_per_worker
description to be
threads_per_worker: int
The number of threads in each worker's ``ThreadPoolExecutor`` where tasks are executed.
would make things more clear.
Do you know if we have more documentation about this?
We have this best practice on avoiding oversubscribing threads https://docs.dask.org/en/stable/array-best-practices.html#avoid-oversubscribing-threads.
@ncclementi are you interested in updating the LocalCluster
docstring?
I can do it and maybe add some cross refs to the best practices in the places I looked for information
@larsoner go for it, feel free to ping me for review.
Describe the issue:
Dask
Client
withthreads_per_worker=1
does not limit all thread types, so it can be violated for example by sklearn FastICA via OpenMP.Minimal Complete Verifiable Example:
On my macOS M1 machine (with 10 CPUs) I get:
Which is driven by oversubscription of CPUs -- 30 threads get spawned because
n_workers=3
and no limit is set on the OpenMP number of threads.However, uncommenting the
threadpoolctl.threadpool_limits(limits=1, user_api='openmp')
line above within thefit_ica
function, I get much better timings:Anything else we need to know?:
I understand that a solution to
threads_per_worker) might be (endlessly) complicated -- although maybe some iteration over
threadpoolctl`-discovered interfaces could be used? -- so in lieu of that, maybe docs could be updated to give people hints. I'm thinking specifically:threads_per_worker
actually controls/sets and how, e.g. in LocalCluster.threads_per_worker
docstring to some separate doc/FAQ that talk about OMP_NUM_THREADS, MKL_NUM_THREADS, threadpoolctl, etc. methods of controlling execution.Apologies if these are already documented somewhere and I missed it!
Environment: