aulemahal opened 4 years ago
I use `export OMP_NUM_THREADS=12` in my `.bashrc`, which I would have thought limited my load to 1200% CPU. This was set up when I used Matlab and Julia. However, I see from your numbers that I don't quite understand the relationship between # workers / # threads / OMP, so maybe I'm wrong.
Since dask is also multithreaded/multiprocess, it might cause some multiplication of the OMP setting?
Some guidance here maybe: https://docs.dask.org/en/latest/array-best-practices.html
Dask devs seem to suggest setting `OMP_NUM_THREADS=1`.
I had 2 workers and 4 threads per worker during my tests, and I reached more than 3000% CPU on 2 python processes. I would have to test it, but that could suggest that each worker might be allowed to use up to threads_per_worker × OMP_NUM_THREADS threads. That would mean 2 × 4800% CPU, which would fit my results?
Yes @RondeauG, I believe that this is what happens. OMP_NUM_THREADS applies per thread, so 4 dask threads with OMP=12 means a maximum of 48 threads per worker!
Also, I expected a greater difference between 2 workers × 4 threads and 1 worker × 8 threads... 1 worker does use (slightly) less memory, but the 2×4 setup seems somewhat faster.
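Spelled out, the worst case is just the product of the three settings. A back-of-the-envelope sketch (plain arithmetic, not a dask API):

```python
def max_native_threads(n_workers: int, threads_per_worker: int, omp_num_threads: int) -> int:
    """Upper bound on OS threads if every dask thread spawns its own OpenMP pool."""
    return n_workers * threads_per_worker * omp_num_threads

# RondeauG's test above: 2 workers x 4 threads, OMP_NUM_THREADS=12
print(max_native_threads(2, 4, 12))  # 96 threads, i.e. up to 9600% CPU
```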
Not totally sure if all of these apply to our servers/config, but there are MKL and OPENBLAS settings as well in the dask 'best practices' guidance, e.g.:

```
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```
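If these are set from inside a Python script rather than `.bashrc`, it is usually safer to set them before numpy/scipy are imported, since the native thread pools are typically sized when those libraries initialize. A minimal sketch, assuming a standard conda/pip environment:

```python
import os

# Cap the native thread pools before numpy/scipy/dask are imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402
import dask.array as da  # noqa: E402
```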
They might apply for other numpy processes, but `griddata` (and Qhull underneath it) seems to only depend on OpenMP. With dask, I agree with setting them all to 1. However, without dask, for smaller datasets, `griddata`'s parallelism is far more efficient than dask's.
After a few tests on my end with `xclim.sdba`, I seem to get better results (and reasonable CPU use) with:

```
export OMP_NUM_THREADS=10
client = Client(n_workers=2, threads_per_worker=1, memory_limit="10GB")
```

rather than:

```
export OMP_NUM_THREADS=1
client = Client(n_workers=2, threads_per_worker=10, memory_limit="10GB")
```
My guess is that for the parts that are not dask-optimized, `OMP_NUM_THREADS=10` is more useful, which lets the whole script run faster?
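For completeness, the first configuration as a single runnable sketch (same values as above; the environment variable is set before the scientific imports so the OpenMP pool picks it up):

```python
import os

# Give OpenMP (griddata/qhull) the threads, keep dask to 1 thread per worker.
os.environ["OMP_NUM_THREADS"] = "10"

from dask.distributed import Client  # noqa: E402

client = Client(n_workers=2, threads_per_worker=1, memory_limit="10GB")
```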
That's what I understand. With EQM or DQM, the main bottleneck is the `griddata` calls that are wrapped in `apply_ufunc`; underneath is qhull. So the dask parallelization is hard-coded by us, while scipy is using highly-optimized external code. I will try to explain this in an upcoming sdba notebook...
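For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what "`griddata` wrapped in `apply_ufunc`" looks like. The data and the helper function are toy stand-ins, not xclim's actual implementation:

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata


def _interp_on_quantiles(newx, oldx, oldy):
    # One small 1-D interpolation problem per grid point; scipy (qhull in the N-D case)
    # does the work and is free to use its own OpenMP threads.
    return griddata(oldx, oldy, newx, method="linear")


# Toy data standing in for the bias-adjustment quantities.
nlat, ntime, nq = 8, 365, 20
sim = xr.DataArray(np.random.rand(nlat, ntime), dims=("lat", "time")).chunk({"lat": 1})
quantiles = xr.DataArray(np.linspace(0, 1, nq), dims=("quantiles",))
factors = xr.DataArray(np.random.rand(nlat, nq), dims=("lat", "quantiles")).chunk({"lat": 1})

# Dask parallelizes over the lat chunks; each call into scipy is serial from dask's
# point of view.
out = xr.apply_ufunc(
    _interp_on_quantiles,
    sim,
    quantiles,
    factors,
    input_core_dims=[["time"], ["quantiles"], ["quantiles"]],
    output_core_dims=[["time"]],
    dask="parallelized",
    output_dtypes=[float],
    vectorize=True,  # loop over the non-core (lat) dimension
)
out.compute()
```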
Small update, for `xclim.sdba` at least. Chunking along a single dimension (`lat`, for example) gives me better performance than chunking along two dimensions (`lat` and `lon`). With only `lat` chunked, I get closer to that 1000% CPU more often.
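In code, the difference is just the chunking argument; a sketch with hypothetical chunk sizes:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# Chunks along a single spatial dimension (what seems to perform better here):
ds_1d = ds.chunk({"lat": 5})

# Chunks along two spatial dimensions, for comparison:
ds_2d = ds.chunk({"lat": 5, "lon": 10})
```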
Not an issue or even a real benchmark, but I thought it would be interesting to share this comparison with xclim people. In particular: @RondeauG, @huard and @tlogan2000.
I ran a basic `DetrendedQuantileMapping` adjustment on the infamous `air_temperature` tutorial dataset of xarray (code below). I am using the DQM code in PR Ouranosinc/xclim#467, which is more efficient than what is on master right now.
Initially, I was trying this because Gabriel and I had a doubt about xclim/dask/numpy respecting the `OMP_NUM_THREADS` environment variable. Turns out the flag is respected, at least on my home setup... So the following are all the exact same calculations, but with different dask / OMP configurations (a sketch of how they map to code follows the list). There are 8 cores on my laptop and the data size is around 7 MB.
- Default dask (implicit 8 threads), OMP=1
- Distributed dask, 1 worker, 8 threads, OMP=1
- Distributed dask, 2 workers, 4 threads, OMP=1
- Distributed dask, 1 worker, 4 threads, OMP=2
- No Dask, OMP=8
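Purely to illustrate how these configurations translate to code (a hypothetical rendering, not the original benchmark script), here is one of them, with a small sanity check that the worker processes actually see the OMP setting:

```python
import os

# "Distributed dask, 2 workers, 4 threads, OMP=1"
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client  # noqa: E402

client = Client(n_workers=2, threads_per_worker=4)

# Each worker should report "1" here if the flag is respected.
print(client.run(lambda: os.environ.get("OMP_NUM_THREADS")))

# "Default dask" corresponds to not creating a Client at all (the default threaded
# scheduler), and "No Dask" to loading the data and computing eagerly with numpy.
```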