aulemahal opened 4 years ago
I use `export OMP_NUM_THREADS=12` in my `.bashrc`, which I would have thought limited my load to 1200% CPU. This was set up when I used Matlab and Julia. However, I see from your numbers that I don't quite understand the relationship between # workers / # threads / OMP, so maybe I'm wrong.
Since dask is also multithreaded/multiprocess, it might cause some multiplication of the OMP setting?
Some guidance here maybe: https://docs.dask.org/en/latest/array-best-practices.html
Dask devs seem to suggest setting `OMP_NUM_THREADS=1`.
I had 2 workers and 4 threads per worker during my tests, and I reached more than 3000% CPU on 2 python processes. I would have to test it, but that could suggest that each worker might be allowed to use up to threads_per_worker × OMP_NUM_THREADS threads. That would mean 2 × 4800% CPU, which would fit my results?
Yes @RondeauG, I believe that this is what happens. OMP_NUM_THREADS applies per thread, so 4 dask threads with OMP=12 means a maximum of 48 threads per worker!
Also, I expected a greater difference between 2 workers × 4 threads and 1 worker × 8 threads... 1 worker does use (slightly) less memory, but the 2×4 setup seems somewhat faster.
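Spelled out, the worst case is just the product of the three settings. A back-of-the-envelope sketch (plain arithmetic, not a dask API):

```python
def max_native_threads(n_workers: int, threads_per_worker: int, omp_num_threads: int) -> int:
    """Upper bound on OS threads if every dask thread spawns its own OpenMP pool."""
    return n_workers * threads_per_worker * omp_num_threads

# RondeauG's test above: 2 workers x 4 threads, OMP_NUM_THREADS=12
print(max_native_threads(2, 4, 12))  # 96 threads, i.e. up to 9600% CPU
```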
Not totally sure if all of these apply to our servers/config, but there are MKL and OPENBLAS settings as well in the dask 'best practices' guidance, e.g.:

```
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```
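If these are set from inside a Python script rather than `.bashrc`, it is usually safer to set them before numpy/scipy are imported, since the native thread pools are typically sized when those libraries initialize. A minimal sketch, assuming a standard conda/pip environment:

```python
import os

# Cap the native thread pools before numpy/scipy/dask are imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402
import dask.array as da  # noqa: E402
```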
They might apply for other numpy processes, but `griddata` (and Qhull underneath it) seems to only depend on OpenMP. With dask, I agree with setting them all to 1. However, without dask, for smaller datasets, `griddata`'s parallelism is far more efficient than dask's.
After a few tests on my end with `xclim.sdba`, I seem to get better results (and reasonable CPU use) with:

```
export OMP_NUM_THREADS=10
client = Client(n_workers=2, threads_per_worker=1, memory_limit="10GB")
```

rather than:

```
export OMP_NUM_THREADS=1
client = Client(n_workers=2, threads_per_worker=10, memory_limit="10GB")
```
My guess is that for the parts that are not dask-optimized, `OMP_NUM_THREADS=10` is more useful, which lets the whole script run faster?
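For completeness, the first configuration as a single runnable sketch (same values as above; the environment variable is set before the scientific imports so the OpenMP pool picks it up):

```python
import os

# Give OpenMP (griddata/qhull) the threads, keep dask to 1 thread per worker.
os.environ["OMP_NUM_THREADS"] = "10"

from dask.distributed import Client  # noqa: E402

client = Client(n_workers=2, threads_per_worker=1, memory_limit="10GB")
```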
That's what I understand. With EQM or DQM, the main bottleneck is the `griddata` calls that are wrapped in `apply_ufunc`; underneath is qhull. So the dask parallelization is hard-coded by us, while scipy is using highly-optimized external code. I will try to explain this in an upcoming sdba notebook...
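For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what "`griddata` wrapped in `apply_ufunc`" looks like. The data and the helper function are toy stand-ins, not xclim's actual implementation:

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata


def _interp_on_quantiles(newx, oldx, oldy):
    # One small 1-D interpolation problem per grid point; scipy (qhull in the N-D case)
    # does the work and is free to use its own OpenMP threads.
    return griddata(oldx, oldy, newx, method="linear")


# Toy data standing in for the bias-adjustment quantities.
nlat, ntime, nq = 8, 365, 20
sim = xr.DataArray(np.random.rand(nlat, ntime), dims=("lat", "time")).chunk({"lat": 1})
quantiles = xr.DataArray(np.linspace(0, 1, nq), dims=("quantiles",))
factors = xr.DataArray(np.random.rand(nlat, nq), dims=("lat", "quantiles")).chunk({"lat": 1})

# Dask parallelizes over the lat chunks; each call into scipy is serial from dask's
# point of view.
out = xr.apply_ufunc(
    _interp_on_quantiles,
    sim,
    quantiles,
    factors,
    input_core_dims=[["time"], ["quantiles"], ["quantiles"]],
    output_core_dims=[["time"]],
    dask="parallelized",
    output_dtypes=[float],
    vectorize=True,  # loop over the non-core (lat) dimension
)
out.compute()
```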
Small update, for `xclim.sdba` at least. Chunking along a single dimension (`lat`, for example) gives me better performance than chunking along two dimensions (`lat` and `lon`). With only `lat` chunked, I get closer to that 1000% CPU more often.
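In code, the difference is just the chunking argument; a sketch with hypothetical chunk sizes:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# Chunks along a single spatial dimension (what seems to perform better here):
ds_1d = ds.chunk({"lat": 5})

# Chunks along two spatial dimensions, for comparison:
ds_2d = ds.chunk({"lat": 5, "lon": 10})
```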
Not an issue or even a real benchmark, but I thought it would be interesting to share this comparison with xclim people. In particular: @RondeauG, @huard and @tlogan2000.
I ran a basic `DetrendedQuantileMapping` adjustment on the infamous `air_temperature` tutorial dataset of xarray (code below). I am using the DQM code in PR Ouranosinc/xclim#467, which is more efficient than what is on master right now.
Initially, I was trying this because Gabriel and I had a doubt about xclim/dask/numpy respecting the `OMP_NUM_THREADS` environment variable. Turns out the flag is respected, at least on my home setup... So the following are all the exact same calculations, but with different dask / OMP configurations (a sketch of how they map to code follows the list). There are 8 cores on my laptop and the data size is around 7 MB.
- Default dask (implicit 8 threads), OMP=1
- Distributed dask, 1 worker, 8 threads, OMP=1
- Distributed dask, 2 workers, 4 threads, OMP=1
- Distributed dask, 1 worker, 4 threads, OMP=2
- No Dask, OMP=8
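Purely to illustrate how these configurations translate to code (a hypothetical rendering, not the original benchmark script), here is one of them, with a small sanity check that the worker processes actually see the OMP setting:

```python
import os

# "Distributed dask, 2 workers, 4 threads, OMP=1"
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client  # noqa: E402

client = Client(n_workers=2, threads_per_worker=4)

# Each worker should report "1" here if the flag is respected.
print(client.run(lambda: os.environ.get("OMP_NUM_THREADS")))

# "Default dask" corresponds to not creating a Client at all (the default threaded
# scheduler), and "No Dask" to loading the data and computing eagerly with numpy.
```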