dask / dask-glm

BSD 3-Clause "New" or "Revised" License
75 stars 46 forks

Optimal chunksizes #32

Open mrocklin opened 7 years ago

mrocklin commented 7 years ago

In some cases we may wish to rechunk our data prior to execution. This can help to balance between high scheduling overheads (too many tasks) and poor load balancing (too few tasks).

It appears that different algorithms have different optimal chunksizes. For example, algorithms with low task counts like ADMM benefit from smaller chunksizes, while algorithms with many small tasks like gradient/proximal descent benefit from larger chunksizes.
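As a minimal sketch of the tradeoff above (the array sizes and chunksizes here are made up for illustration): rechunking before execution changes the task count, trading scheduler overhead against load balancing.

```python
import dask.array as da
import numpy as np

# Illustrative array: 1e6 elements, initially split into 10 chunks.
x = da.from_array(np.arange(1_000_000), chunks=100_000)

# Fewer, larger chunks -> fewer tasks (less scheduling overhead,
# but coarser load balancing).
coarse = x.rechunk(500_000)

# More, smaller chunks -> more tasks (better load balancing,
# but higher scheduling overhead).
fine = x.rechunk(10_000)

print(coarse.numblocks, fine.numblocks)  # (2,) (100,)
```

An algorithm like ADMM might prefer something closer to `fine`, while a gradient-descent loop that already generates many small tasks per iteration would prefer `coarse`.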

stoneyv commented 7 years ago

Section 4.2 (Cache-aware Access) of Tianqi Chen and Carlos Guestrin's XGBoost paper discusses the tradeoff between smaller and larger block sizes: smaller blocks give each thread too small a workload and parallelize inefficiently, while larger blocks result in processor cache misses. They run this analysis on two data sets, Allstate 10M and Higgs 10M. Figure 9 plots time per tree versus number of threads for block sizes of 2^12, 2^16, 2^20, and 2^24, and there is a significant performance gap between the 2^24 block size and the others. https://arxiv.org/pdf/1603.02754.pdf

Maybe we could duplicate this experiment for ADMM on those two data sets. They used S3 instead of HDFS.
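A rough single-machine version of that experiment could just time one reduction at several chunksizes (this is only a sketch, not the XGBoost benchmark; the sizes and the `sum` workload are stand-ins):

```python
import time

import dask.array as da
import numpy as np

# Stand-in data; a real experiment would load Allstate/Higgs instead.
data = np.random.random(2_000_000)

# Sweep block sizes in the same spirit as Figure 9 of the XGBoost paper.
for chunksize in (2**12, 2**16, 2**20):
    x = da.from_array(data, chunks=chunksize)
    start = time.perf_counter()
    x.sum().compute()
    elapsed = time.perf_counter() - start
    print(f"chunksize=2**{chunksize.bit_length() - 1}: "
          f"{elapsed:.3f}s across {x.npartitions} tasks")
```

Replacing the `sum` with an ADMM iteration would give the analogue of their per-block timing plot.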

Also, it is worth looking at Section 4.3 (Blocks for Out-of-core Computation), which discusses block compression and sharding. You might also look at the Blosc/c-blosc package that HDF5 uses. https://github.com/Blosc/c-blosc