dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Dask, GridSearchCV, XGBoost on GPU #443

Open jtromans opened 5 years ago

jtromans commented 5 years ago

I am having trouble finding best practices for running GPU-based XGBoost hyper-parameter searches using Dask (RandomizedSearchCV or GridSearchCV from dask-ml). Currently running Windows 7 (don't ask), 128GB RAM, 1x Titan Xp, and dual Xeons (8 cores each, 16C/32T combined), with Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32.

I use dask.distributed as follows:

from dask.distributed import Client
scheduler_address = 'xxx.xxx.xx.xx:8786'  
client = Client(scheduler_address)

Example dataframe:

Index: 526829 entries, 2017-12-01 00:00:00 UTC to 2018-12-04 23:59:00 UTC
Columns: 601 entries, XXX to YYY
dtypes: float32(600), float64(1)
memory usage: 1.2+ GB

When using dask.distributed I understand that the GridSearchCV n_jobs parameter is ignored. It seems that jobs are distributed and processed based on the number of threads available. For example, if I run 2 local worker processes, each with 1 thread, two task-processing streams are created. Given that I'm developing on a single machine and leveraging only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.
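
For concreteness, the kind of search I'm running looks roughly like this (a sketch only; the estimator settings, parameter grid, and synthetic data are illustrative, not my actual code):

# A sketch only -- settings and data below are illustrative, not the real workload
from dask.distributed import Client
from dask_ml.model_selection import RandomizedSearchCV
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

client = Client('xxx.xxx.xx.xx:8786')   # same scheduler address as above

# Stand-in for the real dataframe described above
X, y = make_regression(n_samples=10000, n_features=600)

# tree_method='gpu_hist' asks XGBoost to train on the GPU (needs a GPU-enabled build)
estimator = XGBRegressor(tree_method='gpu_hist')

param_distributions = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
}

search = RandomizedSearchCV(estimator, param_distributions, n_iter=20, cv=3)
search.fit(X, y)
print(search.best_params_)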

Questions:

- XGB also has an n_jobs equivalent for CPU implementations; how should this be considered in the context of Dask as described above?
- Consider XGBoost GPU: will Dask simply submit multiple XGBoost jobs to the single GPU, or is there some additional scheduling that takes place?

Sorry if this isn't the best place to ask these questions.

TomAugspurger commented 5 years ago

When using dask.distributed I understand that the GridSearchCV n_jobs parameter is ignored. It seems that jobs are distributed and processed based on the number of threads available.

I believe that's correct. GridSearchCV hooks into the distributed joblib backend, and I believe that the distributed joblib backend (currently developed in joblib itself) ignores the n_jobs parameter.
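
For reference, if you mean scikit-learn's GridSearchCV dispatched through the dask joblib backend, the pattern is roughly this (a sketch; the estimator, grid, and data are illustrative):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

client = Client('xxx.xxx.xx.xx:8786')   # creating the client registers the 'dask' joblib backend
X, y = make_regression(n_samples=10000, n_features=600)

# n_jobs is effectively ignored once the work is dispatched to the cluster;
# parallelism comes from however many workers/threads the scheduler knows about.
search = GridSearchCV(XGBRegressor(tree_method='gpu_hist'),
                      {'max_depth': [3, 5, 7]}, cv=3, n_jobs=-1)

with joblib.parallel_backend('dask'):
    search.fit(X, y)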

For example, if I run 2 local worker processes, each with 1 thread, two task-processing streams are created. Given that I'm developing on a single machine and leveraging only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.

XGB also has an n_jobs equivalent for CPU implementations; how should this be considered in the context of Dask as described above?

Are you referring to just xgboost, or dask-xgboost?

For dask-xgboost, I believe that option will be ignored, but I haven't tested that.

Consider XGBoost GPU: will Dask simply submit multiple XGBoost jobs to the single GPU, or is there some additional scheduling that takes place?

dask-xgboost will start one XGBoost worker next to each Dask worker. For a single GPU / single node, I don't think you'll see any performance benefits over just using XGBoost directly.
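
For completeness, dask-xgboost usage looks roughly like this (a sketch; the data and parameters are made up):

import dask.dataframe as dd
import dask_xgboost as dxgb
import pandas as pd
from dask.distributed import Client

client = Client('xxx.xxx.xx.xx:8786')

# Stand-in data; in practice this would be the real feature dataframe
pdf = pd.DataFrame({'f0': range(100), 'f1': range(100), 'target': [0, 1] * 50})
df = dd.from_pandas(pdf, npartitions=4)

# dask-xgboost starts an XGBoost worker next to each Dask worker and hands each
# one its local partitions, so there is one XGBoost process per Dask worker.
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
bst = dxgb.train(client, params, df[['f0', 'f1']], df['target'])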

FYI, https://github.com/mrocklin/dask-cuda may be helpful (and helpful to us if you want to be a beta tester :)

Sorry if this isn't the best place to ask these questions.

This is a fine forum, I think. Sorry I don't have better answers ;) Dask's interaction with GPUs should be improving, though.

jtromans commented 5 years ago

Thanks for the response. I can confirm that, after some careful benchmarking, I observe a significant speed-up running multiple processes that submit jobs to the Titan Xp in parallel. It does not scale linearly, but it is certainly worth pursuing. I've monitored both the wall-clock time across thousands of trials and the GPU utilization reported by the nvidia-smi tool. In both cases, the results reveal diminishing returns as I increase the number of processes (usually 1 thread per process). Nonetheless, each incremental worker adds additional performance. I tested up to 16 workers on a 16-thread machine running 1 thread per worker. To achieve this performance, all data must clearly fit in memory, which can become a lot when you have a significant number of workers.

TomAugspurger commented 5 years ago

Is there anything else to do here @jtromans?

I'm developing on a single machine and leveraging only 1 GPU; I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.

I missed the single-machine aspect here earlier. Is using scikit-learn / XGBoost directly an option for you? Each of those has its own single-machine parallelism framework. Adding dask may be complicating things unnecessarily.
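
i.e. something along these lines, with no Dask in the picture (a sketch; the grid and data are illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=10000, n_features=600)

# scikit-learn's n_jobs parallelizes across candidate fits, while XGBoost's own
# n_jobs parallelizes each individual fit across CPU threads -- no Dask involved.
search = GridSearchCV(XGBRegressor(n_jobs=4),
                      {'max_depth': [3, 5, 7]}, cv=3, n_jobs=2)
search.fit(X, y)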

jtromans commented 5 years ago

No you can close this.

For those interested, the reason I started with this basic set-up and Dask is that I wanted to get something simple working before expanding.

I am now on a single machine running 4x 2080Ti and Ubuntu. I use a bash script in the following style to set up the scheduler and workers:

SERVER_IP='192.168.1.16:8786'

# Start the scheduler in the background
nohup dask-scheduler &

# One single-threaded worker per GPU; CUDA_VISIBLE_DEVICES pins each worker to its GPU
CUDA_VISIBLE_DEVICES=0 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=1 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=2 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=3 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &

This means each dask-worker only sees a specific GPU, and I can control how many tasks land on each GPU rather easily. I have explored overloading the XGB algorithm to run 2 or sometimes 3 jobs in parallel per GPU, since I have memory available. This generally works well. I will look to set up multiple machines, and I am interested in the performance difference and the optimal number of GPUs per machine given the available CPU/RAM/GPU combinations.
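
In case it helps anyone later, the client-side pattern I use to fan trials out over those GPU-pinned workers is roughly this (a sketch; the fitting function, parameter sets, and data are illustrative, not my actual code):

from dask.distributed import Client
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

client = Client('192.168.1.16:8786')
X, y = make_regression(n_samples=10000, n_features=600)   # stand-in for the real data

def fit_one(params, X, y):
    # Runs on whichever worker the scheduler picks; that worker only sees the
    # GPU it was pinned to via CUDA_VISIBLE_DEVICES in the launch script above.
    model = XGBRegressor(tree_method='gpu_hist', **params)
    model.fit(X, y)
    return params, model.score(X, y)

param_sets = [{'max_depth': d, 'n_estimators': n}
              for d in (3, 5, 7) for n in (100, 300)]

# With --nthreads=1 each worker runs one task (one XGBoost fit) at a time, so the
# number of concurrent jobs per GPU equals the number of workers pinned to it.
futures = [client.submit(fit_one, p, X, y) for p in param_sets]
results = client.gather(futures)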