dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Dask, GridSearchCV, XGBoost on GPU #443

Open jtromans opened 5 years ago

jtromans commented 5 years ago

I am having trouble finding best practices for running GPU-based XGBoost hyper-parameter searches using Dask (RandomizedSearchCV or GridSearchCV from dask-ml). Currently running Windows 7 (don't ask), 128GB RAM, 1x Titan Xp, and dual Xeons (8 cores each, 16C/32T combined), with Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32.

I use dask.distributed as follows:

from dask.distributed import Client
scheduler_address = 'xxx.xxx.xx.xx:8786'  
client = Client(scheduler_address)

Example dataframe:

Index: 526829 entries, 2017-12-01 00:00:00 UTC to 2018-12-04 23:59:00 UTC
Columns: 601 entries, XXX to YYY
dtypes: float32(600), float64(1)
memory usage: 1.2+ GB

When using dask.distributed I understand that the GridSearchCV n_jobs parameter is ignored. It seems that jobs are distributed and processed based on the number of threads available. For example, if I run 2 local worker processes, each with 1 thread, two task-processing streams are created. Given that I'm developing on a single machine and leveraging only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.
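
For concreteness, the kind of search I'm running looks roughly like this (a sketch only; the estimator settings, parameter grid, and synthetic data are illustrative, not my actual code):

# A sketch only -- settings and data below are illustrative, not the real workload
from dask.distributed import Client
from dask_ml.model_selection import RandomizedSearchCV
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

client = Client('xxx.xxx.xx.xx:8786')   # same scheduler address as above

# Stand-in for the real dataframe described above
X, y = make_regression(n_samples=10000, n_features=600)

# tree_method='gpu_hist' asks XGBoost to train on the GPU (needs a GPU-enabled build)
estimator = XGBRegressor(tree_method='gpu_hist')

param_distributions = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
}

search = RandomizedSearchCV(estimator, param_distributions, n_iter=20, cv=3)
search.fit(X, y)
print(search.best_params_)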

Questions:

- XGB also has an n_jobs equivalent for CPU implementations; how should this be considered in the context of Dask as described above?
- Consider XGBoost GPU: will Dask simply submit multiple XGBoost jobs to the single GPU, or is there some additional scheduling that takes place?

Sorry if this isn't the best place to ask these questions.

TomAugspurger commented 5 years ago

When using dask.distributed I understand that the GridSearchCV n_jobs parameter is ignored. It seems that jobs are distributed and processed based on the number of threads available.

I believe that's correct. GridSearchCV hooks into the distributed joblib backend, and I believe that the distributed joblib backend (currently developed in joblib itself) ignores the n_jobs parameter.
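
For reference, if you mean scikit-learn's GridSearchCV dispatched through the dask joblib backend, the pattern is roughly this (a sketch; the estimator, grid, and data are illustrative):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

client = Client('xxx.xxx.xx.xx:8786')   # creating the client registers the 'dask' joblib backend
X, y = make_regression(n_samples=10000, n_features=600)

# n_jobs is effectively ignored once the work is dispatched to the cluster;
# parallelism comes from however many workers/threads the scheduler knows about.
search = GridSearchCV(XGBRegressor(tree_method='gpu_hist'),
                      {'max_depth': [3, 5, 7]}, cv=3, n_jobs=-1)

with joblib.parallel_backend('dask'):
    search.fit(X, y)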

For example, if I run 2 local worker processes, each with 1 thread, two task-processing streams are created. Given that I'm developing on a single machine and leveraging only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.

XGB also has an n_jobs equivalent for CPU implementations; how should this be considered in the context of Dask as described above?

Are you referring to just xgboost, or dask-xgboost?

For dask-xgboost, I believe that option will be ignored, but I haven't tested that.

Consider XGBoost GPU: will Dask simply submit multiple XGBoost jobs to the single GPU, or is there some additional scheduling that takes place?

dask-xgboost will start one XGBoost worker next to each Dask worker. For a single GPU / single node, I don't think you'll see any performance benefits over just using XGBoost directly.
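
For completeness, dask-xgboost usage looks roughly like this (a sketch; the data and parameters are made up):

import dask.dataframe as dd
import dask_xgboost as dxgb
import pandas as pd
from dask.distributed import Client

client = Client('xxx.xxx.xx.xx:8786')

# Stand-in data; in practice this would be the real feature dataframe
pdf = pd.DataFrame({'f0': range(100), 'f1': range(100), 'target': [0, 1] * 50})
df = dd.from_pandas(pdf, npartitions=4)

# dask-xgboost starts an XGBoost worker next to each Dask worker and hands each
# one its local partitions, so there is one XGBoost process per Dask worker.
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
bst = dxgb.train(client, params, df[['f0', 'f1']], df['target'])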

FYI, https://github.com/mrocklin/dask-cuda may be helpful (and helpful to us if you want to be a beta tester :)

Sorry if this isn't the best place to ask these questions.

This is a fine forum, I think. Sorry I don't have better answers ;) Dask's interaction with GPUs should be improving, though.

jtromans commented 5 years ago

Thanks for the response. I can confirm that, after some careful benchmarking, I observe a significant speed-up running multiple processes that submit jobs to the Titan Xp in parallel. It does not scale linearly, but it is certainly worth pursuing. I've monitored both the wall-clock time across thousands of trials and the GPU utilization reported by the nvidia-smi tool. In both cases, the results reveal diminishing returns as I increase the number of processes (usually 1 thread per process). Nonetheless, each incremental worker adds additional performance. I tested up to 16 workers on a 16-thread machine running 1 thread per worker. To achieve this performance, all data must clearly fit in memory, which can become a lot when you have a significant number of workers.

TomAugspurger commented 5 years ago

Is there anything else to do here @jtromans?

I'm developing on a single machine and leveraging only 1 GPU; I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.

I missed the single-machine aspect here earlier. Is using scikit-learn / XGBoost directly an option for you? Each of those has its own single-machine parallelism framework. Adding dask may be complicating things unnecessarily.
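
i.e. something along these lines, with no Dask in the picture (a sketch; the grid and data are illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=10000, n_features=600)

# scikit-learn's n_jobs parallelizes across candidate fits, while XGBoost's own
# n_jobs parallelizes each individual fit across CPU threads -- no Dask involved.
search = GridSearchCV(XGBRegressor(n_jobs=4),
                      {'max_depth': [3, 5, 7]}, cv=3, n_jobs=2)
search.fit(X, y)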

jtromans commented 5 years ago

No you can close this.

For those interested, the reason I started with this basic set-up and Dask is that I wanted to get something simple working before expanding.

I am now on a single machine running 4x 2080Ti and Ubuntu. I use a bash script in the following style to set up the scheduler and workers:

SERVER_IP='192.168.1.16:8786'

# Start the scheduler in the background
nohup dask-scheduler &

# One single-threaded worker per GPU; CUDA_VISIBLE_DEVICES pins each worker to its GPU
CUDA_VISIBLE_DEVICES=0 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=1 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=2 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=3 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &

This means each dask-worker only sees a specific GPU, and I can control how many tasks land on each GPU rather easily. I have explored overloading the XGB algorithm to run 2 or sometimes 3 jobs in parallel per GPU, since I have memory available. This generally works well. I will look to set up multiple machines, and I am interested in the performance difference and the optimal number of GPUs per machine given the available CPU/RAM/GPU combinations.
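
In case it helps anyone later, the client-side pattern I use to fan trials out over those GPU-pinned workers is roughly this (a sketch; the fitting function, parameter sets, and data are illustrative, not my actual code):

from dask.distributed import Client
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

client = Client('192.168.1.16:8786')
X, y = make_regression(n_samples=10000, n_features=600)   # stand-in for the real data

def fit_one(params, X, y):
    # Runs on whichever worker the scheduler picks; that worker only sees the
    # GPU it was pinned to via CUDA_VISIBLE_DEVICES in the launch script above.
    model = XGBRegressor(tree_method='gpu_hist', **params)
    model.fit(X, y)
    return params, model.score(X, y)

param_sets = [{'max_depth': d, 'n_estimators': n}
              for d in (3, 5, 7) for n in (100, 300)]

# With --nthreads=1 each worker runs one task (one XGBoost fit) at a time, so the
# number of concurrent jobs per GPU equals the number of workers pinned to it.
futures = [client.submit(fit_one, p, X, y) for p in param_sets]
results = client.gather(futures)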