Open jtromans opened 5 years ago
> When using dask.distributed I understand that the GridSearchCV n_jobs parameter is ignored. It seems that jobs are distributed and processed based on the number of threads available.
I believe that's correct. GridSearchCV hooks into the distributed joblib backend, and I believe that backend (currently developed in joblib itself) ignores the n_jobs parameter.
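For reference, a minimal sketch of the setup being described: scikit-learn's GridSearchCV dispatched through the Dask joblib backend. With this backend, scheduling follows the cluster's worker threads, so the `n_jobs` passed to the search is effectively advisory. The toy data and estimator here are illustrative only, not from the original discussion.

```python
# Sketch: GridSearchCV over a Dask cluster via the joblib backend.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client(processes=False)  # small in-process cluster for the demo

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=500),
                      {"C": [0.1, 1.0, 10.0]}, cv=3, n_jobs=-1)

# Importing dask.distributed registers the "dask" joblib backend; fits
# inside this block are submitted to the cluster's workers, and the
# parallelism is governed by their thread counts, not by n_jobs above.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

client.close()
```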
> For example, if I run 2 local Worker processes, each with 1 thread, two task processing streams are created. Given that I'm developing on a single machine and leverage only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.
> XGBoost also has an n_jobs equivalent for CPU implementations; how should this be considered in the context of Dask as described above?
Are you referring to just xgboost, or dask-xgboost?
For dask-xgboost, I believe that option will be ignored, but I haven't tested that.
> Consider XGBoost on the GPU: will Dask simply submit multiple XGBoost jobs to the single GPU, or is there some additional scheduling that takes place?
dask-xgboost will start one XGBoost worker next to each Dask worker. For a single GPU / single node, I don't think you'll see any performance benefits over just using XGBoost directly.
FYI, https://github.com/mrocklin/dask-cuda may be helpful (helpful for us if you want to be a beta tester :)
> Sorry if this isn't the best place to ask these questions.
This is a fine forum, I think. Sorry that I don't have better answers ;) Dask's interaction with GPUs should be improving, though.
Thanks for the response. After some careful benchmarking, I can confirm that I observe a significant speed-up when multiple processes submit jobs to the Titan Xp in parallel. It does not scale linearly, but it is certainly worth pursuing. I've monitored both the wall-clock time across thousands of trials and the GPU utilization reported by the nvidia-smi tool. In both cases, the results show diminishing returns as I increase the number of worker processes (usually 1 thread per process); nonetheless, each additional worker adds performance. I tested up to 16 workers on a 16-thread machine, running 1 thread per worker. To achieve this performance, all data must clearly fit in memory, which can become a lot when you have a significant number of workers.
Is there anything else to do here @jtromans?
> Given that I'm developing on a single machine and leverage only 1 GPU, I've struggled to find optimal performance, and I was wondering whether the developers have an opinion.
I missed the single-machine aspect here earlier. Is using scikit-learn / XGBoost directly an option for you? Each of those has its own single-machine parallelism framework. Adding Dask may be complicating things unnecessarily.
No you can close this.
For those interested, the reason I started with this basic setup and Dask is that I wanted to get something simple working before expanding.
I am now on a single machine running 4x 2080 Ti GPUs on Ubuntu. I use the following style of worker setup in a bash script:
```bash
# Scheduler address (8786 is the dask-scheduler default port)
SERVER_IP='192.168.1.16:8786'
nohup dask-scheduler &
# One single-threaded worker per GPU: CUDA_VISIBLE_DEVICES pins each
# worker to one device, and --memory-limit caps each at 25 GB.
CUDA_VISIBLE_DEVICES=0 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=1 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=2 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
CUDA_VISIBLE_DEVICES=3 nohup dask-worker $SERVER_IP --nprocs=1 --nthreads=1 --memory-limit=2.5e+10 &
```
This means each dask-worker only sees a specific GPU, and I can control how many tasks land on each GPU rather easily. I have explored overloading each GPU with 2 or sometimes 3 XGBoost jobs in parallel, since I have memory available, and this generally works well. I will look to set up multiple machines, and I am interested in the performance difference and the optimal number of GPUs per machine given the available CPU/RAM/GPU combinations.
I am having trouble finding best practices for running GPU-based XGBoost hyper-parameter searches using Dask (RandomizedSearchCV or GridSearchCV from dask-ml). I am currently running Windows 7 (don't ask) with 128 GB RAM, 1x Titan Xp, and dual Xeons with a combined 16C/32T (8 cores each), on
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
I use distributed as follows:
Example dataframe:
Questions: