dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Multi-GPU Does not Work with Cuda 10.0 #3812

Closed edwin-yan closed 5 years ago

edwin-yan commented 5 years ago

I initially installed xgboost using https://s3-us-west-2.amazonaws.com/xgboost-wheels/xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl. Then I found that only 1 GPU was being used, even after I set n_gpus=-1 (or 2, 3, or 4). Therefore, I tried to build it from source. There did not seem to be any issues during the installation, but I get the error below when I call the train function (or the fit function in the sklearn-style API; same error). I did see two tasks appear in nvtop, so multi-GPU did start up, but I cannot find out what causes the problem. Do you think this is solely caused by CUDA 10?

Thank you!

Error message below:

XGBoostError: b'[07:34:49] /home/yan/xgboost/src/tree/updater_gpu_hist.cu:908: Exception in gpu_hist: [07:34:49] /home/yan/xgboost/src/common/host_devicevector.cu:357: Check failed: distribution.IsEmpty() || distribution.IsEmpty()

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(dmlc::StackTrace[abi:cxx11]()+0x45) [0x7f6357ece2b5]
[bt] (1) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7f6357eced18]
[bt] (2) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::HostDeviceVectorImpl<xgboost::detail::GradientPairInternal >::Reshard(xgboost::GPUDistribution const&)+0x88) [0x7f63580eb5f8]
[bt] (3) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::InitData(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix)+0x671) [0x7f635815df91]
[bt] (4) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::UpdateTree(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, xgboost::RegTree)+0xce) [0x7f635816402e]
[bt] (5) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree, std::allocator<xgboost::RegTree> > const&)+0x17a) [0x7f6358166e5a]
[bt] (6) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete > > >)+0x9cb) [0x7f6357f5194b]
[bt] (7) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x3e3) [0x7f6357f524a3]
[bt] (8) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix)+0x354) [0x7f6357f5f5c4]
[bt] (9) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f6357ec1965]

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(dmlc::StackTrace[abi:cxx11]()+0x45) [0x7f6357ece2b5]
[bt] (1) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7f6357eced18]
[bt] (2) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree, std::allocator<xgboost::RegTree> > const&)+0x332) [0x7f6358167012]
[bt] (3) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete > > >)+0x9cb) [0x7f6357f5194b]
[bt] (4) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x3e3) [0x7f6357f524a3]
[bt] (5) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x354) [0x7f6357f5f5c4]
[bt] (6) /usr/local/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f6357ec1965]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f643736c038]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x32a) [0x7f643736ba9a]
[bt] (9) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ae) [0x7f643757fe6e]

hcho3 commented 5 years ago

The binary wheel was built for CUDA 8.0 / 9.x. I'm not sure if it's compatible with CUDA 10.0, since CUDA 10.0 just came out this month.

edwin-yan commented 5 years ago

I am just wondering why it works (although with only 1 GPU) if I install the wheel directly, while the version I built from source does not work at all (although it seems all GPUs got a task). Maybe it has something to do with the NCCL version too (one for 9.x, one for 10.x). Anyhow, I guess I have to roll back to CUDA 9.2 then. Thanks!!

hcho3 commented 5 years ago

When building from source, did you have NCCL 2 installed for CUDA 10.0?

trivialfis commented 5 years ago

@Bluesn0w Currently GPUPredictor only works on a single GPU, but other parts can work on multiple GPUs. During an update, XGBoost first does a prediction and then tries to find splits, so it takes a while to enter the "multi-gpu" phase. As for the backtrace you saw, my guess is it's the same bug as #3809.

edwin-yan commented 5 years ago

@hcho3, yes I did. I also pointed the cmake step at the root folder of NCCL 2 for CUDA 10. I don't think I saw any errors during the build, so I initially thought it would work.

edwin-yan commented 5 years ago

@trivialfis, thanks! It does look like the same bug. However, it seems that I cannot even run train/fit somehow. I am going to roll back to CUDA 9.2 tonight and give it a try.

trivialfis commented 5 years ago

@Bluesn0w For now, just set n_gpus to -1; I don't think there is really an issue blocking gpu_predictor from working on multiple GPUs. Sorry for the mess. At the very least, #3809 is on the blocking list now.

hcho3 commented 5 years ago

Multi-GPU predictor is now merged. I will add CUDA 10.0 into the test suite to ensure compatibility.

edwin-yan commented 5 years ago

Thank you! That’s exciting. I have been using the precompiled wheel, but it only uses 1 GPU even though I installed the multi-GPU version. My NCCL is 2.3 for CUDA 10.0. I tested with NVIDIA's nccl-tests and everything works there. Looking forward to hearing more.
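
For reference, the kind of sanity check I ran looks roughly like the sketch below. The paths and flags are illustrative (taken from the nccl-tests project's conventions, not from this thread), so adjust CUDA_HOME/NCCL_HOME to the local install locations:

# Build NVIDIA's nccl-tests and run an all-reduce across the GPUs
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr
# -b/-e: min/max message size, -f: size multiplier, -g: number of GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

If this benchmark completes without errors on all GPUs, NCCL itself is probably fine.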

hcho3 commented 5 years ago

Can you try to compile from source with USE_NCCL=1?
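For XGBoost 0.80, the build steps should look roughly like the sketch below. The NCCL_ROOT path is only an example; point it at wherever NCCL 2 for CUDA 10 is installed (with cmake the flag is spelled USE_NCCL=ON rather than USE_NCCL=1):

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
# USE_NCCL=ON links against NCCL, which is required for multi-GPU training
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/usr/local/nccl
make -j4
# Install the freshly built Python package
cd ../python-package
python setup.py install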

hcho3 commented 5 years ago

@Bluesn0w I installed the binary wheel xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl on my machine (an EC2 p2.8xlarge instance) and ran the following script:

import xgboost
from sklearn.datasets import load_breast_cancer

param = {'max_depth': 6, 'eta': 0.01, 'silent': 0,
         'objective': 'binary:logistic',
         'tree_method': 'gpu_hist',
         'n_gpus': -1, 'eval_metric': 'auc'}

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgboost.DMatrix(X, y)

num_round = 1000
bst = xgboost.train(param, dtrain, num_round, [(dtrain, 'train')])

The script did not crash, and all 8 GPUs were used (utilization rate 40-50%).

I also tried compiling from the latest source and got the same result (no crash, all GPUs used).

Description of my machine:

edwin-yan commented 5 years ago

@hcho3 Thank you so much for your help. I started over with a new VM using exactly the same image, installed the CUDA driver, NCCL, and cuDNN, then installed the wheel again, and now it works with multiple GPUs. I cannot figure out which package was causing it to use only 1 GPU with the prebuilt wheel. I guess I should have tried installing in a clean environment earlier.

Just out of curiosity, I tried compiling the latest source on the brand-new VM to see if I could replicate the crash. Surprisingly, it worked well too! However, if I use exactly the same steps on my old VM, it just crashes when I run xgboost.train.

I am going to slowly install the packages I had onto the new VM (e.g., changing the default python3 from 3.5 to 3.6, adding virtualenv, etc.) and see if I can find out what causes the issue. I will keep you all posted if I can replicate the same thing on the new VM.

Please feel free to close this. I really appreciate your and @trivialfis's help.

trivialfis commented 5 years ago

@Bluesn0w No problem. The previous crash should only happen when multiple GPUs are in use, and it was a deterministic bug. To satisfy your curiosity: the cause of the crash is that we don't support changing the number of GPUs, but GPUPredictor uses 1 GPU while other components use multiple of them. As for why only 1 GPU was utilized in your old environment, I'm afraid I can't answer that without access to that particular VM. But I don't think the problem is inside XGBoost, since we have tests for this and haven't seen any failures on our side.

I'm closing this. Thanks for the report and help!