dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[RFC] Remove support for single process multi-GPU #4531

Closed RAMitchell closed 5 years ago

RAMitchell commented 5 years ago

This issue proposes removing support for single-process multi-GPU training in xgboost. In practice this means we would no longer support the 'n_gpus' parameter, i.e. the ability to train with multiple GPUs without some distributed framework.
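
For context, a rough sketch of the usage pattern this RFC would deprecate; the exact semantics of 'n_gpus' here are recalled from the 0.x releases (with -1 assumed to mean "use all visible GPUs") and the data is synthetic:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10_000, 20)
y = np.random.rand(10_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "gpu_hist",
    "n_gpus": -1,  # single-process multi-GPU: the parameter this RFC proposes to remove
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```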

We would focus on multi-GPU support through distributed frameworks such as Dask (see #4473) and Spark, where the framework would be expected to assign a worker process for each GPU independently.
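
A rough sketch of what the framework-driven replacement could look like, assuming the xgboost.dask interface being developed in #4473 together with dask_cuda's LocalCUDACluster (which starts one worker process per visible GPU); names and signatures reflect the API that later shipped and are illustrative here:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # one worker process per visible GPU
import dask.array as da
import xgboost as xgb

cluster = LocalCUDACluster()
client = Client(cluster)

# Each chunk lives on (at most) one worker, i.e. one GPU; Dask owns the partitioning.
X = da.random.random((1_000_000, 50), chunks=(100_000, 50))
y = da.random.random(1_000_000, chunks=(100_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist"},  # each worker trains on its own GPU
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]
```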

The reasoning is that it would greatly simplify all GPU code in xgboost. Currently the GPU code works by accepting an input DMatrix and manually partitioning its rows across each available GPU. This partitioning, as well as managing multiple GPUs via threads, imposes significant complexity on developers. The result is more code that is harder to read, more bug-prone (we have many issues documenting this) and less flexible when developing new features. It is difficult to overstate how beneficial this change would be for developers currently working on GPU code.

By doing this we are effectively deferring the task of partitioning data for a single machine with multiple GPUs onto the higher level distributed framework.

In the short term this will be a direct removal of an existing feature, but I believe in the medium term (1 year) it will result in a significantly better product.

We would provide a path for users to transition to the new multi-GPU support. This will also be better for users in the long run, as these distributed solutions are arbitrarily scalable. Between making the decision to deprecate this functionality and removing the code, we would wait at least three months.

This is still very much up for debate and feedback is welcome from developers and users, particularly those who currently use multi-GPU support as a part of their workflow.

TimZaman commented 5 years ago

Yes! This is one of the reasons why, in DL land, Horovod got so popular. The single-process story of TensorFlow was too complicated, and it didn't easily translate to multi-node either. Nowadays most multi-GPU code I see uses Horovod or torch.distributed, both of which follow the multi-process, single-GPU-per-process paradigm. This also scales naturally to multiple nodes without any code editing and gives the user (and dev) a simpler multi-GPU programming model.
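
For illustration, the one-process-per-GPU pattern described above, sketched with torch.distributed; it assumes a launcher (e.g. torchrun) starts one process per GPU and exports LOCAL_RANK plus the usual rendezvous environment variables:

```python
import os
import torch
import torch.distributed as dist

# One process per GPU: the launcher tells each process which local GPU it owns.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)        # this process touches exactly one GPU
dist.init_process_group(backend="nccl")  # NCCL handles inter-GPU communication

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"worker {rank}/{world_size} bound to GPU {local_rank}")

# ... build this worker's data shard and model, wrap in DistributedDataParallel, train ...
dist.destroy_process_group()
```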

trivialfis commented 5 years ago

I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are also R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to the language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

trivialfis commented 5 years ago

@TimZaman I thought the way TF distributes subgraphs was quite elegant. Seems I'm a little behind on this. :)

datametrician commented 5 years ago

Agree with @TimZaman. RAPIDS/cuML is also following a one-process-per-GPU (OPG) paradigm, and we feel it's much simpler to develop and maintain (no hard evidence, just our initial development experience compared to single-process multi-GPU).

pseudotensor commented 5 years ago

Pros:

1) xgboost becomes simpler for developers.
2) xgboost multi-GPU performs poorly; it would need significant attention to become beneficial.

Cons:

1) Unable to easily run a single model on multiple GPUs. Users would be required to use Dask or other 3rd-party libraries, so xgboost by itself becomes strongly dependent on 3rd-party libraries for functionality it already had.
2) No longer able to train a single model on big data. But xgboost's memory usage is extreme, and LightGBM offers a vastly better memory footprint, making multi-GPU xgboost not very useful anyway.

I would have wanted xgboost multi-GPU to work very well, using NVLink or whatever. But if that's not going to happen, then there's not much point in keeping the multi-GPU capability in xgboost.

RAMitchell commented 5 years ago

> I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are also R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to the language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

@trivialfis I don't see how to do this. Consider that HostDeviceVector is everywhere in our API; the purpose of this RFC is to remove the complexity of multi-GPU data partitioning from data structures like this.

I look at this as a question of scope: what functionality are we, as developers with limited resources, prepared to provide? It's very tempting to try to do everything, but I think many of the features we currently have for GPU are not implemented well enough for my liking, and the quality of the code base is not at the level I want it to be. You are correct that we would no longer provide out-of-the-box functionality to use multiple GPUs from R and Julia, but @pseudotensor is also right that the multi-GPU solution is not particularly compelling vs. single GPU.

I would personally be satisfied if we produce a high quality building block that others could use to implement a distributed machine learning system.

trivialfis commented 5 years ago

@RAMitchell One problem at a time.

Let's divide and conquer. :)

sh1ng commented 5 years ago

I'd start with a design doc and a POC showing how to parallelize the algorithm without introducing too much overhead. Correct me if I'm wrong, but finding the best splits requires (2^LEVEL)*N_FEATURES comparisons if a tree node is not partitioned across devices, and (2^LEVEL)*N_FEATURES*N_BINS if it is. Those operations will require cross-process communication that may introduce some overhead.

In the end, if it scales well enough, that outweighs downsides like an extra 3rd-party library.
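
As a back-of-the-envelope illustration of the counts above (the values are made up; this only expands the formulas in the comment):

```python
# Illustrative numbers only: tree level being expanded, feature count, histogram bin count.
LEVEL, N_FEATURES, N_BINS = 6, 100, 256

# Node's rows kept on one device: one best-split scan per (node, feature).
per_level_local = (2 ** LEVEL) * N_FEATURES                 # 6,400 comparisons

# Node's rows split across devices: partial histograms (one entry per bin)
# must be exchanged and reduced before the scan can run.
per_level_cross_device = (2 ** LEVEL) * N_FEATURES * N_BINS  # 1,638,400 entries

print(per_level_local, per_level_cross_device)
```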

RAMitchell commented 5 years ago

Communication throughput via NCCL is the same across processes as it is for the threaded version. Running the algorithm in parallel is essentially a question of starting a rabit tracker and connecting workers to it. We can certainly document this more to enable others to potentially incorporate this into their distributed framework. I think some kind of blog post explaining the distributed architecture would be helpful for a lot of people.
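
A rough sketch of what a single worker process could look like under this model, assuming a rabit tracker (e.g. dmlc-core's tracker.py) has already been started and has exported the usual DMLC_* environment variables to each worker; the xgboost.rabit calls reflect the Python binding of that era, and WORKER_GPU_ID plus the shard file name are hypothetical details a launcher would supply:

```python
import os
import xgboost as xgb

# Pin this process to a single GPU (one worker per GPU).
# WORKER_GPU_ID is a hypothetical variable the launcher would set per worker.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("WORKER_GPU_ID", "0")

xgb.rabit.init()            # connect to the tracker advertised via DMLC_TRACKER_URI/PORT
rank = xgb.rabit.get_rank()

# Each worker loads only its own shard of the data (illustrative file naming).
dtrain = xgb.DMatrix(f"train.part-{rank}.libsvm")

booster = xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=100)

if rank == 0:
    booster.save_model("model.bin")  # every worker ends up with the same model
xgb.rabit.finalize()
```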

arnocandel commented 5 years ago

I’m ok with this path - robustness wins over features.

RAMitchell commented 5 years ago

I think we have collected enough feedback without significant objections from customers. I will file a PR enabling deprecation warnings and updating documentation. We will aim to start removing code around September.

trivialfis commented 5 years ago

@RAMitchell Is this line a bug?

https://github.com/dmlc/xgboost/blob/ea4441775405b97d99ebc6b997b726b64564723b/src/learner.cc#L615

RAMitchell commented 5 years ago

I've never seen this error message, and we even have distributed GPU tests now, so it must not be reaching this code block.

RAMitchell commented 5 years ago

The documentation has been updated and deprecation warnings will now be issued. I will leave this issue open until we start removing this feature in case of any further feedback.

rongou commented 5 years ago

Looks like half the CUDA code is dealing with multi-GPU. Does anyone have any idea how to proceed? Just bite the bullet and forge ahead? :)

Some of the DMatrix refactoring might be blocked by this.

@RAMitchell @trivialfis @sriramch

trivialfis commented 5 years ago

@rongou @RAMitchell Can we close this now? What's left is to gradually clean up the code.

rongou commented 5 years ago

I'm working on #4773, which should be the bulk of the changes needed. Maybe we can close this after it goes in?