Multi-GPU operation and data / model Parallelism

shelhamer commented 10 years ago

Multi-GPU operation and data / model / hybrid parallelism are planned and in development for Caffe. The purpose of the thread is to focus the conversation, since this has been asked here, there, and everywhere. There are several ways to approach parallelization, so feel free to discuss your own work to this end here.

Note that Caffe does work with multiple GPUs in a standalone fashion right now: you can train on one GPU while extracting features on another and so on.

Yangqing commented 10 years ago

Note that training with multiple GPUs + data parallelism is also trivially possible with MPI. Model parallelism is less trivial, though.

kloudkl commented 10 years ago

Does data parallelism imply shared model parameters? If it doesn't and the data is split beforehand, data parallelism can be implemented with a shell script.

kloudkl commented 10 years ago

Unfortunately, the weights of the replicas of the same layer have to be synchronized, as described in section 4.1 of [1].

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 [cs.NE]
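
To make the synchronization requirement concrete, here is a minimal sketch of synchronous data parallelism on a toy 1-D model. It is not the scheme from [1] or anything in Caffe: the functions `gradient` and `synchronous_step`, the toy data, and the learning rate are all illustrative assumptions. Each simulated replica computes a gradient on its own shard, the gradients are averaged, and every replica applies the same update, so the weight copies stay identical.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def synchronous_step(w, shards, lr):
    """One data-parallel step over simulated replicas (hypothetical helper)."""
    grads = [gradient(w, shard) for shard in shards]  # one gradient per replica
    avg_grad = sum(grads) / len(grads)                # the synchronization point
    return w - lr * avg_grad                          # same update on every replica


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # targets generated with w = 3
shards = [data[:2], data[2:]]                         # two simulated GPUs
w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.01)
print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the full-batch gradient, which is why the split does not change the result. Skipping the averaging step would let each replica drift toward a different set of weights.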

kloudkl commented 10 years ago

There are tens of thousands of lines in the diff between cuda-convnet2/cudaconvnet and cuda-convnet. It may not be feasible to reproduce all of Alex's work in a short time. Is it acceptable to just wrap it as @soumith did in cuda-convnet2.torch? The members of the Pylearn2 community have successfully wrapped cuda-convnet and are planning to upgrade to cuda-convnet2.

visionscaper commented 10 years ago

cc me

palmforest commented 10 years ago

Hope this becomes real in Caffe soon.

bug-fixed commented 10 years ago

I have a thought to post here for discussion. When the training data is distributed across machines, is it possible or necessary to develop a dispatch server that works together with a parameter database server? Each client machine has its own computing resources (such as NVIDIA GPUs, AMD GPUs, or others) and runs a client program. The clients communicate with the dispatch server or the parameter database, and the dispatch server is responsible only for coordinating updates to the parameter database. Each client should expose interfaces for completing independent tasks on its own part of the data, and the client programs should adapt to the available computing resources for best performance.

kloudkl commented 10 years ago

The training can be done by cuda-convnet2. It's only necessary to convert the model into Caffe's format.

kloudkl commented 10 years ago

The "trivially possible data parallelism" can be implemented with CUDA Multi Process Service (MPS).

kloudkl commented 10 years ago

@Yangqing, could you give us some hints on how to integrate Caffe and CUDA MPS with MPI? Do the solvers of the processes have to communicate with each other? Or do they share the same CUDA context, which automatically combines the memory of multiple GPUs into a single virtual address space?

bhack commented 10 years ago

I don't know whether, across multiple hosts, we could explore something with Spark and the Caffe Python bindings. There are also already some deep network experiments on Spark.

kloudkl commented 10 years ago

Mixing multiple languages together is not a good idea.

bhack commented 10 years ago

@kloudkl I'm talking about using python bindings already available in caffe in pyspark

kloudkl commented 10 years ago

The remarks of Evan Sparks @etrain posted by @sergeyk prove that it is plausible to "use cuda-convnet2 to train the models offline, but use Caffe to parse/analyze them and apply the models to new input images".

madisonmay commented 10 years ago

With the recent addition of cuDNN to the dev branch of Caffe, are multiple GPUs now supported? The recent article on cuDNN (http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/) indicates that cuDNN supports parallelism across GPUs, but doesn't mention whether this support is present in the Caffe wrapper.

shelhamer commented 10 years ago

Caffe and cuDNN alike are single-GPU libraries at the moment but they can be run on multiple GPUs simultaneously in a standalone way.

Multi-GPU parallelism is still in development in Caffe.

madisonmay commented 10 years ago

In other words, multiple GPUs can be used for tasks like hyperparameter selection, but not to train a single model more efficiently?

bhack commented 10 years ago

There is something interesting in TBB flow graph parallelization. This is a feature detector example.

futurely commented 9 years ago

While #2114 and a series of related PRs https://github.com/BVLC/caffe/issues/1535#issuecomment-117018020 have solved the data parallelism, there isn't yet a pull request dedicated to model parallelism. Here is Facebook's implementation just for reference.

PhoenixDai commented 9 years ago

Looks like Nvidia has figured out how to train a model in Caffe with multiple GPUs. This page https://developer.nvidia.com/digits shows a performance comparison of training with different numbers of GPUs, and the video on the page shows how to use multiple GPUs with DIGITS. Is there any plan to have this feature in Caffe?

choosehappy commented 9 years ago

@PhoenixDai 's comment is related to the digits github issue: https://github.com/NVIDIA/DIGITS/issues/92. It seems they have forked their own caffe version which supports multiple gpus?

thatguymike commented 9 years ago

The Nvidia branch uses #2114

shaibagon commented 9 years ago

Linking to a related SO question.

futurely commented 9 years ago

The master branch supports multi-GPU training. Please refer to the latest documents. https://github.com/BVLC/caffe/blob/master/docs/multigpu.md https://github.com/BVLC/caffe/blob/master/docs/tutorial/interfaces.md

xiaoxiongli commented 8 years ago

Thank you for your great work! I am now training GoogLeNet on my K80. As you know, the K80 has 2 GPUs; I enabled both with "-gpu 0,1", and training is faster!

I know that cuda-convnet2 uses the method introduced in Alex Krizhevsky's "One weird trick for parallelizing convolutional neural networks". Is that the method Caffe uses for multi-GPU parallelism?

tuonion commented 8 years ago

Hi @shelhamer,

Does Caffe support 'model / hybrid parallelism' as you mentioned above?

Lisandro79 commented 7 years ago

Hi, I am working on a project that requires the use of 4 GPUs on a server to analyze images. I would like to do it in caffe (I prefer it over torch or tensorflow) but it seems that multiple GPU is still not available for test / inference.

Is there any estimated date for a version update of caffe that will allow using multiple GPUs for test / inference?

Thanks a lot

cypof commented 7 years ago

We don't have a plan to add that right now but I would be happy to help. You can split your dataset and test each part independently, or distribute items to multiple nets as you go. Are you using the Python API?
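
The first suggestion, splitting the dataset and testing each part independently, could look roughly like the sketch below. The helper `split_into_shards` and the file names are hypothetical; each resulting shard would then be handed to a separate process that selects its own GPU.

```python
def split_into_shards(items, num_shards):
    """Deal items out round-robin so shard sizes differ by at most one."""
    return [items[i::num_shards] for i in range(num_shards)]


# Hypothetical file names standing in for a test set.
images = ["img%03d.jpg" % i for i in range(10)]
shards = split_into_shards(images, 4)
for gpu_id, shard in enumerate(shards):
    print(gpu_id, shard)
```

Round-robin dealing keeps the shards balanced even when the number of items is not a multiple of the number of GPUs.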

Lisandro79 commented 7 years ago

Hi Cypof,

Thanks for your fast response. I am not sure I completely understand your suggestions. Let me first provide you with more details about the task.

We have a web server that receives requests (image files) from users, so there is a queue of pending requests at any given time. Our workstation has 4 Titan X GPUs on one motherboard, and I need to use all 4 to process the queue up to four times faster. The requests will be handled as follows:

request 1 GPU:0; request 2 GPU:1; request 3 GPU:2; request 4 GPU:3; request 5 GPU:0; request 6 GPU:1; ...

I am using caffe with the Python API. The problem comes with the selection of the GPU. If I select GPU= 0 with the first request of the queue

    caffe.set_mode_gpu()
    caffe.set_device(0)
    # run inference

then I cannot select GPU=1 with the next request. Even if I load the Caffe model and a model in Torch, Torch cannot use GPU=1 after Caffe has set the device to '0', because Caffe "locks" all other GPUs.

So regarding your suggestions

1- "You can split your dataset and test each part independently". I think this solution does not apply to this case (please correct me if I am wrong)

2- "Distribute items to multiple nets as you go" Is this similar to what I described above (using one network in caffe and another network in torch in different GPUS) but with multiple nets in caffe? Could you please elaborate a bit more on this?

Thank you very much for your help

pythonanonuser commented 7 years ago

@Lisandro79 did you figure out a solution to this issue? I have a similar problem

Lisandro79 commented 7 years ago

Hi @pythonanonuser,

Unfortunately, I have to say that my solution will be to use TensorFlow. I could not find a solution to my problem in Caffe.

In my opinion, the lack of parallelism for testing of Caffe is a major disadvantage for deployment of web applications. I would like to know what other members of the Caffe community have to say about this.

I really like Caffe and I would prefer to use it over other libraries, but at the moment I find that the parallel capabilities of Tensorflow plus the use of tensorboard do make a difference for production.

Best

cypof commented 7 years ago

I meant to prototype something but haven't got to it. I think the easiest way would be to use multiprocessing: either a Queue and one Process per GPU, or maybe a Pool so that you can call map() on your inputs and directly get your outputs. It also depends on where you want to store the results etc.
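
A minimal sketch of the Queue plus one-Process-per-GPU variant follows. It is an illustration under assumptions, not a tested Caffe deployment: `gpu_worker` and `serve` are hypothetical names, and the forward pass is replaced by a stand-in (`len(payload)`) so the sketch runs without Caffe. In a real worker, `caffe.set_mode_gpu()` and `caffe.set_device(gpu_id)` would be called once at the top of the process, which keeps the device choice private to that process.

```python
import multiprocessing as mp


def gpu_worker(gpu_id, tasks, results):
    """Serve requests from `tasks` until a None sentinel arrives.

    Meant to run in its own process, one per GPU.
    """
    # Assumption: with pycaffe installed, the device would be selected and the
    # net loaded once here, e.g. caffe.set_mode_gpu(); caffe.set_device(gpu_id).
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        request_id, payload = item
        # Stand-in for a forward pass; keeps the sketch runnable without Caffe.
        results.put((request_id, gpu_id, len(payload)))


def serve(requests, num_gpus=4):
    """Start one worker process per GPU; return results sorted by request id."""
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=gpu_worker, args=(gpu_id, tasks, results))
               for gpu_id in range(num_gpus)]
    for worker in workers:
        worker.start()
    for request_id, payload in enumerate(requests):
        tasks.put((request_id, payload))
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    collected = [results.get() for _ in requests]  # drain before joining
    for worker in workers:
        worker.join()
    return sorted(collected)
```

Calling `serve(["a.jpg", "bb.jpg", "ccc.jpg"], num_gpus=2)` under an `if __name__ == "__main__":` guard returns one `(request_id, gpu_id, result)` tuple per request. Which worker handles which request depends on scheduling: whichever process is free takes the next item, rather than following a strict 0, 1, 2, 3 rotation.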

shelhamer commented 7 years ago

Closing as NCCL + pycaffe #4563 is an effective approach to data parallel training of any kind of Caffe net. More involved forms of parallelism can be left to further efforts.

BVLC / caffe

Multi-GPU operation and data / model Parallelism #876