BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Parallelize Forward / Backward by Depth #547

Open shelhamer opened 10 years ago

shelhamer commented 10 years ago

Forward and Backward are done in sequence by layer ID at the moment. In principle, all Forward / Backward steps at the same depth in the DAG can be executed in parallel.

In DAG models where single-layer operations do not saturate the host / device, this should improve performance.

As I understand it, this would be done with batched cuBLAS calls and streams for parallel kernel execution at each depth in the model.
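
To make the idea concrete, here is a rough sketch of depth-wise execution (hypothetical names, not current Caffe code), assuming each layer's `Forward_gpu` only enqueues asynchronous kernels / cuBLAS calls on the handle it is given:

```cpp
// Sketch only: run all layers at the same DAG depth concurrently by
// round-robining them over a pool of CUDA streams / cuBLAS handles,
// then synchronizing before moving on to the next depth.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <map>
#include <vector>

struct Layer {
  // Assumed to only launch asynchronous work on the given handle's stream.
  virtual void Forward_gpu(cublasHandle_t handle) = 0;
  virtual ~Layer() {}
};

// Assumes streams.size() == handles.size().
void ForwardByDepth(const std::map<int, std::vector<Layer*> >& depth_to_layers,
                    std::vector<cudaStream_t>& streams,
                    std::vector<cublasHandle_t>& handles) {
  for (std::map<int, std::vector<Layer*> >::const_iterator it =
           depth_to_layers.begin(); it != depth_to_layers.end(); ++it) {
    const std::vector<Layer*>& layers = it->second;
    for (size_t i = 0; i < layers.size(); ++i) {
      const size_t s = i % streams.size();
      cublasSetStream(handles[s], streams[s]);  // queue handle s on stream s
      layers[i]->Forward_gpu(handles[s]);
    }
    cudaDeviceSynchronize();  // barrier: depth d must finish before d + 1
  }
}
```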

bhack commented 10 years ago

One design that this feature could speed up, I think, is the model in the diagram on page 13 of this publication: http://arxiv.org/abs/1312.6082v4

shelhamer commented 10 years ago

Pyramids and any model with late fusion [1, plus others and more to come] should likewise benefit.

[1] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei. Large-Scale Video Classification with Convolutional Neural Networks. CVPR 2014. http://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf

bhack commented 10 years ago

Fresh meat from CVPR :meat_on_bone:

sguada commented 10 years ago

Actually, any two non-overlapping paths could be run in parallel, even if they have different lengths.


shelhamer commented 10 years ago

@sguada right, advancing by depth covers that case too: execute in parallel depth by depth, and if any particular path finishes early, that's fine; just keep going until the deepest layer has executed. There's no requirement for equal path lengths.

There has to be some logic to decide the number of streams / handles, though. To start, this could simply be selected manually.
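
Something like the following (a sketch, not Caffe code) is all the setup a manually sized pool would need:

```cpp
// Sketch: a pool of CUDA streams and cuBLAS handles whose size is chosen
// by hand (e.g. a net/solver parameter) rather than inferred from the DAG.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

struct StreamPool {
  std::vector<cudaStream_t> streams;
  std::vector<cublasHandle_t> handles;

  explicit StreamPool(int num_streams)
      : streams(num_streams), handles(num_streams) {
    for (int i = 0; i < num_streams; ++i) {
      cudaStreamCreate(&streams[i]);
      cublasCreate(&handles[i]);
      cublasSetStream(handles[i], streams[i]);  // pin handle i to stream i
    }
  }
  ~StreamPool() {
    for (size_t i = 0; i < streams.size(); ++i) {
      cublasDestroy(handles[i]);
      cudaStreamDestroy(streams[i]);
    }
  }
};
```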

kloudkl commented 10 years ago

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. http://tez.incubator.apache.org/

bhack commented 10 years ago

@kloudkl I don't know whether GraphLab and GraphChi could also be useful: http://graphlab.org/projects/index.html http://docs.graphlab.org/index.html

shelhamer commented 10 years ago

A simple graph traversal to make a depth -> layers mapping should suffice for our purposes.
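
For example (a sketch, not actual Caffe code): since layer IDs are already in topological order, a single pass that assigns each layer a depth one greater than the deepest producer of its bottom blobs is enough:

```cpp
// Sketch of the depth -> layers mapping: a layer's depth is one more than
// the depth of the deepest producer among its bottom blobs. Layers are
// assumed to be in topological order, as Caffe's layer IDs already are.
#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::map<int, std::vector<int> > DepthToLayers(
    const std::vector<std::vector<std::string> >& bottoms,  // per-layer inputs
    const std::vector<std::vector<std::string> >& tops) {   // per-layer outputs
  std::map<std::string, int> blob_depth;  // depth at which each blob is produced
  std::map<int, std::vector<int> > depth_to_layers;
  for (int id = 0; id < static_cast<int>(bottoms.size()); ++id) {
    int depth = 0;  // layers with no bottoms (data layers) sit at depth 0
    for (size_t j = 0; j < bottoms[id].size(); ++j)
      depth = std::max(depth, blob_depth[bottoms[id][j]] + 1);
    for (size_t j = 0; j < tops[id].size(); ++j)
      blob_depth[tops[id][j]] = depth;
    depth_to_layers[depth].push_back(id);
  }
  return depth_to_layers;
}
```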

Thanks for the project pointers all the same.

bhack commented 10 years ago

Yes, but what kind of parallelization paradigm: multi-thread, multi-device, (long-term) distributed, or some combination of these?

sguada commented 10 years ago

My only concern with paths of different lengths is that one can be faster or slower than another, so computing by depth will make all the paths run at the speed of the longest. But it probably doesn't matter much if the paths take similar times, since they have to merge at some point and wait anyway.


shelhamer commented 10 years ago

@bhack

> Multi-thread, multi-device, (long-term) distributed, or some combination of these?

Our parallelization goal is entirely single node. We have single process multi-thread / multi-device parallelism in mind.

Distributed computation has its place, but in my opinion there's no point pursuing it while there are still important single node gains to be made.

Of course anyone is free to pursue whatever parallelization they want, but this is the present direction of the project.

@sguada

> But it probably doesn't matter much if the paths take similar times, since they have to merge at some point and wait anyway.

That was my thinking. We can always engage in fancier parallelization later if need be, but depth ordering should suffice.

kloudkl commented 10 years ago

Have you tried to parallelize on a multi-device node using NVBLAS (#194), which only requires dynamically linking the shared library?

shelhamer commented 10 years ago

@kloudkl no, because I want to control the communication and only distribute layer-wise. The only time a parallelized forward / backward pass needs to communicate the data / diff is when a DAG model forks. At that point one path can keep computing while the data / diff are communicated to devices for the other path, which "hides" the communication while useful work is done.
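
As a sketch of that overlap (a hypothetical helper, assuming peer access is enabled between the two devices): the fork-point blob is copied asynchronously on a dedicated stream while branch A keeps computing, and branch B waits only on the copy event.

```cpp
// Sketch: hide the fork-point transfer behind branch A's compute.
#include <cuda_runtime.h>

void ForkAndHideCopy(const float* blob_dev0, float* blob_dev1, size_t bytes,
                     cudaStream_t copy_stream, cudaStream_t branch_b_stream,
                     cudaEvent_t copy_done) {
  // Start the device 0 -> device 1 transfer; branch A's layers keep
  // launching on their own compute stream, so the copy overlaps real work.
  cudaMemcpyPeerAsync(blob_dev1, 1 /* dst device */,
                      blob_dev0, 0 /* src device */, bytes, copy_stream);
  cudaEventRecord(copy_done, copy_stream);
  // Branch B's first layer may only start once the blob has arrived.
  cudaStreamWaitEvent(branch_b_stream, copy_done, 0);
}
```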

It could be interesting to try distributing all BLAS operations with NVBLAS, but I expect it to not be worth the communication at standard input sizes. Worth noting all the same since only benchmarks will tell.


kloudkl commented 10 years ago

NVBLAS is such low-hanging fruit that it is really worth some benchmarks. But I won't have access to multi-GPU devices in the near future. I hope someone interested will be able to do so.
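
For whoever picks this up: since NVBLAS intercepts level-3 BLAS calls, even a minimal host-side GEMM like the sketch below (assuming a CBLAS library is installed), run with and without `LD_PRELOAD=libnvblas.so` (and an `nvblas.conf` pointing at the CPU BLAS), would answer the question:

```cpp
// Minimal GEMM benchmark. NVBLAS intercepts level-3 BLAS calls, so running
// this binary with libnvblas.so preloaded should offload the sgemm to the
// GPU(s) with no code changes; compare the wall times to judge the payoff.
#include <cblas.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const int n = 4096;
  std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);
  auto start = std::chrono::steady_clock::now();
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
              1.0f, a.data(), n, b.data(), n, 0.0f, c.data(), n);
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  std::printf("%d x %d sgemm: %.3f s\n", n, n, elapsed.count());
  return 0;
}
```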

shelhamer commented 10 years ago

I'm a little skeptical because it has a shared host memory model that doesn't mesh with Caffe's lazy allocation and communication-minimizing design. It doesn't seem like you can just give it a GPU memory pointer and accelerate away.

That said, I've only given a cursory look at cublasXT and would welcome example code and benchmarking that turn my impression on its head.
