BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

92% of cpu cycles in Forward() pass spent on caffe_set for Conv2D bias term #5440

Open Voax001 opened 7 years ago

Voax001 commented 7 years ago

GPU mode+CUDNN, caffe/windows (pulled and built 6 days ago: commit ca360a148f70a254a7246490eccc85905e75afa0).

The Forward() method of my network spends 92% of its time in the caffe_set(...) method:

[profiler screenshot: with_bias]

CuDNNConv2D::Reshape calls BaseConv2D::Reshape. At base_conv_layer.cpp:248:

if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
  caffe_set(bias_multiplier_.count(), Dtype(1),
      bias_multiplier_.mutable_cpu_data());
}

This sets a float for every pixel in the input, at math_functions.cpp:62 (in my case, N = 2073600, alpha = 1.0):

template <typename Dtype>
void caffe_set(const int N, const Dtype alpha, Dtype* Y) {
  // ...
  for (int i = 0; i < N; ++i) {
    Y[i] = alpha;
  }
}
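
As far as I can tell, the all-ones vector exists so that the base class can broadcast the per-channel bias over every spatial position with a single GEMM against bias_multiplier_. Roughly, in scalar form (a simplified sketch, not Caffe's actual code):

// Simplified sketch: the per-channel bias is broadcast over every spatial
// position as a rank-1 product, output[c][p] += bias[c] * multiplier[p],
// with multiplier[p] == 1 everywhere, which Caffe issues as one GEMM call.
void add_bias(float* output, const float* bias, const float* multiplier,
              int channels, int spatial_dim) {
  for (int c = 0; c < channels; ++c) {
    for (int p = 0; p < spatial_dim; ++p) {
      output[c * spatial_dim + p] += bias[c] * multiplier[p];
    }
  }
}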

Stack:

>   CaffeWrapper.dll!caffe::caffe_set<float>(const int N, const float alpha, float * Y) Line 62 C++
    CaffeWrapper.dll!caffe::BaseConvolutionLayer<float>::Reshape(const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & bottom, const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & top) Line 251  C++
    CaffeWrapper.dll!caffe::CuDNNConvolutionLayer<float>::Reshape(const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & bottom, const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & top) Line 94  C++
    CaffeWrapper.dll!caffe::Layer<float>::Forward(const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & bottom, const std::vector<caffe::Blob<float> *,std::allocator<caffe::Blob<float> *> > & top) Line 417 C++
    CaffeWrapper.dll!caffe::Net<float>::ForwardFromTo(int start, int end) Line 524  C++
    CaffeWrapper.dll!caffe::Net<float>::Forward(float * loss) Line 551  C++

If I disable the bias_term of my convolutional layers my network trains in 75% of the time and the Forward() method in my deployed network only takes 5% of the time it takes with bias_term. Surely this is not working as intended?

[profiler screenshot: without_bias]

At the very least, CuDNNConv2D could use caffe_gpu_set. But does the bias value really need to be expanded for every pixel? Surely it can be added in-place without expanding it first? Does CuDNNConv2D::Forward() really need to call Reshape() at all each frame, when the shape of the inputs has not changed?
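
For illustration, the first suggestion could look something like this in BaseConv2D::Reshape (just a sketch: caffe_gpu_set takes the same (N, alpha, Y) arguments as caffe_set, and the fill would stay on the CPU in CPU_ONLY builds):

if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
#ifndef CPU_ONLY
  // Sketch: fill the multiplier on the GPU instead of the CPU.
  caffe_gpu_set(bias_multiplier_.count(), Dtype(1),
      bias_multiplier_.mutable_gpu_data());
#else
  caffe_set(bias_multiplier_.count(), Dtype(1),
      bias_multiplier_.mutable_cpu_data());
#endif
}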

willyd commented 7 years ago

Any chance you can share your prototxt files so that I can try to reproduce this issue? Or better, can you see if you get the same issue with the mnist example?

Voax001 commented 7 years ago

Thanks for looking into this issue!

Here are my .prototxts for the deployed network:
With bias: deploy_bias.prototxt.txt
Without bias: deploy_no_bias.prototxt.txt

As you can see, it's a small network consisting only of conv2d's and relu's, which performs image filtering. This network uses an InputLayer as its data source: I write data directly to the mutable_gpu_data of the input layer and also retrieve the result on the GPU for visualization. This means the CPU does very little, which emphasises the issue, but it should still be noticeable with a regular image source.
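
For reference, the all-GPU input/output path looks roughly like this (names are illustrative, not my actual code):

#include <caffe/caffe.hpp>

// Rough sketch of the all-GPU path: fill the InputLayer's blob on the
// device, run the net, and read the result back as a device pointer.
void run_filter(caffe::Net<float>& net) {
  float* d_in = net.input_blobs()[0]->mutable_gpu_data();
  // ... write the decoded/rendered frame into d_in on the GPU ...
  (void)d_in;

  net.Forward();  // the Forward() call profiled above

  const float* d_out = net.output_blobs()[0]->gpu_data();  // consumed directly for visualization
  (void)d_out;
}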

I'll try to get the mnist sample up and running and do some profiling. I suspect it still wastes cycles in caffe_set, but these may be hidden by the sample doing other CPU work as well.

willyd commented 7 years ago

Are you running in Debug mode by any chance?

If you are using the VS profiling tools, I am not surprised that the caffe_set method takes most of the time, since most of your computations are carried out on the GPU.

If I disable the bias_term of my convolutional layers my network trains in 75% of the time and the Forward() method in my deployed network only takes 5% of the time it takes with bias_term. Surely this is not working as intended?

How do you time your training execution time?

Voax001 commented 7 years ago

For the deployed network I was running in Debug mode and using VS's profiling tools, yes. You're right that most of the work happens on the GPU, but that caffe_set method is still taking 23% of all cpu cycles (the program also does a bunch of rendering and video decoding etc) and appears to be an avoidable bottleneck. It fills the bias multiplier array on the CPU, so presumably that array is also copied to the GPU, which will severely impact bus bandwidth, since this happens on every Forward() call.

I'll try to compare the two networks on a release build asap.

To measure the training time, I simply ran both networks for 10,000 iterations of 32 images via jupyter notebook/pycaffe (release build) with a custom python layer as data source, and compared the wall-clock time.

willyd commented 7 years ago

For the deployed network I was running in Debug mode and using VS's profiling tools, yes. You're right that most of the work happens on the GPU, but that caffe_set method is still taking 23% of all cpu cycles (the program also does a bunch of rendering and video decoding etc) and appears to be an avoidable bottleneck.

It is indeed avoided in the BiasLayer where the caffe_set function is called only if the last element of the bias vector is not equal to one. See: https://github.com/weiliu89/caffe/blob/ssd/src/caffe/layers/bias_layer.cpp#L66
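Applied to BaseConvolutionLayer::Reshape, the same guard would look roughly like this (untested sketch, reusing the members from the snippet quoted above):

if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
  const int count = bias_multiplier_.count();
  // Only refill when the multiplier is not already all ones
  // (same trick as bias_layer.cpp: check the last element).
  if (count > 0 && bias_multiplier_.cpu_data()[count - 1] != Dtype(1)) {
    caffe_set(count, Dtype(1), bias_multiplier_.mutable_cpu_data());
  }
}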

You could submit a PR with a similar optimization for the conv layer. Why not include the InnerProduct layer while you are at it?

I'll try to compare the two networks on a release build asap.

It will be interesting to see if there is still such a huge difference between training with bias or without. I suspect that the difference might not be noticeable.

Anyway, it seems that the issue was already noticed some time ago.

https://groups.google.com/forum/#!topic/caffe-users/kM918UQ7H2Q

I leave it to the other @BVLC members to comment or suggest the best course of action.

Voax001 commented 7 years ago

Thanks for pointing me in the right direction. I fixed this specific performance issue for my application with the following patch: 0001-fixed-layers-always-reshaping-on-Forward-passes.patch.txt

When the Forward method reshapes, it stores the bottom shapes; on subsequent calls it compares the current bottom shapes to the stored ones and skips reshaping if they are equal. This reduces the amortized time my application spends inside the caffe api from ~25% to ~0.75%, because my input shape is constant.
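
Conceptually, the patch does something like this (simplified sketch, not the patch itself; the member and helper names are made up):

#include <vector>
#include <caffe/blob.hpp>

// Sketch: remember the bottom shapes from the last Reshape and skip the
// call when they have not changed. cached_bottom_shapes_ is illustrative.
std::vector<std::vector<int> > cached_bottom_shapes_;

bool BottomShapesChanged(const std::vector<caffe::Blob<float>*>& bottom) {
  if (cached_bottom_shapes_.size() != bottom.size()) return true;
  for (size_t i = 0; i < bottom.size(); ++i) {
    if (bottom[i]->shape() != cached_bottom_shapes_[i]) return true;
  }
  return false;
}

// Inside Layer::Forward(...), conceptually:
//   if (BottomShapesChanged(bottom)) {
//     Reshape(bottom, top);
//     cached_bottom_shapes_.clear();
//     for (size_t i = 0; i < bottom.size(); ++i)
//       cached_bottom_shapes_.push_back(bottom[i]->shape());
//   }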

I haven't built caffe with python support yet and therefore have been unable to test the impact on training times, but I expect the gain would be around 25% (the difference between training with and without bias for my network, measured with a prebuilt release version of pycaffe).

This solution still leaves a lot to be desired though: