Voax001 opened this issue 7 years ago
Any chance you can share your prototxt files so that I can try to reproduce this issue? Or better, can you see if you get the same issue with the mnist example?
Thanks for looking into this issue!
Here are my .prototxts for the deployed network:
With bias: deploy_bias.prototxt.txt
Without bias: deploy_no_bias.prototxt.txt
As you can see, it's a small network consisting only of convolutions and ReLUs that performs image filtering. The network uses an InputLayer as data source: I write data directly to the mutable_gpu_data of the input layer and also retrieve the result on the GPU for visualization. This means the CPU does very little, which emphasises the issue, but it should still be noticeable with a regular image source.
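For context, the data path looks roughly like this (a simplified sketch, not my exact code; the frame is assumed to already live in device memory):

```cpp
#include <caffe/caffe.hpp>
#include <cuda_runtime.h>

// Push a frame that is already in device memory straight into the net's
// input blob, run the net, and keep the result on the GPU for rendering,
// so no host<->device copies are needed for I/O.
void FilterFrame(caffe::Net<float>& net, const float* device_frame) {
  caffe::Blob<float>* input = net.input_blobs()[0];
  cudaMemcpy(input->mutable_gpu_data(), device_frame,
             input->count() * sizeof(float), cudaMemcpyDeviceToDevice);
  net.Forward();
  const float* result = net.output_blobs()[0]->gpu_data();  // stays on the GPU
  (void)result;  // handed off to the renderer in the real application
}
```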
I'll try to get the mnist sample up and running and do some profiling. I suspect it still wastes cycles in caffe_set, but those will be hidden by the sample doing other CPU work as well?
Are you running in Debug mode by any chance?
If you are using the VS profiling tools, I am not surprised that the caffe_set method takes most of the time, since most of your computation is carried out on the GPU.
> If I disable the bias_term of my convolutional layers my network trains in 75% of the time and the Forward() method in my deployed network only takes 5% of the time it takes with bias_term. Surely this is not working as intended?
How do you measure your training time?
For the deployed network I was running in Debug mode and using VS's profiling tools, yes. You're right that most of the work happens on the GPU, but that caffe_set call is still taking 23% of all CPU cycles (the program also does a bunch of rendering, video decoding, etc.) and appears to be an avoidable bottleneck. It fills the bias multiplier array on the CPU, so presumably that array is also copied to the GPU, which will severely impact bus bandwidth since it happens on every Forward() call.
I'll try to compare the two networks on a release build asap.
To measure the training time, I simply ran both networks for 10,000 iterations of 32 images via a jupyter notebook/pycaffe (release build) with a custom Python layer as data source, and compared the wall-clock time.
> For the deployed network I was running in Debug mode and using VS's profiling tools, yes. You're right that most of the work happens on the GPU, but that caffe_set call is still taking 23% of all CPU cycles (the program also does a bunch of rendering, video decoding, etc.) and appears to be an avoidable bottleneck.
It is indeed avoided in the BiasLayer, where the caffe_set function is called only if the last element of the bias multiplier is not already equal to one. See: https://github.com/weiliu89/caffe/blob/ssd/src/caffe/layers/bias_layer.cpp#L66
You could submit a PR with a similar optimization for the conv layer. Why not include the InnerProduct layer while you are at it?
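The change for the conv layer could look something like this (a sketch adapted from the BiasLayer guard, not an existing commit; the member names are the ones already used in BaseConvolutionLayer):

```cpp
// In BaseConvolutionLayer<Dtype>::Reshape: only refill the bias multiplier
// when it does not already hold all ones. If a previous caffe_set filled the
// whole buffer with ones, its last element is still one, so checking that
// single element is sufficient.
if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
  if (bias_multiplier_.cpu_data()[out_spatial_dim_ - 1] != Dtype(1)) {
    caffe_set(bias_multiplier_.count(), Dtype(1),
              bias_multiplier_.mutable_cpu_data());
  }
}
```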
> I'll try to compare the two networks on a release build asap.
It will be interesting to see whether there is still such a huge difference between training with and without bias. I suspect that the difference might not be noticeable.
Anyway, it seems that the issue was already noticed some time ago.
https://groups.google.com/forum/#!topic/caffe-users/kM918UQ7H2Q
I'll leave it to the other @BVLC members to comment or suggest the best course of action.
Thanks for pointing me in the right direction. I fixed this specific performance issue for my application with the following patch: 0001-fixed-layers-always-reshaping-on-Forward-passes.patch.txt
When the Forward() method reshapes, it stores the bottom blobs' shapes; on subsequent calls it compares the current bottom shapes to the stored ones and skips reshaping if they are equal. This reduces the amortized time my application spends inside the caffe API from ~25% to ~0.75%, because my input shape is constant.
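The core of the idea, paraphrased (the member name cached_bottom_shapes_ is illustrative; the actual patch differs in detail):

```cpp
// Inside Layer<Dtype>::Forward, before doing any work: compare the current
// bottom shapes with the ones seen on the previous call and only call
// Reshape when something actually changed.
vector<vector<int> > bottom_shapes(bottom.size());
for (size_t i = 0; i < bottom.size(); ++i) {
  bottom_shapes[i] = bottom[i]->shape();
}
if (bottom_shapes != cached_bottom_shapes_) {
  Reshape(bottom, top);
  cached_bottom_shapes_ = bottom_shapes;
}
```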
I haven't built caffe with Python support yet and therefore have been unable to test the impact on training times, but I expect the gain would be around 25% (the difference I measured with a prebuilt release version of pycaffe between training with and without bias for my network).
This solution still leaves a lot to be desired though:
GPU mode+CUDNN, caffe/windows (pulled and built 6 days ago: commit ca360a148f70a254a7246490eccc85905e75afa0).
The Forward() method of my network spends 92% of its time in the caffe_set(...) method:
CuDNNConvolutionLayer::Reshape calls BaseConvolutionLayer::Reshape, which re-fills the bias multiplier at base_conv_layer.cpp:248:
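The relevant code is roughly this (paraphrased from the Caffe source; exact lines may differ):

```cpp
// BaseConvolutionLayer<Dtype>::Reshape: the bias multiplier is re-filled
// with ones on every call, even when nothing about the shapes has changed.
if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
  caffe_set(bias_multiplier_.count(), Dtype(1),
            bias_multiplier_.mutable_cpu_data());
}
```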
That caffe_set call writes a float for every pixel in the input, at math_functions.cpp:62 (in my case, N = 2073600, alpha = 1.0):
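And caffe_set itself is a plain CPU loop (again paraphrased from the Caffe source):

```cpp
// math_functions.cpp: writes alpha into every element on the host,
// here roughly two million floats per Forward() call.
template <typename Dtype>
void caffe_set(const int N, const Dtype alpha, Dtype* Y) {
  if (alpha == 0) {
    memset(Y, 0, sizeof(Dtype) * N);  // fast path for zero
    return;
  }
  for (int i = 0; i < N; ++i) {
    Y[i] = alpha;
  }
}
```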
Stack:
If I disable the bias_term of my convolutional layers my network trains in 75% of the time and the Forward() method in my deployed network only takes 5% of the time it takes with bias_term. Surely this is not working as intended?
At the very least, CuDNNConvolutionLayer could use caffe_gpu_set. But does the bias value really need to be expanded for every pixel? Surely it can be added in place without expanding it first? And does CuDNNConvolutionLayer::Forward() really need to call Reshape() at all on every frame, when the shape of the inputs has not changed?
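For the first suggestion, the change would be as small as filling the multiplier on the device (a sketch, assuming the layer is running in GPU mode; it still does redundant work every frame, but it avoids the CPU loop and the later host-to-device copy):

```cpp
// In the conv layer's Reshape, GPU build: fill the bias multiplier directly
// in device memory instead of on the host.
if (bias_term_) {
  vector<int> bias_multiplier_shape(1, out_spatial_dim_);
  bias_multiplier_.Reshape(bias_multiplier_shape);
  caffe_gpu_set(bias_multiplier_.count(), Dtype(1),
                bias_multiplier_.mutable_gpu_data());
}
```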