facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai

Very Slow run time in C++ #921

Open ugionet opened 7 years ago

ugionet commented 7 years ago

Hello, I have been trying to run a network that was trained in Caffe1 with the new Caffe2 framework, and I have encountered a very disturbing issue regarding the run time in Caffe2 relative to Caffe1. I converted the model trained in Caffe1 to Caffe2 using the caffe_translator.py script, which produced init_net.pb and predict_net.pb files for Caffe2.

I timed (in C++) how long it takes the network to run on an image (caffe2::TensorCPU / caffe2::TensorCUDA) using the following Caffe2 C++ call: predict_net->Run();

Here, predict_net is a std::unique_ptr whose net takes a TensorCPU/TensorCUDA as input. I did the same thing with Caffe1 and timed its corresponding run function. To be extra sure, I also timed the results in Python (for Caffe2), with the same input of course.
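For reference, a minimal timing sketch of the pattern described above, assuming predict_net is an already-created std::unique_ptr<caffe2::NetBase> whose input blob has been filled beforehand (the helper name and include path are illustrative):

```cpp
#include <chrono>
#include <caffe2/core/net.h>

// Minimal sketch: time one forward pass of an already-built net whose
// input blob has been filled with image data beforehand.
double TimeForwardPassMs(caffe2::NetBase* predict_net) {
  auto start = std::chrono::high_resolution_clock::now();
  predict_net->Run();
  // Note: with CUDA nets you may want to synchronize the device here
  // before stopping the clock, or kernels may still be in flight.
  auto stop = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```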

All the outputs I received (when running on CPU/CUDA) with Caffe2 were exactly identical to the outputs I received from Caffe1, with the same input of course. Also, all the C++ processes ran without any debug options, i.e., with the code-optimization run-time flag set to Maximized Speed (/O2). All the C++ code (for both Caffe1 and Caffe2) was of course run in the Release configuration.

These are the disturbing results:

- Caffe1 (C++): 11.35 ms
- Caffe2 CPU (C++): 200.56 ms
- Caffe2 CUDA (C++): 308.72 ms
- Caffe2 (Python): 102.36 ms

Also, I ran a profiler to make sure that the functions I am executing (both CUDA and CPU) map to the appropriate C++ functions, and they do: TensorCUDA runs everything through CUDA and TensorCPU runs everything on the CPU, which is exactly what should happen.

How is this even possible? Has anyone else encountered this sort of issue? Please help!

Thanks a lot in advance !

peaceorwell commented 7 years ago

I ran into this problem too. These are my times on AlexNet: Caffe1 (C++): 410 ms; Caffe2 (C++): 3.2 s. I think the math library may be different.

ugionet commented 7 years ago

@ZhouYuSong What platform are you running on? Also, which third-party dependencies did you build your solution with (e.g., OpenCV, LevelDB, LMDB, gflags, glog)?

I am running on Windows 10 64-bit; I built the solution using CMake with OpenCV 3.2 and gflags. By the way, I get accurate output results: a test image run through my network produces the same outputs in both Caffe1 and Caffe2.

peaceorwell commented 7 years ago

I tested on Ubuntu 14.04 x64 with OpenCV 2.6, gflags, and so on.

ugionet commented 7 years ago

OK, I GOT it! It appears that in C++ one must specify which engine should run each operator (this is true in Python as well). To use the CUDNN engine, for example, one should use the following code:

```cpp
// Ask Caffe2 to use the cuDNN engine for every operator in the net.
for (caffe2::OperatorDef& grad_def : *model.mutable_op()) {
  grad_def.set_engine("CUDNN");
}
```

Also, one must use the caffe2::DeviceOption class to tell Caffe2 that we want to use the CUDA-capable GPU device, by specifying the device_type as well as the cuda_gpu_id.
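A minimal sketch of that step, assuming `model` is the same caffe2::NetDef as in the loop above (the enum and field names follow the old caffe2.proto and may differ between Caffe2 versions):

```cpp
#include <caffe2/proto/caffe2.pb.h>

// Sketch: point the whole net at one CUDA device. `model` is assumed to be
// the same caffe2::NetDef as above; operators inherit the net's device option.
void UseCudaDevice(caffe2::NetDef& model, int gpu_id) {
  caffe2::DeviceOption device_option;
  device_option.set_device_type(caffe2::CUDA);  // caffe2::CPU for CPU runs
  device_option.set_cuda_gpu_id(gpu_id);        // index of the CUDA device
  model.mutable_device_option()->CopyFrom(device_option);
}
```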

Also, one very important thing I discovered: on the first run with the CUDNN engine, predict_net.pb initializes many cuDNN processes and allocates the corresponding memory, which significantly increases that run's time. There is a very simple solution: run the prediction once on some dummy data (e.g., a matrix of zeros) first, and only then feed the actual image (or video) through the network.
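A minimal warm-up sketch along those lines; the blob name "data", the 1x3x224x224 shape, and the helper name are placeholders to adapt, and `workspace` is assumed to be the caffe2::Workspace the net was created in:

```cpp
#include <algorithm>
#include <vector>
#include <caffe2/core/workspace.h>
#include <caffe2/core/context_gpu.h>

// Warm-up sketch: pay the one-time cuDNN setup/allocation cost on a dummy
// zero input so that later timed runs measure only the forward pass.
void WarmUp(caffe2::Workspace& workspace, caffe2::NetBase* predict_net) {
  caffe2::TensorCPU dummy(std::vector<caffe2::TIndex>{1, 3, 224, 224});
  float* raw = dummy.mutable_data<float>();
  std::fill(raw, raw + dummy.size(), 0.0f);  // matrix of zeros

  workspace.CreateBlob("data")
      ->GetMutable<caffe2::TensorCUDA>()
      ->CopyFrom(dummy);
  predict_net->Run();  // first, untimed run: cuDNN init happens here
}
```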

By following all of these essential steps I managed to decrease the Caffe2 run times significantly. These are the new run times I achieved on the following hardware: i7-6700 @ 3.40 GHz; NVIDIA GTX 1060; 32 GB RAM.

The network I used is the same as in my original post:

- Caffe2 (C++): 4.5 ms
- Caffe2 (Python): 5.5 ms
- Caffe1 (C++): 3.3 ms

By the way, for both Caffe1 and Caffe2 I used CUDA 8 and cuDNN 5.1. This is much better, of course, but something is still missing. Can anybody help here?

Thanks a lot,

peaceorwell commented 7 years ago

Your work is very nice. But why is Caffe2 still slower than Caffe? Do you have up-to-date times for Caffe on CPU and Caffe2 on CPU?

ugionet commented 7 years ago

That is exactly my question as well. I do not have a good reason to run the network on the CPU, since I have a CUDA-capable device, but I can easily perform the same tests on the CPU.
One other thing: I did not manage to run any network with FP16 (float16) in Caffe2. Has anyone managed to do so? (I do not mean training, only running a network on some arbitrary data.) It does not work in Python or in C++. I believe that if we were able to run the network with the FP16 data type, performance would improve greatly, but it is not possible at the moment.

ugionet commented 7 years ago

OK, I have conducted the requested analysis on GoogLeNet (the original network). These are the results:

- Caffe2 (GPU): ~15.2 ms
- Caffe2 (CPU): ~186.1 ms
- Caffe1 (GPU): ~2.7 ms
- Caffe1 (CPU): ~111.0 ms

I timed both Caffe1 and Caffe2 run times with the same approach (using OpenCV's cv::getTickCount). For a fair comparison I timed only the time it takes the network to predict some input (the same input for both Caffe1 and Caffe2), i.e., Caffe2: predict_net->Run(); and Caffe1: net->Forward();

where, in Caffe1, net is of type caffe::Net, and in Caffe2, predict_net is of type caffe2::NetBase.

Both run times were of course measured on the same PC.
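For clarity, a sketch of the tick-count timing pattern described above; the helper name is illustrative, and the Caffe2/Caffe1 calls are the ones quoted:

```cpp
#include <opencv2/core/core.hpp>
#include <caffe2/core/net.h>

// Sketch of the OpenCV tick-count timing pattern used for both frameworks.
double TimeRunMs(caffe2::NetBase* predict_net) {
  int64 t0 = cv::getTickCount();  // OpenCV's 64-bit tick counter
  predict_net->Run();             // Caffe1 equivalent: net->Forward();
  int64 t1 = cv::getTickCount();
  return (t1 - t0) * 1000.0 / cv::getTickFrequency();  // ticks -> ms
}
```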

peaceorwell commented 7 years ago

Thanks for your work; it's very helpful. As for why Caffe2 is slow, I think Caffe2's call stack is deeper than Caffe's, so the total time is longer.

ugionet commented 7 years ago

OK... so what can we do to improve that?

ugionet commented 7 years ago

By using the MKL library in the compilation, I managed to reduce the run time (in C++ on CPU) to 70 ms with GoogLeNet, which is about 45 ms faster than Caffe. That is great! Unfortunately, the GPU run time in Caffe2 remains inferior to Caffe's. I will try to update my cuDNN library and hopefully the run times will improve, though I think the CUDA GPU code in Caffe2 needs more work: the convolution (and max-pooling) operators appear to be inefficient, to say the least. I will keep updating this thread, of course.

yxchng commented 7 years ago

@ugionet Have you found out the reason? Or is Caffe2 really slower?

ugionet commented 7 years ago

@yxchng Unfortunately, no. Neither cuDNN 7 nor any other trick I tried improved the run-time performance, and as you may have noticed, this thread has not received any official response. I have given up on Caffe2 for now and moved to TensorFlow, which by the way also shows inferior run-time performance compared to Caffe, but it is not as bad.

dlwtojd26 commented 7 years ago

@ugionet Hi, did you build Caffe2 with openmp=ON? I tried testing on CPU, but my run-time results were very bad.