facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

How to enable NNPACK #343

Closed kindloaf closed 7 years ago

kindloaf commented 7 years ago

I'm trying to run inference on Android with NNPACK enabled. I did the following, but didn't notice any performance difference: (1) made sure USE_NNPACK is ON in CMakeLists.txt; (2) took the .pb files from the model zoo and converted them to use NNPACK, according to this page. I saw that init_net.pb wasn't changed, but predict_net.pb was. The inference time of bvlc_googlenet on my device was the same before and after conversion. Am I missing something?
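For what it's worth, the conversion step essentially rewrites the `engine` argument on the relevant ops in predict_net.pb, which is why init_net.pb (it holds only the weights) comes out unchanged. A minimal sketch of the idea in plain Python, using dicts in place of caffe2's OperatorDef protobuf messages (the field names and the set of convertible ops here are assumptions for illustration, not the real caffe2 API):

```python
# Illustrative sketch only: plain dicts stand in for caffe2 OperatorDef
# messages; the real conversion script edits the protobuf the same way.

CONVERTIBLE = {"Conv"}  # assumption: op types with an NNPACK implementation

def convert_to_nnpack(ops):
    """Return a copy of the predict_net op list with NNPACK selected."""
    out = []
    for op in ops:
        op = dict(op)
        if op["type"] in CONVERTIBLE:
            op["engine"] = "NNPACK"  # same effect as op.set_engine("NNPACK")
        out.append(op)
    return out

net = [{"type": "Conv", "engine": ""}, {"type": "Softmax", "engine": ""}]
converted = convert_to_nnpack(net)
print(converted[0]["engine"])  # NNPACK
print(converted[1]["engine"])  # unchanged: ""
```

Note that this only changes which kernel is *requested*; the binary still has to be built with NNPACK for the request to take effect.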

kindloaf commented 7 years ago

Update: I just realized that I had previously used a binary compiled without NNPACK. Now I'm using the right binary, but the network with NNPACK is slower (inference time is 3x that of the network without NNPACK). Any advice?

Yangqing commented 7 years ago

Hmm, it might be because NNPACK was not built with optimized flags. Since NNPACK is C code, you may want to change CMAKE_C_FLAGS in addition to CMAKE_CXX_FLAGS, as you mentioned in the other issue. Want to give that a shot?

Yangqing commented 7 years ago

Also, you might want to check whether the default build script targets your platform: the default is `-march=armv7-a -mfloat-abi=softfp -mfpu=neon`, and if you are on aarch64 you may want to change it.
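To make the two comments above concrete, here is a small sketch of choosing the flag settings per target. The armv7 values are quoted from the comment above; the aarch64 values are my assumption (NEON is mandatory on AArch64 and `-mfpu` does not apply there), and `-O2` stands in for whatever optimization level your build uses:

```python
# Sketch: per-architecture optimization flags for the NNPACK build.
# Both C and C++ flags are set, since NNPACK is C code and setting only
# CMAKE_CXX_FLAGS would leave it unoptimized.

def nnpack_build_flags(arch):
    if arch == "armv7":
        # quoted from the thread above
        march = "-march=armv7-a -mfloat-abi=softfp -mfpu=neon"
    elif arch == "aarch64":
        # assumption: NEON is implied on AArch64, no -mfpu/-mfloat-abi needed
        march = "-march=armv8-a"
    else:
        raise ValueError(f"unhandled arch: {arch}")
    opt = f"-O2 {march}"
    return {"CMAKE_C_FLAGS": opt, "CMAKE_CXX_FLAGS": opt}

print(nnpack_build_flags("aarch64")["CMAKE_C_FLAGS"])
```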

power0341 commented 7 years ago

Hi @kindloaf, @Yangqing, could you tell me how to properly convert a pretrained model to use NNPACK? I built an iOS demo and manually called op.set_engine("NNPACK") for the Conv layers; however, when I traced the function calls, I saw that the conv ops were still executed by Eigen. [screenshot: call trace] Did I miss something? By the way, I used the stock Predictor class in a C++ context, and it never seemed to switch to another engine, neither CUDNN nor NNPACK.

kindloaf commented 7 years ago

@power0341 Did you do the following? (1) compile the binary with NNPACK enabled (make sure the switch is on in CMakeLists.txt); (2) use the Python script on this page to convert the pre-trained model to use the "NNPACK" and "BLOCK" engines.

power0341 commented 7 years ago

@kindloaf I did switch on the NNPACK option in CMakeLists.txt. This led to two additional libraries, "libCAFFE2_NNPACK.a" and "libCAFFE2_PTHREADPOOL.a", being created.

As for the second step, which is the one that confused me most, I did it in C++ source code: specifically, I loaded the NetDef predict_net, set the engines of the convolution layers to "NNPACK", and then created the net as a Predictor object. Printing predictor->def().DebugString() shows:

  arg {
    name: "engine"
    s: "NNPACK"
  }


I believe this is equivalent to the Python version. Any suggestions?
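One possible explanation for still seeing Eigen calls (this is my assumption about caffe2's engine dispatch, sketched here as a toy Python model rather than the real C++ registry): operator creation looks up a type-plus-engine key in a registry that is populated at compile time, and silently falls back to the default kernel when the engine-specific variant was not compiled into the binary. If so, the `engine: "NNPACK"` argument alone cannot help unless the binary actually contains the NNPACK kernels.

```python
# Toy model of engine dispatch with silent fallback (illustrative only).
# A binary built without NNPACK never registers the "Conv_NNPACK" kernel.
REGISTRY = {"Conv": "eigen_conv_kernel"}  # default engine only

def resolve(op_type, engine=""):
    key = f"{op_type}_{engine}" if engine else op_type
    # fall back to the default kernel when the engine variant is missing
    return REGISTRY.get(key, REGISTRY[op_type])

print(resolve("Conv", "NNPACK"))  # eigen_conv_kernel: silent fallback
REGISTRY["Conv_NNPACK"] = "nnpack_conv_kernel"  # as if built with NNPACK
print(resolve("Conv", "NNPACK"))  # nnpack_conv_kernel
```

Under this reading, the DebugString looking correct is expected: the request is recorded in the net, but the fallback happens later, at operator-creation time.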

kindloaf commented 7 years ago

@power0341 From the debug string, it seems right to me. Not sure why it's still using Eigen...

raininglixinyu commented 7 years ago

@power0341 Hi, how did you set the engine to "NNPACK" in the c++ source code? I am trying to do this too. Could you share your code?

raininglixinyu commented 7 years ago

@kindloaf I also observed that the network with NNPACK is slower. Did you have any progress on that?

kindloaf commented 7 years ago

Hi @raininglixinyu I don't have any update on this.

zfc929 commented 7 years ago

@kindloaf @Yangqing @power0341 Hi, how did you set the engine to "NNPACK" in the c++ source code? I am trying to do this too. Could you share your code?

zfc929 commented 7 years ago

#726 solved the problem.

HyunjunShin commented 6 years ago

I missed the "--recursive" option when I cloned the repo. You should run "git clone --recursive https://github.com/caffe2/caffe2.git" to also clone the third-party repos in "https://github.com/caffe2/caffe2/tree/master/third_party".

Hope it works.

YueshangGu commented 5 years ago

I can run inference with NNPACK as the engine, but how can I set the thread pool size of NNPACK, which defaults to the total number of CPU cores? I don't want to use all the cores for this test. What should I do?
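I'm not sure which caffe2 knob (if any) exposes this, but the general pattern is to pass an explicit worker count instead of letting the pool default to the core count. A generic pure-Python illustration of that pattern (this is the Python standard library, not the caffe2/NNPACK API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def make_pool(num_threads=None):
    # Default mirrors NNPACK's behaviour: one worker per CPU core.
    # Pass num_threads to cap the pool below the core count.
    return ThreadPoolExecutor(max_workers=num_threads or os.cpu_count())

pool = make_pool(num_threads=2)  # use 2 threads instead of all cores
pool.shutdown()
```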