facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0
8.42k stars 1.94k forks

caffe2 multicore CPU not work when blas(eigen or openblas) do matrix multiply #1245

Open ElevenDreamer opened 7 years ago

ElevenDreamer commented 7 years ago

Hi, I have run into a problem using Caffe2 on a Pine64 (Ubuntu 16.04.1, 4-core Cortex-A53 CPU). In Caffe2, matrix multiplication through Eigen does not use multiple cores (the same happens with OpenBLAS). I compile with -fopenmp and also export OMP_NUM_THREADS=4. I use an AlexNet model for a prediction test. When I profile the program, the FC (fully connected) layers take most of the time. During the matrix multiply only one core is used, even though four are available.

As a test, I ran a matrix multiply using only Eigen. Code:

    MatrixXf m1 = MatrixXf::Random(1, 9000);
    MatrixXf m2 = MatrixXf::Random(9000, 4000);
    MatrixXf m3(1, 4000);
    int n = Eigen::nbThreads();
    int i = 0;
    for (; i < 10; ++i) {
        cout << "n thread=" << n << endl;
        m3 = m1 * m2;
    }

In this standalone test, multiple cores are used, as shown here: (image)

But when I use the same code inside Caffe2, by adding it to the Gemm function in math_cpu.cc and running full_connected_op_test, only one core is used, as shown here: (image)

Why can't the same code use multiple cores inside Caffe2?

I cannot find any other way to resolve the problem. Could anyone give me a hint? Thanks.
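One way to narrow this down is to print what OpenMP itself reports at the exact spot where the matrix product runs, since Eigen::nbThreads() only shows Eigen's own setting. The following is a minimal sketch, not code from Caffe2: the helper names (omp_max_threads, omp_observed_threads, report_omp_threads) are hypothetical, and it assumes the translation unit is built with -fopenmp (without it, the helpers simply report 1).

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the team size OpenMP would use for the next parallel region.
// Returns 1 when this file is compiled without OpenMP support.
int omp_max_threads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

// Number of threads that actually execute a parallel region right now.
int omp_observed_threads() {
  int observed = 1;
#ifdef _OPENMP
#pragma omp parallel
  {
    // Only one thread writes the result; the implicit barrier at the end
    // of "single" makes the value visible after the region.
#pragma omp single
    observed = omp_get_num_threads();
  }
#endif
  return observed;
}

// Hypothetical helper: call it next to the matrix product being investigated.
void report_omp_threads(const char* where) {
  std::printf("[%s] omp max threads=%d, observed=%d\n",
              where, omp_max_threads(), omp_observed_threads());
}
```

Calling report_omp_threads("standalone") in the Eigen-only test and report_omp_threads("Gemm") inside the modified Gemm function would show whether the two processes see different OpenMP settings. If both report 4 but only one core is busy, the cause is more likely in how the product is dispatched than in the thread count.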

Yangqing commented 7 years ago

Could you turn on GLOG_v=1 and run the Caffe2 binary again? It might be that caffe2's omp initialization code is setting omp threads back to 1:

https://github.com/caffe2/caffe2/blob/master/caffe2/core/init_omp.cc
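To make the hypothesis above concrete, here is a small model of how an initialization step could end up forcing a single thread. It is a sketch under stated assumptions, not a copy of init_omp.cc: the function name resolve_omp_threads and its parameters are hypothetical, and the rule "fall back to 1 when OMP_NUM_THREADS is unset, let an explicit flag override everything" is an assumption based on the comment above and on the --caffe2_omp_num_threads flag mentioned later in this thread.

```cpp
// Hypothetical model of an init step that decides the OpenMP thread count.
//   env_value:  contents of OMP_NUM_THREADS, or nullptr if it is unset
//   flag_value: value of a command-line override; <= 0 means "not given"
//   current:    thread count the OpenMP runtime would use if left alone
int resolve_omp_threads(const char* env_value, int flag_value, int current) {
  int threads = current;
  if (env_value == nullptr) {
    // Assumption: with no explicit request, multithreading is disabled.
    threads = 1;
  }
  if (flag_value > 0) {
    // Assumption: an explicit flag wins over the environment.
    threads = flag_value;
  }
  return threads;
}
```

Under this model, a process that does not inherit OMP_NUM_THREADS (for example because the variable was exported in a different shell) would run with one thread even on a 4-core board, which is why checking the init log with GLOG_v=1 is a useful first step.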

ElevenDreamer commented 7 years ago

@Yangqing Thanks for the reply, and sorry for not responding sooner. GLOG_v is already turned on. The default init log is shown below: (image) The OMP_NUM_THREADS environment variable: (image)

But the problem still exists.

Another thing we found while testing Caffe2: in the Python test code, when we change the --caffe2_omp_num_threads parameter passed to workspace.GlobalInit(), the convolution layers do use multiple cores, and the run time decreases when we set caffe2_omp_num_threads=4. However, the fully connected layer does not change. I suspect the BLAS library cannot use multiple cores for matrix multiplication inside Caffe2, even though multicore may work in other parts of the Caffe2 code.

Thanks for your time.

dlwtojd26 commented 7 years ago

@ElevenDreamer Hi, did you solve this problem? I have the same problem. If you have solved it, could you give me a hint? Thanks.

ElevenDreamer commented 7 years ago

@dlwtojd26 Not yet. I tried several things but could not solve it. I am waiting for an official reply. For now we are using TF and PyTorch instead.

dlwtojd26 commented 7 years ago

@ElevenDreamer Hmm, sad news :< Thanks for your reply. If I figure out how to handle it, I'll reply.