Maratyszcza / caffe-nnpack

Caffe with NNPACK integration

Not using multiple cores? #3

Closed shihenw closed 8 years ago

shihenw commented 8 years ago

Hi @Maratyszcza,

I am running VGG-19 with caffe-nnpack on a machine with 2 Intel Xeon E5-2660 v3 Haswell 2.6 GHz CPUs (20 cores total). I was able to get results similar to the numbers shown in README.md (timed with CPUTimer in caffe/util/benchmark):

conv3_1: 162251 us
conv3_2: 329318 us
conv4_1: 177468 us
conv4_2: 392127 us

The speedup is fantastic. However, in top, CPU% never exceeded 100% during the run. If all cores were fully utilized, shouldn't I see a number significantly larger than 100%, like 2000% on this machine?

Maratyszcza commented 8 years ago

You can set OMP_NUM_THREADS to the number of threads you want to use (NNPACK doesn't use OpenMP, but we decided to use this environment variable for compatibility with other Caffe backends). However, keep in mind that currently NNPACK doesn't scale well beyond 8 threads.
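For reference, setting the variable looks like this (the Caffe invocation mentioned in the comment is illustrative, not taken from this thread):

```shell
# NNPACK doesn't use OpenMP, but it reads this variable for
# compatibility with other Caffe backends. Set it in the shell
# before launching Caffe (e.g. `./build/tools/caffe time ...`,
# where the binary path is a placeholder).
export OMP_NUM_THREADS=8
echo "$OMP_NUM_THREADS"
```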

shihenw commented 8 years ago

Hmm, that doesn't seem to work. I set export OMP_NUM_THREADS=4 and reran the program, but the CPU% is unchanged. I also couldn't find OMP_NUM_THREADS anywhere by grepping NNPACK or caffe-nnpack.

Maratyszcza commented 8 years ago

It is in this commit

Maratyszcza commented 8 years ago

It seems nnpack-pr changed this. The new logic is here: https://github.com/ajtulloch/caffe/pull/1/files#diff-606f9e31983814fd9cfbae12c21bba96R11

shihenw commented 8 years ago

I tried multiple threads by hard-coding the number in nnpack_pool.h, but using 2 threads is slower than using 1. Is this because I am using OpenBLAS instead of Intel MKL (as the variable name num_mkl_threads suggests)?

Maratyszcza commented 8 years ago

@shihenw This is not what I expected. The only explanation I can think of is that your convolutional layer is very small and the latency of waking up worker threads is comparable to the multi-threaded computation itself (when 1 thread is used, nnpack-pr doesn't compute on the caller thread).

It is certainly not related to the use of OpenBLAS vs. MKL.

shihenw commented 8 years ago

Hi @Maratyszcza, I got some speedup with multiple threads, and the speedup becomes more pronounced as the workload gets heavier (larger batch size). Here's what I got experimenting on my own network:

(attached chart: cpu_batch)

YueshangGu commented 5 years ago

How can I set the thread pool size in caffe2? @Maratyszcza @shihenw