apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

mxnet-mkldnn (v1.2.0) does not use all CPU cores on machines without hyperthreading #10836

Open XiaotaoChen opened 6 years ago

XiaotaoChen commented 6 years ago

Here is my CPU info:

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                28
    On-line CPU(s) list:   0-27
    Thread(s) per core:    1
    Core(s) per socket:    14
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 79
    Model name:            Intel(R) Xeon(R) CPU E5-2690 v4@ 2.60GHz
    Stepping:              1
    CPU MHz:               3499.539
    BogoMIPS:              5205.87
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              35840K
    NUMA node0 CPU(s):     0-13
    NUMA node1 CPU(s):     14-27

Clearly, each core has only one thread; hyperthreading is disabled.

Then I ran benchmark_score.py; the results are as follows:

    INFO:root:batch size 16, image/sec: 82.011372
    INFO:root:batch size 32, image/sec: 86.430563
    INFO:root:batch size 64, image/sec: 90.050148
    INFO:root:batch size 128, image/sec: 90.267582
    INFO:root:batch size 256, image/sec: 90.518156

Here is the per-core utilization (from top):

Threads:  20 total,  14 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.2 us,  0.9 sy,  0.0 ni, 50.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13191868+total, 12510168+free,  5682128 used,  1134872 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 12540582+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                          nTH  P 
29480 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.55 python                                            20 16 
29482 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.55 python                                            20 12 
29483 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.55 python                                            20 17 
29484 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.55 python                                            20 18 
29486 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.53 python                                            20  8 
29490 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.53 python                                            20 10 
29491 daodao    20   0 4993856 1.786g  33724 R 99.9  1.4   0:37.53 python                                            20 21 
29478 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.37 python                                            20 27 
29479 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.54 python                                            20 15 
29481 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.54 python                                            20  6 
29485 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.54 python                                            20  9 
29487 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.53 python                                            20 20 
29488 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.53 python                                            20 19 
29489 daodao    20   0 4993856 1.786g  33724 R 99.7  1.4   0:37.53 python                                            20 11 
29468 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:01.59 python                                            20  3 
29471 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:00.00 python                                            20  1 
29474 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:00.00 python                                            20 15 
29475 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:00.00 python                                            20 16 
29476 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:00.00 python                                            20  4 
29477 daodao    20   0 4993856 1.786g  33724 S  0.0  1.4   0:00.00 python                                            20 17 

It shows that only 14 cores are used, by 14 threads. Those 14 threads are created by MKL-DNN (I guess); the other threads, which use almost no CPU, are created by the engine and other components of MXNet.

Analysis

MXNet treats all machines as if hyperthreading were enabled. However, a CNN is a computationally intensive workload: hyperthreading cannot further increase the effective compute throughput, but it does add overhead. So MXNet creates only half as many threads as there are logical CPUs, and the OS then schedules each thread onto an independent physical core, avoiding that extra cost.

And the suggestion (https://zh.mxnet.io/blog/mkldnn) to set OMP_NUM_THREADS=vCPUs/2 exists precisely to avoid hyperthreading:

export KMP_AFFINITY=granularity=fine,compact,1,0
export vCPUs=`cat /proc/cpuinfo | grep processor | wc -l`
export OMP_NUM_THREADS=$((vCPUs / 2))

According to my CPU info, each core has only one thread; there is no hyperthreading. Even if 28 threads are created, each will run on its own physical core, so they will not compete for resources the way hyperthreaded siblings do. Using all CPU cores should therefore improve efficiency.
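This check can be automated. Below is a minimal Python sketch (mine, not part of MXNet; Linux-only, assumes /proc/cpuinfo exists) that compares the logical and physical core counts to decide whether hyperthreading is active:

```python
import os

def physical_core_count():
    """Count unique (physical id, core id) pairs in /proc/cpuinfo (Linux only)."""
    cores = set()
    phys = core = None
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
            elif line.startswith("core id"):
                core = line.split(":")[1].strip()
                cores.add((phys, core))
    # Fall back to the logical count if topology info is missing (e.g. some VMs).
    return len(cores) or os.cpu_count()

logical = os.cpu_count()
physical = physical_core_count()
print("logical:", logical, "physical:", physical,
      "hyperthreading:", logical > physical)
```

On the machine above this would report 28 logical and 28 physical CPUs, i.e. no hyperthreading, so all 28 cores can safely be used.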

Solution

The engine component of mxnet-mkldnn (v1.2.0) added some functions to set the OpenMP thread count in the openmp.h/openmp.cc files. The relevant code is in the constructor of OpenMP in openmp.cc:

OpenMP::OpenMP()
  : omp_num_threads_set_in_environment(is_env_set("OMP_NUM_THREADS")) {
#ifdef _OPENMP
  const int max = dmlc::GetEnv("MXNET_OMP_MAX_THREADS", INT_MIN);
  if (max != INT_MIN) {
    omp_thread_max_ = max;
  } else {
    if (!omp_num_threads_set_in_environment) {
      omp_thread_max_ = omp_get_num_procs();
#ifdef ARCH_IS_INTEL_X86
      omp_thread_max_ >>= 1;
#endif
      omp_set_num_threads(omp_thread_max_);
    } else {
      omp_thread_max_ = omp_get_max_threads();
    }
  }
#else
  enabled_ = false;
  omp_thread_max_ = 1;
#endif
}

According to the constructor, if the user does not set OMP_NUM_THREADS, MXNet sets the OpenMP thread count to omp_get_num_procs()/2 on x86; omp_get_num_procs() returns the number of logical CPUs.
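The constructor's decision logic boils down to a few branches. Here is a Python sketch of it (the function name and keyword arguments are illustrative, not MXNet API):

```python
import os

def default_omp_threads(mxnet_omp_max=None, omp_num_threads_env=None,
                        num_procs=os.cpu_count(), is_intel_x86=True):
    """Mirror the constructor's decision: an explicit MXNET_OMP_MAX_THREADS wins,
    then OMP_NUM_THREADS from the environment; otherwise use num_procs,
    halved on x86 because hyperthreading is assumed to be enabled."""
    if mxnet_omp_max is not None:
        return mxnet_omp_max
    if omp_num_threads_env is not None:
        return omp_num_threads_env
    return num_procs >> 1 if is_intel_x86 else num_procs

# On the 28-core machine above (no hyperthreading) the default yields only 14:
print(default_omp_threads(num_procs=28))  # -> 14
```

This is exactly why only 14 of the 28 cores are busy in the top output above.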

So there are two solutions: (1) set the OMP_NUM_THREADS environment variable; (2) rewrite the code below:

#ifdef ARCH_IS_INTEL_X86
      omp_thread_max_ >>= 1;
#endif

1. Set the OMP_NUM_THREADS environment variable

export OMP_NUM_THREADS=28
python /path/to/benchmark_score.py

2. Rewrite the code

Whether or not hyperthreading is enabled, ensure that omp_thread_max_ equals the number of physical cores. For example:

// Requires <array>, <cstdio>, <memory>, <stdexcept>, <string>.
// Execute a shell command and return its stdout as a string.
std::string OpenMP::exec_shell(const char* cmd) {
  std::array<char, 128> buffer;
  std::string result;
  std::shared_ptr<FILE> pipe(popen(cmd, "r"), pclose);
  if (!pipe) throw std::runtime_error("popen() failed!");
  while (fgets(buffer.data(), buffer.size(), pipe.get()) != nullptr) {
    result += buffer.data();
  }
  return result;
}

OpenMP::OpenMP()
  : omp_num_threads_set_in_environment_(is_env_set("OMP_NUM_THREADS")) {
#ifdef _OPENMP
  const int max = dmlc::GetEnv("MXNET_OMP_MAX_THREADS", INT_MIN);
  if (max != INT_MIN) {
    omp_thread_max_ = max;
  } else {
    if (!omp_num_threads_set_in_environment_) {
      omp_thread_max_ = omp_get_num_procs();
#ifdef ARCH_IS_INTEL_X86
      // omp_thread_max_ >>= 1;  // old behavior: assume hyperthreading
      // Get the number of physical CPU packages (sockets).
      int physical_cpus = std::stoi(exec_shell("cat /proc/cpuinfo | grep 'physical id' | sort -u | wc -l"));
      // Get the physical core count of each package.
      int cores = std::stoi(exec_shell("cat /proc/cpuinfo | grep 'cores' | sort -u | awk '{print $4}'"));
      omp_thread_max_ = physical_cpus * cores;
#endif
      omp_set_num_threads(omp_thread_max_);
    } else {
      omp_thread_max_ = omp_get_max_threads();
    }
  }
#else
  enabled_ = false;
  omp_thread_max_ = 1;
#endif
}

Result

INFO:root:batch size 16, image/sec: 126.780864
INFO:root:batch size 32, image/sec: 126.437795
INFO:root:batch size 64, image/sec: 116.306735
INFO:root:batch size 128, image/sec: 117.576966
INFO:root:batch size 256, image/sec: 135.782432
TaoLv commented 6 years ago

Great analysis. I think setting OMP_NUM_THREADS to the number of physical cores is a good choice, and you had better also bind the threads to physical cores by setting KMP_AFFINITY.
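Concretely, combining the two suggestions might look like this (KMP_AFFINITY is honored by the Intel OpenMP runtime; the core count of 28 is for the machine in this issue, and the script path is a placeholder):

```shell
# Pin one OpenMP thread per physical core (Intel OpenMP runtime only).
export KMP_AFFINITY=granularity=fine,compact,1,0
# Use all physical cores, since this machine has no hyperthreading.
export OMP_NUM_THREADS=28
python /path/to/benchmark_score.py
```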

I'm afraid the code below is not portable and cannot run on other operating systems.

// get the number of physical CPU packages
int physical_cpus = std::stoi(exec_shell("cat /proc/cpuinfo | grep 'physical id' | sort -u | wc -l"));
// get the physical core count of each package
int cores = std::stoi(exec_shell("cat /proc/cpuinfo | grep 'cores' | sort -u | awk '{print $4}'"));

FYI, here are some previous discussions about it: https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-361874591 Contribution is welcome :)

XiaotaoChen commented 6 years ago

Yes, the code works only on Linux; I tested it on my Ubuntu machine. Thx :) @TaoLv

pengzhao-intel commented 6 years ago

Thanks, Xiaotao

Add @cjolivier01 to comment since he implemented this piece of code to set thread number for mxnet.

cjolivier01 commented 6 years ago

One day maybe someone will write some portable code to determine the actual number of physical cores. On Linux I've seen code that does it by parsing all sorts of stuff out of the /proc directory; it was a lot of code. Maybe Intel has an MKL routine that they could share?
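On Linux at least, the parsing does not have to be heavy. A small sketch (illustrative only, and still not portable to other OSes) counts the distinct hyperthread-sibling sets exposed by sysfs:

```python
import glob
import os

def physical_cores_sysfs():
    """Count distinct physical cores from the sysfs CPU topology (Linux only).
    Logical CPUs that are hyperthread siblings share the same siblings list,
    so the number of distinct lists equals the number of physical cores."""
    siblings = set()
    paths = glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")
    for path in paths:
        with open(path) as f:
            siblings.add(f.read().strip())
    # Fall back to the logical count if the topology files are unavailable.
    return len(siblings) or os.cpu_count()

print("physical cores:", physical_cores_sysfs())
```

A truly portable solution would still need per-OS branches (e.g. sysctl on macOS, GetLogicalProcessorInformation on Windows), which is what dedicated libraries handle.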

chinakook commented 6 years ago

I think hyperthreading improves very little in this highly synchronized numeric-computation situation. And PyTorch uses https://github.com/pytorch/cpuinfo to detect the CPU.