dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

CPU omp thread affinity becomes imbalanced after backpass #1428

Open yuk12 opened 4 years ago

yuk12 commented 4 years ago

🐛 Bug

I am running the GraphSAGE app in a DGL + PyTorch setting. In the minigun code that spawns OpenMP threads to execute kernels such as copyReduce, I see multiple OpenMP threads being scheduled on the same CPU, so the thread affinity is getting messed up. This happens only after the first backward pass, which creates extra threads in addition to the ones created in the forward pass. For example, if I set OMP_NUM_THREADS=10 (for 10 CPUs), then after the first forward and backward pass there are 20 OS threads. In the next forward pass a subset of 10 threads out of the pool of 20 is used, which leads to the affinity problem.
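For reference, a minimal sketch (illustrative only, not DGL code) of how the imbalance can be observed: inside an OpenMP parallel region, print which CPU each thread is running on. With OMP_NUM_THREADS=10 and healthy affinity, the 10 threads should report 10 distinct CPUs; the issue shows up as several threads reporting the same CPU.

// Sketch: report which CPU each OpenMP thread is currently running on.
// Build (Linux/glibc): g++ -fopenmp -O2 affinity_check.cpp -o affinity_check
#include <cstdio>
#include <omp.h>
#include <sched.h>  // sched_getcpu (glibc-specific)

int main() {
  #pragma omp parallel
  {
    // Healthy affinity: every thread prints a distinct cpu id.
    // Imbalance: two or more threads print the same cpu id while
    // other cores stay idle.
    std::printf("omp thread %2d of %2d -> cpu %d\n",
                omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}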

Environment

Additional context

BarclayII commented 4 years ago

I remember that @jermainewang observed a similar issue on GraphSAGE training on CPU.

yuk12 commented 4 years ago

Looks like the affinity is getting messed up right after PyTorch's autograd creates its C++ threads during the backward pass.

BarclayII commented 4 years ago

I guess the reason probably lies in our handling of OpenMP scheduling. Currently we write OpenMP loops like this:

#pragma omp parallel for  // no schedule clause; chunking is left to the runtime default
for (int i = 0; i < n; ++i) {
  // ...
}

However, in PyTorch for example, the OpenMP threads iterate with a much larger chunk size:

#pragma omp parallel
{
  // Manual static chunking: each thread computes its own contiguous
  // [start, end) range of roughly n / num_threads iterations.
  int n_chunks = omp_get_num_threads();
  int chunk_size = DIV_ROUNDUP(n, n_chunks);
  int chunk_id = omp_get_thread_num();
  int start = chunk_id * chunk_size;
  int end = min((chunk_id + 1) * chunk_size, n);
  for (int i = start; i < end; ++i) {
    // ...
  }
}

EDIT: i.e., PyTorch is not using parallel for here.
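For comparison, a sketch of standard OpenMP behaviour (not a proposed patch): the manual chunking above can also be written with an explicit schedule clause, which hands each thread one contiguous block of roughly n / n_threads iterations.

// Sketch: schedule(static) with no chunk size splits the iteration space
// into one contiguous block per thread, similar to the manual code above.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) {
  // ...
}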

BarclayII commented 4 years ago

@yuk12 Just curious: did you try one of the thread binding options?
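e.g. OMP_PROC_BIND=close together with OMP_PLACES=cores (or the runtime-specific GOMP_CPU_AFFINITY / KMP_AFFINITY). A small sketch, not tied to DGL, to check whether the OpenMP runtime actually picked up a binding policy:

// Sketch: query the binding policy the OpenMP runtime resolved, e.g. after
//   export OMP_PROC_BIND=close
//   export OMP_PLACES=cores
#include <cstdio>
#include <omp.h>

int main() {
  // Per the OpenMP spec: 0 = false (no binding, threads may migrate),
  // 1 = true, 2 = master, 3 = close, 4 = spread.
  std::printf("omp_get_proc_bind() = %d\n",
              static_cast<int>(omp_get_proc_bind()));
  return 0;
}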