cctbx / cctbx_project

Computational Crystallography Toolbox
https://cci.lbl.gov/docs/cctbx

Run_tests_parallel uses only one core #956

Open Trzs opened 10 months ago

Trzs commented 10 months ago

On Perlmutter and friends, run_tests_parallel runs tests in parallel, but all tests are run on just one core.

I created a small reproducer that narrows it down to certain module imports.

main script:

import subprocess
from multiprocessing import Pool

commands = [["libtbx.python", "dummy.py"]] * 10

pool = Pool(processes=10)
for cmd in commands:
    # pass the callable and its arguments; calling Popen() here would
    # launch the subprocess in the parent instead of in a worker
    pool.apply_async(subprocess.run, (cmd,))
pool.close()
pool.join()

dummy.py (the specific imports are not important; they just show that the Python imports themselves run fine):

from boost_adaptbx import boost
#import boost_adaptbx.boost.python as bp
#import boost_python_meta_ext
#import boost_tuple_ext

import inspect
import os
import re
import sys
import warnings
import numpy as np

from libtbx import cpp_function_name

x = 0
for i in range(10**8):
    x += 1.3*i

With all of the boost imports commented out, the dummy scripts run on 10 cores. With any one of them active, it's down to one core. Wrapping the import in os.sched_getaffinity and os.sched_setaffinity calls helps, but is not a real solution.
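The save/restore workaround mentioned above can be sketched as follows (Linux-only: os.sched_getaffinity and os.sched_setaffinity are not available on all platforms; the boost import is left commented since it is specific to the cctbx environment):

```python
import os

# Save the affinity mask before the problematic import ...
saved = os.sched_getaffinity(0)

# from boost_adaptbx import boost  # the import that clobbers the mask

# ... and restore the original CPU set afterwards.
os.sched_setaffinity(0, saved)
```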

Trzs commented 10 months ago

Importing the modules somehow changes the affinity:

In [1]: def get_affinity():
   ...:   for line in open('/proc/self/status'):
   ...:     if 'Cpu' in line:
   ...:       print(line)
   ...:   return
   ...:

In [2]: get_affinity()
 Cpus_allowed:  ffffffff,ffffffff,ffffffff,ffffffff

 Cpus_allowed_list: 0-127

In [3]: import boost_python_meta_ext

In [4]: get_affinity()
 Cpus_allowed:  00000000,00000000,00000000,00000001

 Cpus_allowed_list: 0
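As an aside, the same information can be read without parsing /proc at all, via the standard library (Linux-only):

```python
import os

# os.sched_getaffinity(0) returns the set of CPU ids the calling process
# may run on -- the same data as Cpus_allowed_list in /proc/self/status.
cpus = os.sched_getaffinity(0)
print(len(cpus))
```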
Trzs commented 10 months ago

Tracing system calls with strace libtbx.python dummy.py > trace.log 2>&1 confirms that something is changing the affinity:

[...]
sched_getaffinity(334914, 16, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]) = 16
[...]
sched_setaffinity(334914, 16, [0])      = 0
[...]
bkpoon commented 10 months ago

Can you list your packages? I copied your get_affinity test into a file, and I do not see the change in affinity in a newly created environment with cctbx-base on one of our servers.

test.py

def get_affinity():
  for line in open('/proc/self/status'):
    if 'Cpu' in line:
      print(line)
  return

if __name__ == '__main__':
  get_affinity()

  import boost_python_meta_ext

  get_affinity()
[bkpoon@anaconda:tmp] conda create -n py39 cctbx-base python=3.9
[bkpoon@anaconda:tmp] conda activate py39
(py39) [bkpoon@anaconda:tmp] python test.py
Cpus_allowed:   ffff,ffffffff,ffffffff,ffffffff,ffffffff

Cpus_allowed_list:      0-143

Cpus_allowed:   ffff,ffffffff,ffffffff,ffffffff,ffffffff

Cpus_allowed_list:      0-143

Trzs commented 10 months ago

It seems this behaviour is caused by OMP_PLACES and OMP_PROC_BIND, which were set for Kokkos. Unsetting both leads to the expected behaviour.

more info: https://github.com/pytorch/pytorch/issues/49971 https://github.com/OpenMathLib/OpenBLAS/issues/2238

The core issue seems to be a bug that is triggered when OMP_PLACES is set to threads. As far as I know I am not using OpenBLAS, but the same bug might occur in some other library.
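Since the OpenMP runtime reads OMP_PLACES and OMP_PROC_BIND when it initializes, one way to test this is to clear the variables before the offending module is imported (a sketch, with the environment-specific import left commented):

```python
import os

# Remove the OpenMP binding variables if present; pop with a default
# raises no error when a variable is absent.
for var in ("OMP_PLACES", "OMP_PROC_BIND"):
    os.environ.pop(var, None)

# from boost_adaptbx import boost  # import only after the environment is clean
```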

Trzs commented 10 months ago

Current workaround, which still suppresses the Kokkos warnings: export OMP_PLACES=threads together with export OMP_PROC_BIND=false.

Interaction of these settings with MPI is still an open question.