binding to certain cores returning `invalid` on `5x5_amazon` resolution

glemieux commented 7 months ago

This issue was discovered after ctsm updated the ccs_confim_cesm version to ccs_config_cesm0.0.92 (https://github.com/ESCOMP/CTSM/pull/2416). Since then, ctsm test cases using 5x5_amazon resolution are failing to run with the following error:

cesm.log

  1 dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
  2 dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
  3 dec0417.hsn.de.hpc.ucar.edu 4:
  4 dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
  5 dec0417.hsn.de.hpc.ucar.edu 4:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
  6 dec0417.hsn.de.hpc.ucar.edu 4:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
  7 dec0417.hsn.de.hpc.ucar.edu 4:                [--localalloc | -l] command args ...
  8 dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--show | -s]
  9 dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--hardware | -H]
 10 dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
 11 dec0417.hsn.de.hpc.ucar.edu 4:                [--strict | -t]
 12 dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
 13 dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --file | -f <tmpfsfile>
 14 dec0417.hsn.de.hpc.ucar.edu 4:                [--huge | -u] [--touch | -T]
 15 dec0417.hsn.de.hpc.ucar.edu 4:                memory policy [--dump | -d] [--dump-nodes | -D]
 16 dec0417.hsn.de.hpc.ucar.edu 4:
 17 dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
 18 dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
 19 dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
 20 dec0417.hsn.de.hpc.ucar.edu 4:   netdev:DEV the node connected to network device DEV
 21 dec0417.hsn.de.hpc.ucar.edu 4:   file:PATH  the node the block device of path is connected to
 22 dec0417.hsn.de.hpc.ucar.edu 4:   ip:HOST    the node of the network device host routes through
 23 dec0417.hsn.de.hpc.ucar.edu 4:   block:PATH the node of block device path
 24 dec0417.hsn.de.hpc.ucar.edu 4:   pci:[seg:]bus:dev[:func] The node of a PCI device
 25 dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
 26 dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
 27 dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
 28 dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
 29 dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
 30 dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
 31 dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
 32 dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
 33 dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
 34 dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
 35 dec0417.hsn.de.hpc.ucar.edu 3:
 36 dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
 37 dec0417.hsn.de.hpc.ucar.edu 3:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
 38 dec0417.hsn.de.hpc.ucar.edu 3:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
 39 dec0417.hsn.de.hpc.ucar.edu 3:                [--localalloc | -l] command args ...
 40 dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--show | -s]
 41 dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--hardware | -H]
 42 dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
 43 dec0417.hsn.de.hpc.ucar.edu 3:                [--strict | -t]
 44 dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
 45 dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --file | -f <tmpfsfile>
 46 dec0417.hsn.de.hpc.ucar.edu 3:                [--huge | -u] [--touch | -T]
 47 dec0417.hsn.de.hpc.ucar.edu 3:                memory policy [--dump | -d] [--dump-nodes | -D]
 48 dec0417.hsn.de.hpc.ucar.edu 3:
 49 dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
 50 dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
 51 dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
 52 dec0417.hsn.de.hpc.ucar.edu 3:   netdev:DEV the node connected to network device DEV
 53 dec0417.hsn.de.hpc.ucar.edu 3:   file:PATH  the node the block device of path is connected to
 54 dec0417.hsn.de.hpc.ucar.edu 3:   ip:HOST    the node of the network device host routes through
 55 dec0417.hsn.de.hpc.ucar.edu 3:   block:PATH the node of block device path
 56 dec0417.hsn.de.hpc.ucar.edu 3:   pci:[seg:]bus:dev[:func] The node of a PCI device
 57 dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
 58 dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
 59 dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
 60 dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
 61 dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
 62 dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
 63 dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
 64 dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
 65 dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
 66 dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15

mpibind.log

Chunk info
  1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
-- -- -- --
MPI exec line:
  mpiexec --label --line-buffer -n 5 -ppn 5 --cpu-bind none -env OMP_NUM_THREADS=1 /glade/u/apps/opt/mpitools/mpibind/cpu_bind /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe 
-- -- -- --
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65

roryck commented 7 months ago

Hello Gregory,

Thanks for reporting this. This is actually due to the PBS select line, specifically, because you are only requesting 5 cpus. Under these circumstances, PBS will create a linux cgroup with only 5 cpus, all on the first socket. The mpibind script tries to bind processes across both sockets, to give your job full memory bandwidth, however, core #s > 4 won't exist in the PBS cgroup, hence this failure. So, to get your case to run immediately, try rerunning with 128 CPUs and 5 MPI ranks in the select line, e.g. with something similar to:

#PBS -l select=1:ncpus=128:mpiprocs=5:ompthreads=1:mem=230GB

In general, regardless of how many CPUs you intend to use, you should always request 128 on a derecho node so that you have access to full memory performance.

On the mpibind side, I'll add some code to catch this type of request, and exit gracefully with a more meaningful error message.

Thanks again for the report.

glemieux commented 7 months ago

Thanks for the detailed explanation @roryck.

@jedwards4b, should I add this as an issue to ccs_config_cesm for an update to config_batch.xml?

roryck commented 1 week ago

Resolved with modified PBS select

NCAR / mpibind

binding to certain cores returning `invalid` on `5x5_amazon` resolution #5

cesm.log

mpibind.log