geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[DISCUSSION] Hyperthreading and OpenMP parallelization (only applies to GEOS-Chem "Classic") #614

Closed yantosca closed 3 years ago

yantosca commented 3 years ago

Overview

I have been trying to do some profiling of GEOS-Chem Classic on a SLURM partition consisting of Cascade Lake processors. Each node has 24 physical cores, but hyperthreading is activated (i.e. each physical core presents 2 logical cores), so there are 48 possible computational cores per node. However, hyperthreading seems to interfere with attempts to profile the code with the TAU Performance Profiler (i.e. inaccurate results are obtained).

Based on some advice from FAS Research Computing, I've tried to set up runs on the partition so that each OpenMP thread binds to a physical core (and not a logical core). On Cascade Lake, OS proc IDs 0-23 correspond to the physical cores, and IDs 24-47 are their hyperthreaded siblings.
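
To verify the numbering scheme on a given node, you can inspect the CPU topology directly (a quick sketch; the exact lscpu output format varies by system):

# List each logical CPU with its physical core; logical CPUs that share
# a CORE value are hyperthread siblings
lscpu --extended=CPU,CORE,SOCKET,ONLINE

# Or query the hyperthread siblings of logical CPU 0 directly
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list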

In OpenMP 4.5 and higher you should be able to bind OpenMP threads to physical cores with

export OMP_NUM_THREADS=24
export OMP_PLACES=cores

but in my experience I find that this doesn't work well.
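
For reference, the full standard-affinity stanza also sets OMP_PROC_BIND (a sketch of the same idea; the "close" policy here is an assumption, not something tested in these experiments):

# OpenMP 4.5+ standard affinity controls (sketch; untested in these runs)
export OMP_NUM_THREADS=24      # one thread per physical core
export OMP_PLACES=cores        # each place = one physical core (all of its hyperthreads)
export OMP_PROC_BIND=close     # pack threads onto consecutive places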

Another way to do this is to set the OMP_CPU_AFFINITY environment variable to specify which cores you want to use (i.e. make sure you only use core numbers 0-23 and skip 24-47). I tried the setups below.
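
As an aside, SLURM itself can be asked to allocate only one logical CPU per physical core (an alternative I did not test in these experiments):

# Ask SLURM to hand the job only one logical CPU per physical core
#SBATCH --hint=nomultithread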

Experiments

1. Gfortran 10.2 with environment variable OMP_CPU_AFFINITY="0-23"

Compiler

[holyjacob01 refrun7]$ gfortran --version
GNU Fortran (Spack GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Run script commands

#!/bin/bash

#SBATCH -c 24
#SBATCH -N 1
#SBATCH -t 0-00:10
#SBATCH -p huce_cascade
#SBATCH --mem=15000
#SBATCH --mail-type=END

... etc ...

# Specify number of OpenMP threads (one per core)
export OMP_DISPLAY_ENV=true
export OMP_DISPLAY_AFFINITY=true
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
let THREADS_MINUS_ONE=$OMP_NUM_THREADS-1
export OMP_CPU_AFFINITY="0-$THREADS_MINUS_ONE"

... etc ...

Logfile output

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '24'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '524288000'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'TRUE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='TRUE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='24'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='500M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

OMP: pid 34600 tid 34600 thread 0 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34601 thread 1 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34604 thread 4 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34608 thread 8 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34612 thread 12 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34606 thread 6 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34610 thread 10 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34611 thread 11 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34605 thread 5 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34613 thread 13 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34603 thread 3 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34614 thread 14 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34602 thread 2 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34609 thread 9 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34616 thread 16 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34618 thread 18 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34617 thread 17 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34607 thread 7 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34615 thread 15 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34619 thread 19 bound to OS proc set {4,11,17-38}
OMP: pid 34600 tid 34620 thread 20 bound to OS proc set {4,11,17-38}

Analysis

I'm not sure this is exactly what we want. It seems that we are using some of the hyperthreaded cores (24-38 appear in the proc set). (Or it could be that these are all physical cores and that for whatever reason the numbering scheme is not what we would expect.)

It seems there is another environment variable we can try: instead of OMP_CPU_AFFINITY we can set GOMP_CPU_AFFINITY. GOMP (libgomp) is the GNU OpenMP runtime library bundled with the GCC compilers, including gfortran. Let's see if this makes a difference.
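
For reference, GOMP_CPU_AFFINITY accepts explicit CPU lists, ranges, and strided ranges (these forms follow the libgomp documentation; only the first is used below):

export GOMP_CPU_AFFINITY="0-23"      # range: CPUs 0 through 23
export GOMP_CPU_AFFINITY="0 3 1 2"   # explicit list, in thread-binding order
export GOMP_CPU_AFFINITY="0-46:2"    # every 2nd CPU from 0 to 46 (0, 2, 4, ...)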

2. Gfortran 10.2 with GOMP_CPU_AFFINITY="0-23"

Compiler

[holyjacob01 refrun7]$ gfortran --version
GNU Fortran (Spack GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Run script commands:

#!/bin/bash

#SBATCH -c 24
#SBATCH -N 1
#SBATCH -t 0-00:10
#SBATCH -p huce_cascade
#SBATCH --mem=15000
#SBATCH --mail-type=END

... etc ...

# Specify number of OpenMP threads (one per core)
export OMP_DISPLAY_ENV=true
export OMP_DISPLAY_AFFINITY=true
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
let THREADS_MINUS_ONE=$OMP_NUM_THREADS-1
export GOMP_CPU_AFFINITY="0-$THREADS_MINUS_ONE"

... etc ...

Logfile output:

OMP: Warning #123: Ignoring invalid OS proc ID 0.
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #123: Ignoring invalid OS proc ID 5.
OMP: Warning #123: Ignoring invalid OS proc ID 6.
OMP: Warning #123: Ignoring invalid OS proc ID 7.
OMP: Warning #123: Ignoring invalid OS proc ID 8.
OMP: Warning #123: Ignoring invalid OS proc ID 9.
OMP: Warning #123: Ignoring invalid OS proc ID 10.
OMP: Warning #123: Ignoring invalid OS proc ID 12.
OMP: Warning #123: Ignoring invalid OS proc ID 13.
OMP: Warning #123: Ignoring invalid OS proc ID 14.
OMP: Warning #123: Ignoring invalid OS proc ID 15.
OMP: Warning #123: Ignoring invalid OS proc ID 16.

... etc ...

OMP: pid 35790 tid 35790 thread 0 bound to OS proc set {4}
OMP: pid 35790 tid 35791 thread 1 bound to OS proc set {11}
OMP: pid 35790 tid 35794 thread 4 bound to OS proc set {19}
OMP: pid 35790 tid 35798 thread 8 bound to OS proc set {23}
OMP: pid 35790 tid 35796 thread 6 bound to OS proc set {21}
OMP: pid 35790 tid 35799 thread 9 bound to OS proc set {4}
OMP: pid 35790 tid 35802 thread 12 bound to OS proc set {18}
OMP: pid 35790 tid 35797 thread 7 bound to OS proc set {22}
OMP: pid 35790 tid 35803 thread 13 bound to OS proc set {19}
OMP: pid 35790 tid 35805 thread 15 bound to OS proc set {21}
OMP: pid 35790 tid 35792 thread 2 bound to OS proc set {17}
OMP: pid 35790 tid 35795 thread 5 bound to OS proc set {20}
OMP: pid 35790 tid 35793 thread 3 bound to OS proc set {18}
OMP: pid 35790 tid 35801 thread 11 bound to OS proc set {17}
OMP: pid 35790 tid 35804 thread 14 bound to OS proc set {20}
OMP: pid 35790 tid 35808 thread 18 bound to OS proc set {4}
OMP: pid 35790 tid 35800 thread 10 bound to OS proc set {11}
OMP: pid 35790 tid 35806 thread 16 bound to OS proc set {22}
OMP: pid 35790 tid 35810 thread 20 bound to OS proc set {17}
OMP: pid 35790 tid 35807 thread 17 bound to OS proc set {23}
OMP: pid 35790 tid 35811 thread 21 bound to OS proc set {18}
OMP: pid 35790 tid 35809 thread 19 bound to OS proc set {11}
OMP: pid 35790 tid 35812 thread 22 bound to OS proc set {19}
OMP: pid 35790 tid 35813 thread 23 bound to OS proc set {20}

Analysis

  1. The run progresses at normal speed.
  2. Each OpenMP thread is bound to a physical core (in the range 0-23), but more than one thread is bound to a given core.
  3. The duplicated output to the log files that we see when we take the entire node does not occur.
  4. I'm also not sure what those invalid proc IDs are. Maybe those are actually the hyperthreaded cores. (Another possibility: SLURM's cgroup only granted this job a subset of the OS proc IDs, so IDs outside that set are rejected.)

In short, each thread is bound to a single core, but multiple threads share some of the cores in the range 0-23. One way to double-check a binding like this from outside the process is shown below.
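
This sketch uses the PID from the log above; in a live run you would substitute the actual process ID:

# Show the CPU affinity of the main process, then of each of its threads
taskset -cp 35790
for tid in /proc/35790/task/*; do
    taskset -cp "$(basename "$tid")"
done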

3. ifort 19.0.5 with OMP_CPU_AFFINITY="0-23"

Compiler:

[holyjacob01 devrun7]$ ifort --version
ifort (IFORT) 19.0.5.281 20190815
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

Job file commands:

#!/bin/bash

#SBATCH -c 24
#SBATCH -N 1
#SBATCH -t 0-02:30
#SBATCH -p huce_cascade
#SBATCH --mem=15000
#SBATCH --mail-type=END

... etc ...

# Specify number of OpenMP threads (one per core)
export OMP_DISPLAY_ENV=true
export OMP_DISPLAY_AFFINITY=true
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

#----------------------------------------------------------------------------
# Bind OpenMP threads to physical CPU cores (i.e. don't use hyperthreading)
let THREADS_MINUS_ONE=$OMP_NUM_THREADS-1
if [[ $FC =~ "gfortran" ]]; then
    export GOMP_CPU_AFFINITY="0-$THREADS_MINUS_ONE"
    echo "GOMP_CPU_AFFINITY: $GOMP_CPU_AFFINITY"
else 
    export OMP_CPU_AFFINITY="0-$THREADS_MINUS_ONE" 
    echo "OMP_CPU_AFFINITY: $OMP_CPU_AFFINITY"
fi
#----------------------------------------------------------------------------

Logfile output:

OMP_CPU_AFFINITY: 0-23

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '24'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '524288000'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'TRUE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='TRUE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='24'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='500M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

OMP: pid 83822 tid 83822 thread 0 bound to OS proc set {0-23}
OMP: pid 83822 tid 83823 thread 1 bound to OS proc set {0-23}
OMP: pid 83822 tid 83826 thread 4 bound to OS proc set {0-23}
OMP: pid 83822 tid 83829 thread 7 bound to OS proc set {0-23}
OMP: pid 83822 tid 83827 thread 5 bound to OS proc set {0-23}
OMP: pid 83822 tid 83825 thread 3 bound to OS proc set {0-23}
OMP: pid 83822 tid 83830 thread 8 bound to OS proc set {0-23}
OMP: pid 83822 tid 83833 thread 11 bound to OS proc set {0-23}
OMP: pid 83822 tid 83834 thread 12 bound to OS proc set {0-23}
OMP: pid 83822 tid 83824 thread 2 bound to OS proc set {0-23}
OMP: pid 83822 tid 83831 thread 9 bound to OS proc set {0-23}
OMP: pid 83822 tid 83828 thread 6 bound to OS proc set {0-23}
OMP: pid 83822 tid 83838 thread 16 bound to OS proc set {0-23}
OMP: pid 83822 tid 83841 thread 19 bound to OS proc set {0-23}
OMP: pid 83822 tid 83840 thread 18 bound to OS proc set {0-23}
OMP: pid 83822 tid 83845 thread 23 bound to OS proc set {0-23}
OMP: pid 83822 tid 83842 thread 20 bound to OS proc set {0-23}
OMP: pid 83822 tid 83843 thread 21 bound to OS proc set {0-23}
OMP: pid 83822 tid 83839 thread 17 bound to OS proc set {0-23}
OMP: pid 83822 tid 83844 thread 22 bound to OS proc set {0-23}
OMP: pid 83822 tid 83836 thread 14 bound to OS proc set {0-23}
OMP: pid 83822 tid 83835 thread 13 bound to OS proc set {0-23}
OMP: pid 83822 tid 83837 thread 15 bound to OS proc set {0-23}
OMP: pid 83822 tid 83832 thread 10 bound to OS proc set {0-23}

Analysis:

This seems to do what we want: every thread's allowed proc set is restricted to the physical cores 0-23, so the hyperthreaded cores 24-47 are avoided. (Note that the threads are not pinned to individual cores; each may run anywhere within 0-23.)
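
For Intel Fortran, the Intel OpenMP runtime also has its own affinity control, KMP_AFFINITY, which can pin threads at core granularity (an alternative I have not tried in these experiments):

# Intel OpenMP runtime affinity (alternative, untested here): spread threads
# across cores, one thread per core when OMP_NUM_THREADS equals the core count,
# and print the resulting binding
export KMP_AFFINITY="verbose,granularity=core,scatter"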

Recommendation:

It seems that using

export GOMP_CPU_AFFINITY="0-23"

for GNU Fortran will avoid the hyperthreading cores (though we should still confirm this).

Also

export OMP_CPU_AFFINITY="0-23"

seems to avoid the hyperthreading cores with Intel Fortran on the Cascade Lake partition.

I will rerun the profiles with these commands to see if the profiling output makes more sense.

yantosca commented 3 years ago

As it turns out, the results above were not an apples-to-apples comparison. The KPP integrator routine in the Dev code was calling UPDATE_RCONST (i.e. the rate-law update function) on each integration step. Although the time to compute the reaction rates improved, the overall integration time increased.

A new test shows that removing the useless computations in the rate-law functions actually decreases the run time by about 4.5 minutes per 7-day simulation (or almost 20 minutes for a 31-day benchmark simulation). See this link: https://github.com/geoschem/geos-chem/issues/598#issuecomment-778237911