UCL-RITS / pi_examples

A lot of ways to run the same way of calculating pi. Some of them are dumb.

`c_hybrid_mpi+openmp_dir` doesn't seem to actually use OpenMP #21

Open · giordano opened this issue 4 months ago

giordano commented 4 months ago

I haven't dug into the code, but I noticed that when running the `c_hybrid_mpi+openmp_dir` example on a Grace-Grace system it doesn't seem to use multiple threads: setting the environment variable OMP_NUM_THREADS has no effect.

ikirker commented 4 months ago

It's been a long time since I wrote it, but I'm pretty sure I remember that it does use OpenMP if you build and run it the right way. I don't have access to a Grace-Grace system though. 🤷

giordano commented 4 months ago

The program doesn't scale at all with the number of threads:

$ mpirun -n 144 hybrid_pi
Calculating PI using:  1000000000 slices
  144 MPI tasks
  1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159220420161
Time taken: 0.0191897 seconds
$ OMP_NUM_THREADS=2 mpirun -n 72 hybrid_pi
Calculating PI using:  1000000000 slices
  72 MPI tasks
  2 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159243039744
Time taken: 0.0357466 seconds
$ OMP_NUM_THREADS=72 mpirun -n 2 hybrid_pi
Calculating PI using:  1000000000 slices
  2 MPI tasks
  72 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.1415926503898
Time taken: 1.24084 seconds
$ OMP_NUM_THREADS=36 mpirun -n 2 hybrid_pi
Calculating PI using:  1000000000 slices
  2 MPI tasks
  36 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265038979
Time taken: 1.23933 seconds
$ OMP_NUM_THREADS=2 mpirun -n 2 hybrid_pi
Calculating PI using:  1000000000 slices
  2 MPI tasks
  2 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265038982
Time taken: 1.23845 seconds
$ OMP_NUM_THREADS=1 mpirun -n 2 hybrid_pi
Calculating PI using:  1000000000 slices
  2 MPI tasks
  1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.1415926503899
Time taken: 1.2388 seconds

Also, looking at htop I see a number of active cores equal to the number of MPI ranks, not the number of MPI ranks times the number of threads.

I've seen this only with this program; the simple c_omp_pi_dir example works fine. For what it's worth, this is with GCC 13.2 and Open MPI 5.0.3:

$ mpicc --version
gcc (GCC) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ mpirun --version
mpirun (Open MPI) 5.0.3

Report bugs to https://www.open-mpi.org/community/help/
giordano commented 4 months ago

Actually, the same program scales as expected when using only OpenMP:

$ OMP_NUM_THREADS=1 ./hybrid_pi
Calculating PI using:  1000000000 slices
  1 MPI tasks
  1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358997
Time taken: 2.45613 seconds
$ OMP_NUM_THREADS=18 ./hybrid_pi
Calculating PI using:  1000000000 slices
  1 MPI tasks
  18 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358981
Time taken: 0.138604 seconds
$ OMP_NUM_THREADS=36 ./hybrid_pi
Calculating PI using:  1000000000 slices
  1 MPI tasks
  36 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358982
Time taken: 0.0702717 seconds
$ OMP_NUM_THREADS=72 ./hybrid_pi
Calculating PI using:  1000000000 slices
  1 MPI tasks
  72 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358979
Time taken: 0.0368643 seconds

Does one need to do anything special to use both OpenMP and MPI, besides combining OMP_NUM_THREADS and mpirun -n ...? Or is there anything in the MPI configuration I should look into?
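
The only code-side ingredient I'm aware of is initialising MPI with a thread level and building with -fopenmp; this is roughly the pattern I have in mind (a minimal sketch, I haven't checked whether hybrid_pi.c does exactly this):

/* build: mpicc -fopenmp hybrid_hello.c -o hybrid_hello */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls, which is
       enough when MPI is used outside the OpenMP parallel regions. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("%d MPI tasks, up to %d OpenMP threads per task\n",
               size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}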

Edit: in ompi_info I see:

$ ompi_info
[...]
  Configure command line: '--prefix=/lustre/software/openmpi/grace/gcc13/5.0.3'
                          '--with-knem=/opt/knem-1.1.4.90mlnx3'
                          '--with-xpmem=/opt/xpmem' '--without-cuda'
                          '--enable-mpi1-compatibility' '--disable-debug'
                          '--without-hcoll' '--enable-mca-no-build=btl-uct'
                          '--enable-mpi-fortran=all'
                          '--enable-oshmem-fortran=yes'
                          '--with-libevent=internal' '--with-hwloc=internal'
                          '--with-zlib' '--with-pmix=internal'
                          '--with-prrte=internal'
                          '--enable-prte-prefix-by-default'
                          '--with-treematch=yes' '--with-ucx' '--without-ucc'
                          '--without-ofi' '--enable-ipv6'
                          '--enable-wrapper-runpath'
[...]
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
[...]
giordano commented 4 months ago

Side note: I believe `i` at https://github.com/UCL-RITS/pi_examples/blob/09f685ae96e8abe69d009cb5e461c36c979c4267/c_hybrid_mpi%2Bopenmp_dir/hybrid_pi.c#L10 should be a long int, like at https://github.com/UCL-RITS/pi_examples/blob/09f685ae96e8abe69d009cb5e461c36c979c4267/c_pi_dir/pi.c#L6, to get meaningful results when num_steps is larger than 2^31.
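
For illustration, an OpenMP-only sketch of the kind of loop I mean, with the counter widened (my own variable names, not the actual hybrid_pi.c code); with a 32-bit int counter, i would overflow before ever reaching num_steps once num_steps exceeds 2^31 - 1:

/* build: gcc -fopenmp pi_long.c -o pi_long */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Deliberately larger than 2^31 - 1: both num_steps and the loop
       counter have to be long int for the loop bounds to make sense. */
    long int num_steps = 3000000000L;
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* Midpoint rule for the integral of 4/(1+x^2) on [0,1]. */
    #pragma omp parallel for reduction(+:sum)
    for (long int i = 0; i < num_steps; i++) {
        double x = ((double)i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.14f using %d threads\n", sum * step, omp_get_max_threads());
    return 0;
}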

ikirker commented 4 months ago

Try adding --bind-to none, and check the defaults with mpirun --help binding; if each rank is bound to a single core, all of that rank's OpenMP threads end up pinned to the same core.

$ OMP_NUM_THREADS=1 mpirun -np 1 ./hybrid_pi | grep Time
Time taken: 5.55262 seconds
$ OMP_NUM_THREADS=2 mpirun -np 1 ./hybrid_pi | grep Time
Time taken: 5.55543 seconds
$ OMP_NUM_THREADS=2 mpirun -np 2 ./hybrid_pi | grep Time
Time taken: 2.77878 seconds
$ OMP_NUM_THREADS=1 mpirun -np 2 ./hybrid_pi | grep Time
Time taken: 2.78362 seconds
$ OMP_NUM_THREADS=2 mpirun -np 2 --bind-to none ./hybrid_pi | grep Time
Time taken: 1.38849 seconds
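
If you want to double-check where the threads actually end up, rather than eyeballing htop, a quick diagnostic along these lines should work (just a sketch, not something in the repo):

/* build: mpicc -fopenmp where_am_i.c -o where_am_i */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With --bind-to core all of a rank's threads stay on one core;
       with --bind-to none they should spread across the node. */
    #pragma omp parallel
    printf("rank %d thread %d on cpu %d\n",
           rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}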

Agreed on the long int thing.

giordano commented 4 months ago

Ah, --bind-to none does the trick, thanks!

$ OMP_NUM_THREADS=72 mpirun -n 2 ./hybrid_pi | grep Time
Time taken: 4.15364 seconds
$ OMP_NUM_THREADS=72 mpirun -n 2 --bind-to none ./hybrid_pi | grep Time
Time taken: 0.504521 seconds

I presume that option is Open MPI-specific? I didn't see it in the MPICH mpiexec man page 🥲

ikirker commented 4 months ago

Open MPI's and MPICH's mpirun and mpiexec barely have any options in common.