Open giordano opened 4 months ago
It's been a long time since I wrote it, but I'm pretty sure I remember that it does use OpenMP if you build and run it the right way. I don't have access to a Grace-Grace system though. 🤷
The program doesn't scale at all with the number of threads:
$ mpirun -n 144 hybrid_pi
Calculating PI using: 1000000000 slices
144 MPI tasks
1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159220420161
Time taken: 0.0191897 seconds
$ OMP_NUM_THREADS=2 mpirun -n 72 hybrid_pi
Calculating PI using: 1000000000 slices
72 MPI tasks
2 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159243039744
Time taken: 0.0357466 seconds
$ OMP_NUM_THREADS=72 mpirun -n 2 hybrid_pi
Calculating PI using: 1000000000 slices
2 MPI tasks
72 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.1415926503898
Time taken: 1.24084 seconds
$ OMP_NUM_THREADS=36 mpirun -n 2 hybrid_pi
Calculating PI using: 1000000000 slices
2 MPI tasks
36 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265038979
Time taken: 1.23933 seconds
$ OMP_NUM_THREADS=2 mpirun -n 2 hybrid_pi
Calculating PI using: 1000000000 slices
2 MPI tasks
2 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265038982
Time taken: 1.23845 seconds
$ OMP_NUM_THREADS=1 mpirun -n 2 hybrid_pi
Calculating PI using: 1000000000 slices
2 MPI tasks
1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.1415926503899
Time taken: 1.2388 seconds
Also, looking at htop I see only as many active cores as there are MPI ranks, not MPI ranks times threads per rank. I've seen this only with this program; the simple c_omp_pi_dir works fine. For what it's worth, this is with GCC 13.2 and Open MPI 5.0.3:
$ mpicc --version
gcc (GCC) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ mpirun --version
mpirun (Open MPI) 5.0.3
Report bugs to https://www.open-mpi.org/community/help/
Actually, the same program scales as expected when using only OpenMP:
$ OMP_NUM_THREADS=1 ./hybrid_pi
Calculating PI using: 1000000000 slices
1 MPI tasks
1 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358997
Time taken: 2.45613 seconds
$ OMP_NUM_THREADS=18 ./hybrid_pi
Calculating PI using: 1000000000 slices
1 MPI tasks
18 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358981
Time taken: 0.138604 seconds
$ OMP_NUM_THREADS=36 ./hybrid_pi
Calculating PI using: 1000000000 slices
1 MPI tasks
36 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358982
Time taken: 0.0702717 seconds
$ OMP_NUM_THREADS=72 ./hybrid_pi
Calculating PI using: 1000000000 slices
1 MPI tasks
72 OpenMP threads per MPI task
Worker checkins:
Obtained value of PI: 3.14159265358979
Time taken: 0.0368643 seconds
Does one need to do anything special to use both OpenMP and MPI together, besides combining OMP_NUM_THREADS and mpirun -n ...? Or is there anything I should look into in the MPI configuration?
Edit: in ompi_info I see:
$ ompi_info
[...]
Configure command line: '--prefix=/lustre/software/openmpi/grace/gcc13/5.0.3'
'--with-knem=/opt/knem-1.1.4.90mlnx3'
'--with-xpmem=/opt/xpmem' '--without-cuda'
'--enable-mpi1-compatibility' '--disable-debug'
'--without-hcoll' '--enable-mca-no-build=btl-uct'
'--enable-mpi-fortran=all'
'--enable-oshmem-fortran=yes'
'--with-libevent=internal' '--with-hwloc=internal'
'--with-zlib' '--with-pmix=internal'
'--with-prrte=internal'
'--enable-prte-prefix-by-default'
'--with-treematch=yes' '--with-ucx' '--without-ucc'
'--without-ofi' '--enable-ipv6'
'--enable-wrapper-runpath'
[...]
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, Event lib: yes)
[...]
Side note: I believe i at https://github.com/UCL-RITS/pi_examples/blob/09f685ae96e8abe69d009cb5e461c36c979c4267/c_hybrid_mpi%2Bopenmp_dir/hybrid_pi.c#L10 should be long int, like in https://github.com/UCL-RITS/pi_examples/blob/09f685ae96e8abe69d009cb5e461c36c979c4267/c_pi_dir/pi.c#L6, to get meaningful results when num_steps is larger than 2^31.
Try adding --bind-to none, and check the defaults with mpirun --help binding:
$ OMP_NUM_THREADS=1 mpirun -np 1 ./hybrid_pi | grep Time
Time taken: 5.55262 seconds
$ OMP_NUM_THREADS=2 mpirun -np 1 ./hybrid_pi | grep Time
Time taken: 5.55543 seconds
$ OMP_NUM_THREADS=2 mpirun -np 2 ./hybrid_pi | grep Time
Time taken: 2.77878 seconds
$ OMP_NUM_THREADS=1 mpirun -np 2 ./hybrid_pi | grep Time
Time taken: 2.78362 seconds
$ OMP_NUM_THREADS=2 mpirun -np 2 --bind-to none ./hybrid_pi | grep Time
Time taken: 1.38849 seconds
Agreed on the long int thing.
Ah, --bind-to none does the trick, thanks!
$ OMP_NUM_THREADS=72 mpirun -n 2 ./hybrid_pi | grep Time
Time taken: 4.15364 seconds
$ OMP_NUM_THREADS=72 mpirun -n 2 --bind-to none ./hybrid_pi | grep Time
Time taken: 0.504521 seconds
I presume that option is Open MPI-specific? I didn't see it in the MPICH mpiexec man page 🥲
Open MPI's and MPICH's mpirun and mpiexec barely have any options in common.
I haven't dug into the code, but I noticed that when running the c_hybrid_mpi+openmp_dir example on a Grace-Grace system it doesn't seem to use multiple threads: setting the environment variable OMP_NUM_THREADS doesn't have any effect.