firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Can we use different number of OpenMP threads per MPI process #13408

Open mcgratta opened 2 months ago

mcgratta commented 2 months ago

Here is a simple Hello World program that uses both OpenMP and MPI. Can we run this with a different number of OpenMP threads per MPI process?

program HelloWorld

use mpi
use omp_lib   ! provides OMP_GET_THREAD_NUM
implicit none

integer :: MyRank, Numprocs, ierror
integer :: ThreadID
character(MPI_MAX_PROCESSOR_NAME) :: PNAME
integer :: PNAMELEN=0

! Start MPI and find out which rank this is and where it runs
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, Numprocs, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, MyRank, ierror)
call MPI_GET_PROCESSOR_NAME(PNAME, PNAMELEN, ierror)

! Every OpenMP thread of every MPI process reports itself on unit 0
!$OMP PARALLEL PRIVATE(ThreadID)
ThreadID = OMP_GET_THREAD_NUM()
write(0,'(a,i0,3a,i0)') "Hello World From MPI Process ", MyRank," on ",trim(PNAME),"; OpenMP Thread ", ThreadID
!$OMP END PARALLEL

call MPI_FINALIZE(ierror)

end program HelloWorld

cxp484 commented 2 months ago

Thanks. I will try it.

marcosvanella commented 2 months ago

We can set a different number of threads per MPI process by calling omp_set_num_threads(my_rank_nthr) on each rank. See: https://www.openmp.org/spec-html/5.0/openmpsu110.html

mcgratta commented 2 months ago

Sure, but the question is: how do you tell Slurm to provide a different number of cores to each MPI process? Slurm cannot wait until FDS starts to allocate the extra cores to certain MPI processes.

drjfloyd commented 2 months ago

I think this is what we need:

https://slurm.schedmd.com/heterogeneous_jobs.html

cxp484 commented 2 months ago

I tried several options on spark, including heterogeneous job submissions. The following configuration worked for me (see the attached code and job submission script):

mpirun -np 2 -env OMP_NUM_THREADS=4  ./mpi_openmp_test : \
       -np 1 -env OMP_NUM_THREADS=16 ./mpi_openmp_test : \
       -np 3 -env OMP_NUM_THREADS=2  ./mpi_openmp_test

The output is as follows:

 MYRANK=           0  NUM_OF_THREADS=           4
 MYRANK=           1  NUM_OF_THREADS=           4
 MYRANK=           2  NUM_OF_THREADS=          16
 MYRANK=           3  NUM_OF_THREADS=           2
 MYRANK=           4  NUM_OF_THREADS=           2
 MYRANK=           5  NUM_OF_THREADS=           2

It appears that the MPI rank (RANK_ID) is assigned sequentially, in the order the process groups appear in the mpirun command. So if you know which MPI process requires more threads, you can specify it directly in the mpirun command at the appropriate position.
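If ranks really are numbered in command-line order, the rank range of each group is just a running sum of the -np values. A minimal bash sketch of that bookkeeping follows; the function name rank_ranges and the "np:threads" argument format are invented here for illustration and are not part of the attached scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MPI rank range each mpirun group receives,
# assuming ranks are assigned sequentially in command-line order (as observed above).
# Each argument is "<np>:<threads>", e.g. 2:4 means -np 2 with OMP_NUM_THREADS=4.
rank_ranges() {
  local first=0 spec np threads last
  for spec in "$@"; do
    np=${spec%%:*}        # text before the colon: number of MPI processes
    threads=${spec##*:}   # text after the colon: OpenMP threads per process
    last=$((first + np - 1))
    echo "ranks ${first}-${last}: ${threads} threads each"
    first=$((last + 1))
  done
}

# The three groups used in the mpirun command above
rank_ranges 2:4 1:16 3:2
```

For the command above this reports ranks 0-1 with 4 threads, rank 2 with 16, and ranks 3-5 with 2, which matches the MYRANK listing.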

MPI_OpenMP.tar.gz

mcgratta commented 2 months ago

How do you know that you have been allocated 30 cores?

cxp484 commented 2 months ago

I am not sure. Let me check if there is a command to know how many cores are allocated to a job.
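One way to check from inside the job, sketched under the assumption of a Linux compute node: the kernel lists the CPUs a process may run on in the Cpus_allowed_list field of /proc/self/status, in a form such as "0-3,8". The helper count_cpus below is hypothetical and only counts the entries of such a list. The mask reflects the Slurm allocation only if the cluster confines jobs with cgroups or task affinity; `scontrol show job $SLURM_JOB_ID` reports the scheduler's own view.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the CPUs in a Linux-style CPU list such as "0-3,8,16-17".
count_cpus() {
  local list=$1 total=0 part lo hi
  local IFS=,                 # split the list on commas
  for part in $list; do
    lo=${part%%-*}            # start of the range (or the single CPU id)
    hi=${part##*-}            # end of the range (same as lo for a single id)
    total=$((total + hi - lo + 1))
  done
  echo "$total"
}

# On Linux, show the CPUs this shell and its children are allowed to run on.
if [ -r /proc/self/status ]; then
  allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
  echo "allowed CPUs: ${allowed} ($(count_cpus "$allowed") total)"
fi
```

Running this at the top of the job script, before mpirun, shows what the job as a whole was given; the per-rank pinning still has to be read from the MPI startup log.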

drjfloyd commented 2 months ago

You can look at the .log file. For example in race_test_4.log I get this. One MPI job pinned to 4 cpus.

[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       1467670  pele-09    2,3,33,53

mcgratta commented 2 months ago

This is what the pinning looks like for the sample case

[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       2343160  spark002   {5}
[0] MPI startup(): 1       2343161  spark002   {6}
[0] MPI startup(): 2       2343162  spark002   {21}
[0] MPI startup(): 3       2343163  spark002   {22}
[0] MPI startup(): 4       2343164  spark002   {37}
[0] MPI startup(): 5       2343165  spark002   {53}

I assume that we only get 1 core per MPI process. The OpenMP threads are, I suppose, crammed onto a single core.

cxp484 commented 2 months ago

Thanks, Jason and Kevin. CPU pinning is a great idea, and I'm currently testing it. I've also added a C code that retrieves the CPUID at runtime (using the sched_getcpu() system call) to show which CPU a particular thread is executing on. The code is attached.

Test 1: 1 MPI process, 4 threads, only 1 CPU requested through ntasks, and I_MPI_PIN_DOMAIN not set. As expected, all threads execute on CPU 0.

Input:
#SBATCH --ntasks=1
mpirun -np 1 -env OMP_NUM_THREADS=4  ./mpi_openmp_test

Output:
Rank    Pid      Node name  Pin cpu
0       2769504  spark001   {0}

MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 0

Test 2: 1 MPI process, 4 threads, only 1 CPU requested through ntasks, I_MPI_PIN_DOMAIN=omp. The output is exactly the same as above: ntasks requests only 1 CPU, so the scheduler cannot allocate more CPUs to the job.

Input:
#SBATCH --ntasks=1
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=4  ./mpi_openmp_test

Test 3: 1 MPI process, 4 threads, 4 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. Now the scheduler allocates 4 different CPUs to the job but, interestingly, two of the threads executed on CPU 32. Sometimes all 4 CPUs are used, one per thread, which is what I would expect. But sometimes three of the threads report the same CPU ID. I am not sure why. Note that each thread runs an infinite do-while loop, so it is not a case of one thread finishing its work and releasing the CPU to another.

Input:
#SBATCH --ntasks=4
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=4  ./mpi_openmp_test

Output:
Rank    Pid      Node name  Pin cpu
0       2769749  spark001   {0,16,32,48}

MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 16
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 32
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 32
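A possible explanation, not confirmed in this thread: the pin mask confines the whole process to a set of CPUs, but it does not bind individual OpenMP threads. Within the mask the Linux scheduler is free to place or move threads, and sched_getcpu() only reports where a thread happens to be at the moment of the call. The standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND request per-thread binding. The sketch below is the Test 3 input with those two variables added; whether this removes the overlap on spark has not been tested here.

```shell
#!/usr/bin/env bash
#SBATCH --ntasks=4

# Same setup as Test 3, plus per-thread binding. Assumption: the OpenMP runtime
# in use honors the standard OMP_PLACES / OMP_PROC_BIND environment variables.
export I_MPI_PIN_DOMAIN=omp
export OMP_PLACES=cores      # one place per core
export OMP_PROC_BIND=close   # bind each thread to a place near the primary thread's

mpirun -np 1 -env OMP_NUM_THREADS=4 ./mpi_openmp_test
```

If the threads then report four distinct CPU IDs on every run, the earlier overlap was a scheduling effect inside the mask and not an allocation problem.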

Test 4: 1 MPI process, 8 threads, 8 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. The behavior is exactly the same as in Test 3.

Input:
#SBATCH --ntasks=8
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=8  ./mpi_openmp_test

Output:
Rank    Pid      Node name  Pin cpu
0       2770372  spark001   {0,1,16,17,32,33,48,49}

MPI PROCESS 0 ON spark001; OPENMP THREAD 4; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 7; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 6; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 33
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 5; CPUID 48

Now for multiple MPI processes with different thread counts.

Test 5: 6 MPI processes, 30 threads in total, 30 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. I sorted the output by MPI process. The issue with mpirun is that, during pinning, it only reads the OMP_NUM_THREADS variable from the first process group. As a result, the MPI processes with 16 and 2 threads are also pinned to only 4 CPUs each.

Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun  -np 2 -env OMP_NUM_THREADS=4 ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=16 ./mpi_openmp_test : \
        -np 3 -env OMP_NUM_THREADS=2  ./mpi_openmp_test

Output:
 Rank    Pid      Node name  Pin cpu
 0       2770482  spark001   {2,3,4,5}
 1       2770483  spark001   {6,7,8,9}
 2       2770484  spark001   {18,19,20,21}
 3       2770485  spark001   {22,23,24,25}
 4       2770486  spark001   {34,35,36,37}
 5       2770487  spark001   {38,39,40,50}

MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 5
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 2
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 4
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 3
MPI PROCESS 1 ON spark001; OPENMP THREAD 0; CPUID 9
MPI PROCESS 1 ON spark001; OPENMP THREAD 1; CPUID 6
MPI PROCESS 1 ON spark001; OPENMP THREAD 2; CPUID 8
MPI PROCESS 1 ON spark001; OPENMP THREAD 3; CPUID 7
MPI PROCESS 2 ON spark001; OPENMP THREAD 0; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 1; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 2; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 3; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 4; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 5; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 6; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 7; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 8; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 9; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 10; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 11; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 12; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 13; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 14; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 15; CPUID 19
MPI PROCESS 3 ON spark001; OPENMP THREAD 0; CPUID 25
MPI PROCESS 3 ON spark001; OPENMP THREAD 1; CPUID 22
MPI PROCESS 4 ON spark001; OPENMP THREAD 0; CPUID 37
MPI PROCESS 4 ON spark001; OPENMP THREAD 1; CPUID 34
MPI PROCESS 5 ON spark001; OPENMP THREAD 0; CPUID 50
MPI PROCESS 5 ON spark001; OPENMP THREAD 1; CPUID 50

Test 6: 6 MPI processes in 3 groups, 30 threads in total, 30 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp, but with taskset used to assign the CPUs directly. Now MPI process 2 is allocated exactly 16 CPUs, processes 0 and 1 share 8 CPUs, and processes 3, 4, and 5 share 6 CPUs. However, during execution, different threads still use the same CPUs even when others are available. I am unable to explain this behavior.

Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun  -np 2 -env OMP_NUM_THREADS=4   taskset -c 0-7  ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=16  taskset -c 8-23  ./mpi_openmp_test : \
        -np 3 -env OMP_NUM_THREADS=2   taskset -c 24-29  ./mpi_openmp_test

Output:
Rank    Pid      Node name  Pin cpu
0       2770661  spark001   {0,1,2,3,4,5,6,7}
1       2770662  spark001   {0,1,2,3,4,5,6,7}
2       2770663  spark001   {8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
3       2770664  spark001   {24,25,26,27,28,29}
4       2770665  spark001   {24,25,26,27,28,29}
5       2770666  spark001   {24,25,26,27,28,29}

MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 1
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 1
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 1
MPI PROCESS 1 ON spark001; OPENMP THREAD 0; CPUID 7
MPI PROCESS 1 ON spark001; OPENMP THREAD 1; CPUID 0
MPI PROCESS 1 ON spark001; OPENMP THREAD 2; CPUID 1
MPI PROCESS 1 ON spark001; OPENMP THREAD 3; CPUID 3
MPI PROCESS 2 ON spark001; OPENMP THREAD 0; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 1; CPUID 12
MPI PROCESS 2 ON spark001; OPENMP THREAD 2; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 3; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 4; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 5; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 6; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 7; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 8; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 9; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 10; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 11; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 12; CPUID 15
MPI PROCESS 2 ON spark001; OPENMP THREAD 13; CPUID 11
MPI PROCESS 2 ON spark001; OPENMP THREAD 14; CPUID 10
MPI PROCESS 2 ON spark001; OPENMP THREAD 15; CPUID 14
MPI PROCESS 3 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 3 ON spark001; OPENMP THREAD 1; CPUID 28
MPI PROCESS 4 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 4 ON spark001; OPENMP THREAD 1; CPUID 26
MPI PROCESS 5 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 5 ON spark001; OPENMP THREAD 1; CPUID 27
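To see at a glance how many threads ended up on the same CPU, the program output can be tallied per CPU ID. The following is a sketch in bash and awk; the function name shared_cpus is hypothetical, and it assumes only that each line contains the keyword CPUID followed by its value, as in the listings above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read test-program output on stdin and list every CPU
# that was reported by more than one thread, together with the thread count.
shared_cpus() {
  tr -d '|;' |   # drop column separators so differently delimited listings parse alike
  awk '
    {
      cpu = ""
      for (i = 1; i < NF; i++)            # find the value that follows "CPUID"
        if ($i == "CPUID") cpu = $(i + 1)
      if (cpu != "") count[cpu]++
    }
    END {
      for (c in count)
        if (count[c] > 1) print "CPU " c ": " count[c] " threads"
    }' | sort -k2,2n
}

# Example: the four lines of Test 3 output
shared_cpus <<'EOF'
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 16
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 32
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 32
EOF
```

The tally is taken across all ranks on purpose, since in Test 6 several ranks share one CPU set and can collide with each other as well as with themselves.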

MPI_OpenMP.zip

mcgratta commented 2 months ago

Why not this:

Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun  -np 1 -env OMP_NUM_THREADS=4   taskset -c 0-3    ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=4   taskset -c 4-7    ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=16  taskset -c 8-23   ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=2   taskset -c 24-25  ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=2   taskset -c 26-27  ./mpi_openmp_test : \
        -np 1 -env OMP_NUM_THREADS=2   taskset -c 28-29  ./mpi_openmp_test

cxp484 commented 2 months ago

Yes, that's perfectly fine. It simply means you'll need to specify each MPI task individually. When only a single MPI process needs a different number of threads, I thought it would be easier to specify just that process separately and group the rest. :)
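Listing every process by hand gets tedious for larger cases, and the CPU ranges are only running sums of the thread counts. Here is a sketch of a generator in bash. The function name build_mpirun is hypothetical, and it assumes one CPU per thread with CPUs numbered contiguously from 0, as in the example above. It only prints the command; it does not run it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an mpirun command with one group per MPI process,
# giving each process a contiguous block of CPUs equal to its thread count.
# Arguments: the executable, then the OpenMP thread count for rank 0, 1, 2, ...
build_mpirun() {
  local exe=$1 next=0 sep="" cmd="mpirun" n last
  shift
  for n in "$@"; do
    last=$((next + n - 1))    # last CPU of this rank's block
    cmd+="${sep} -np 1 -env OMP_NUM_THREADS=${n} taskset -c ${next}-${last} ${exe}"
    sep=" :"                  # groups after the first are joined with a colon
    next=$((last + 1))
  done
  echo "$cmd"
}

# Thread counts for ranks 0..5 as in the example above
build_mpirun ./mpi_openmp_test 4 4 16 2 2 2
```

The printed line can be pasted into the job script or passed to eval once it has been checked against the CPUs the job actually received.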

mcgratta commented 2 months ago

For the moment, this is going to be a "special case". I don't want to work this into qfds.sh, for example. My thought would be to use qfds.sh to create a basic script that can be modified. Then I'd like to see if the detailed chemistry cases can be run by assigning more CPUs to the meshes that need it.

cxp484 commented 2 months ago

Agreed. Marcos and I will work on making the chemistry call thread-safe. It seems quite possible.