Thanks. I will try it.
We can set a different number of threads per MPI process by calling omp_set_num_threads(my_rank_nthr). See: https://www.openmp.org/spec-html/5.0/openmpsu110.html
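For reference, here is a minimal sketch of that idea in C (the per-rank thread counts in nthr_of_rank are made up for illustration; in a real code the mapping would presumably come from the input file):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Hypothetical per-rank thread counts for a 6-rank job. */
    int nthr_of_rank[] = {4, 4, 16, 2, 2, 2};
    if (my_rank < (int)(sizeof nthr_of_rank / sizeof nthr_of_rank[0]))
        omp_set_num_threads(nthr_of_rank[my_rank]);

    #pragma omp parallel
    {
        #pragma omp single
        printf("MYRANK= %d NUM_OF_THREADS= %d\n", my_rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}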
Sure, but the question is how you tell Slurm to provide a different number of cores to each MPI process. Slurm cannot wait until FDS starts to allocate the extra cores to certain MPI processes.
I think this is what we need:
I tried several options on spark, including heterogeneous job submissions. The following configuration worked for me (see the attached code and job submission script):
mpirun -np 2 -env OMP_NUM_THREADS=4 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=16 ./mpi_openmp_test : \
-np 3 -env OMP_NUM_THREADS=2 ./mpi_openmp_test
The output is as follows:
MYRANK= 0 NUM_OF_THREADS= 4
MYRANK= 1 NUM_OF_THREADS= 4
MYRANK= 2 NUM_OF_THREADS= 16
MYRANK= 3 NUM_OF_THREADS= 2
MYRANK= 4 NUM_OF_THREADS= 2
MYRANK= 5 NUM_OF_THREADS= 2
It appears that the MPI rank (RANK_ID) is assigned sequentially based on the order of the processes in the mpirun command. This means that if you know which MPI process requires more threads, you can simply place it in the appropriate position in the mpirun command.
How do you know that you have been allocated 30 cores?
I am not sure. Let me check if there is a command to know how many cores are allocated to a job.
You can look at the .log file. For example, in race_test_4.log I get this: one MPI process pinned to 4 CPUs.
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1467670 pele-09 2,3,33,53
This is what the pinning looks like for the sample case
[0] MPI startup(): ===== CPU pinning =====
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 2343160 spark002 {5}
[0] MPI startup(): 1 2343161 spark002 {6}
[0] MPI startup(): 2 2343162 spark002 {21}
[0] MPI startup(): 3 2343163 spark002 {22}
[0] MPI startup(): 4 2343164 spark002 {37}
[0] MPI startup(): 5 2343165 spark002 {53}
I assume that we only get 1 core per MPI process. The OpenMP threads are, I suppose, crammed onto a single core.
Thanks, Jason and Kevin. CPU pinning is a great idea, and I'm currently testing it. I've also written a C program that retrieves the CPU ID at runtime (using the sched_getcpu() system call) to show which CPU a particular thread is executing on. The code is attached.
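For reference, a minimal sketch of that kind of test (not the attached code itself; the attached version also keeps each thread busy in a loop, which is omitted here; compile with something like mpicc -fopenmp):

#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, namelen;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(node, &namelen);

    #pragma omp parallel
    {
        /* Report which CPU this particular thread is executing on right now. */
        printf("MPI PROCESS %d ON %s; OPENMP THREAD %d; CPUID %d\n",
               rank, node, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}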
Test 1: 1 MPI process, 4 threads, only 1 CPU requested through ntasks, and I_MPI_PIN_DOMAIN is not set. As expected, all threads are executed on CPU 0.
Input:
#SBATCH --ntasks=1
mpirun -np 1 -env OMP_NUM_THREADS=4 ./mpi_openmp_test
Output:
Rank Pid Node name Pin cpu
0 2769504 spark001 {0}
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 0
Test 2: 1 MPI process, 4 threads, only 1 CPU requested through ntasks, I_MPI_PIN_DOMAIN=omp. The output is exactly the same as above; since ntasks requests only 1 CPU, the scheduler cannot allocate more CPUs to the job.
Input:
#SBATCH --ntasks=1
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=4 ./mpi_openmp_test
Test 3: 1 MPI process, 4 threads, 4 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. Now the scheduler allocates 4 different CPUs to the job, but interestingly two of the threads ran on CPUID 32. Sometimes 4 individual CPUs are used (which I expect is the normal case), but sometimes three of the threads share the same CPU ID. I am not sure about this behavior. Note that I have an infinite do-while loop in each thread, so it is not the case that one thread finished its work and released its CPU to another.
Input:
#SBATCH --ntasks=4
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=4 ./mpi_openmp_test
Output:
Rank Pid Node name Pin cpu
0 2769749 spark001 {0,16,32,48}
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 16
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 32
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 32
Test 4: 1 MPI process, 8 threads, 8 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. The behavior is exactly the same as in Test 3.
Input:
#SBATCH --ntasks=8
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=8 ./mpi_openmp_test
Output:
Rank Pid Node name Pin cpu
0 2770372 spark001 {0,1,16,17,32,33,48,49}
MPI PROCESS 0 ON spark001; OPENMP THREAD 4; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 7; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 6; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 48
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 33
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 49
MPI PROCESS 0 ON spark001; OPENMP THREAD 5; CPUID 48
Now let's move to multiple MPI processes with different thread counts. Test 5: 6 MPI processes, 30 total threads, 30 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp. I sorted the output by MPI process. The issue with mpirun is that, during pinning, it only reads the OMP_NUM_THREADS variable from the first process. As a result, the MPI processes with 16 and 2 threads are also allocated only 4 CPUs each.
Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun -np 2 -env OMP_NUM_THREADS=4 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=16 ./mpi_openmp_test : \
-np 3 -env OMP_NUM_THREADS=2 ./mpi_openmp_test
Output:
Rank Pid Node name Pin cpu
0 2770482 spark001 {2,3,4,5}
1 2770483 spark001 {6,7,8,9}
2 2770484 spark001 {18,19,20,21}
3 2770485 spark001 {22,23,24,25}
4 2770486 spark001 {34,35,36,37}
5 2770487 spark001 {38,39,40,50}
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 5
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 2
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 4
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 3
MPI PROCESS 1 ON spark001; OPENMP THREAD 0; CPUID 9
MPI PROCESS 1 ON spark001; OPENMP THREAD 1; CPUID 6
MPI PROCESS 1 ON spark001; OPENMP THREAD 2; CPUID 8
MPI PROCESS 1 ON spark001; OPENMP THREAD 3; CPUID 7
MPI PROCESS 2 ON spark001; OPENMP THREAD 0; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 1; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 2; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 3; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 4; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 5; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 6; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 7; CPUID 21
MPI PROCESS 2 ON spark001; OPENMP THREAD 8; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 9; CPUID 19
MPI PROCESS 2 ON spark001; OPENMP THREAD 10; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 11; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 12; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 13; CPUID 18
MPI PROCESS 2 ON spark001; OPENMP THREAD 14; CPUID 20
MPI PROCESS 2 ON spark001; OPENMP THREAD 15; CPUID 19
MPI PROCESS 3 ON spark001; OPENMP THREAD 0; CPUID 25
MPI PROCESS 3 ON spark001; OPENMP THREAD 1; CPUID 22
MPI PROCESS 4 ON spark001; OPENMP THREAD 0; CPUID 37
MPI PROCESS 4 ON spark001; OPENMP THREAD 1; CPUID 34
MPI PROCESS 5 ON spark001; OPENMP THREAD 0; CPUID 50
MPI PROCESS 5 ON spark001; OPENMP THREAD 1; CPUID 50
Test 6: 6 MPI processes, 30 total threads, 30 CPUs requested through ntasks, I_MPI_PIN_DOMAIN=omp, but this time taskset is used to assign the CPUs directly. Now MPI process 2 is allocated exactly 16 CPUs, processes 0 and 1 share 8 CPUs, and processes 3, 4, and 5 share 6 CPUs. However, during execution I notice that different threads use the same CPUs even when others are available. I am unable to explain this behavior.
Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun -np 2 -env OMP_NUM_THREADS=4 taskset -c 0-7 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=16 taskset -c 8-23 ./mpi_openmp_test : \
-np 3 -env OMP_NUM_THREADS=2 taskset -c 24-29 ./mpi_openmp_test
Output:
Rank Pid Node name Pin cpu
0 2770661 spark001 {0,1,2,3,4,5,6,7}
1 2770662 spark001 {0,1,2,3,4,5,6,7}
2 2770663 spark001 {8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
3 2770664 spark001 {24,25,26,27,28,29}
4 2770665 spark001 {24,25,26,27,28,29}
5 2770666 spark001 {24,25,26,27,28,29}
MPI PROCESS 0 ON spark001; OPENMP THREAD 0; CPUID 0
MPI PROCESS 0 ON spark001; OPENMP THREAD 1; CPUID 1
MPI PROCESS 0 ON spark001; OPENMP THREAD 2; CPUID 1
MPI PROCESS 0 ON spark001; OPENMP THREAD 3; CPUID 1
MPI PROCESS 1 ON spark001; OPENMP THREAD 0; CPUID 7
MPI PROCESS 1 ON spark001; OPENMP THREAD 1; CPUID 0
MPI PROCESS 1 ON spark001; OPENMP THREAD 2; CPUID 1
MPI PROCESS 1 ON spark001; OPENMP THREAD 3; CPUID 3
MPI PROCESS 2 ON spark001; OPENMP THREAD 0; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 1; CPUID 12
MPI PROCESS 2 ON spark001; OPENMP THREAD 2; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 3; CPUID 23
MPI PROCESS 2 ON spark001; OPENMP THREAD 4; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 5; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 6; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 7; CPUID 17
MPI PROCESS 2 ON spark001; OPENMP THREAD 8; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 9; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 10; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 11; CPUID 16
MPI PROCESS 2 ON spark001; OPENMP THREAD 12; CPUID 15
MPI PROCESS 2 ON spark001; OPENMP THREAD 13; CPUID 11
MPI PROCESS 2 ON spark001; OPENMP THREAD 14; CPUID 10
MPI PROCESS 2 ON spark001; OPENMP THREAD 15; CPUID 14
MPI PROCESS 3 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 3 ON spark001; OPENMP THREAD 1; CPUID 28
MPI PROCESS 4 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 4 ON spark001; OPENMP THREAD 1; CPUID 26
MPI PROCESS 5 ON spark001; OPENMP THREAD 0; CPUID 29
MPI PROCESS 5 ON spark001; OPENMP THREAD 1; CPUID 27
Why not this:
Input:
#SBATCH --ntasks=30
I_MPI_PIN_DOMAIN=omp
mpirun -np 1 -env OMP_NUM_THREADS=4 taskset -c 0-3 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=4 taskset -c 4-7 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=16 taskset -c 8-23 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=2 taskset -c 24-25 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=2 taskset -c 26-27 ./mpi_openmp_test : \
-np 1 -env OMP_NUM_THREADS=2 taskset -c 28-29 ./mpi_openmp_test
Yes, that's perfectly fine. It simply means you'll need to specify each MPI task individually. When you require a single MPI process with a different number of threads, I thought it would be easier to specify just that process separately; the rest can be grouped. :)
For the moment, this is going to be a "special case". I don't want to work this into qfds.sh, for example. My thought would be to use qfds.sh to create a basic script that can be modified. Then I'd like to see if the detailed chemistry cases can be run by assigning more CPUs to the meshes that need it.
Agreed. Marcos and I will work on making the chemistry call thread-safe. It seems quite possible.
Here is a simple Hello World program that uses both OpenMP and MPI. Can we run this with a different number of OpenMP threads per MPI process?
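For reference, a minimal MPI+OpenMP hello world of this kind (a sketch, not necessarily the attached program) looks like:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, namelen;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(node, &namelen);

    /* Each OpenMP thread of each MPI rank announces itself. */
    #pragma omp parallel
    printf("Hello from rank %d of %d on %s, thread %d of %d\n",
           rank, nranks, node, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

Compiled with something like mpicc -fopenmp hello.c, it can be launched with a different OMP_NUM_THREADS per executable block using the colon-separated mpirun syntax shown earlier in this thread.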