Open berland opened 1 month ago
Proof of concept:
$ /usr/lib64/openmpi/bin/mpicxx -fopenmp -std=c++17 -o omp_mpi omp_mpi.c -lgomp
$ cat runme.sh
/usr/lib64/openmpi/bin/mpirun -np 8 ./omp_mpi
$ time bash runme.sh
I'm thread 2 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
[...snip...]
I'm thread 4 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 5 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 1 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
real 0m5.723s
user 0m31.502s
sys 0m11.343s
$ cat omp_mpi.c
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        const double ticks_per_sec = (double)CLOCKS_PER_SEC;
        // Note: clock() measures CPU time for the whole process,
        // summed over all threads, not wall-clock time.
        clock_t start = clock();
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
        // Busy-wait to generate CPU load until roughly 200 CPU-seconds
        // have been consumed by the process.
        volatile double dummy;
        while (1) {
            for (int i = 0; i < 1000; ++i) {
                dummy = i * 3.14159;
            }
            double elapsed = (double)(clock() - start) / ticks_per_sec;
            if (elapsed >= 200)
                break;
        }
    }
    MPI_Finalize();
    return 0;
}
Experimenting with OMP_NUM_THREADS and the -np option, it seems we can only detect the case when -np is more than 1.
Tested some on a RHEL8 node. time seems to always give correct numbers. The odd thing is that mpirun with -np 1 or 2 restricts OpenMP from running on any other physical core; you can see it if you run htop in a different terminal.

time OMP_NUM_THREADS=10 /usr/lib64/openmpi/bin/mpirun -np 1 ./omp_mpi

This will show 1 process at 100% and the others at 10%, and htop confirms that only 1 core is utilized. time will show the same user and real time.

If you increase -np to 3 or higher, the cores the program runs on are no longer pinned.
Adding --bind-to core when using -np >= 3 gives the same behavior as -np 1 or 2.
Adding --bind-to none gives the same behavior as -np >= 3.
You can add --report-bindings to see whether processes are bound to cores or not.
--bind-to core -cpus-per-proc 2 will bind 2 cores per process.
-np 2 --bind-to core -cpus-per-proc 2 will give 2 processes bound to 2 cores each.
The forward model step runner (job_dispatch) already reports memory usage back to the ERT application; it can similarly report CPU time consumption. Today the ERT GUI reports the wall-clock duration of a forward model step. It is also possible to ask the OS for the CPU time of a process and its descendants. When the ERT GUI receives this information, it can compare CPU time to wall-clock time and detect whether parallelization has been at play, and if so, compare it to NUM_CPU. Typically, if NUM_CPU is 1 and a process has used significant time running in parallel, a warning should be reported back to the user.
Steps to achieve this: