equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).

Detect over-spending of CPU #8674

Open · berland opened this issue 1 month ago

berland commented 1 month ago

The forward model step runner (job_dispatch) already reports memory usage back to the ERT application. It could similarly report CPU time consumption. Today, the ERT GUI only shows the wall-clock duration of a forward model step.

It is also possible to ask the OS for the CPU time of a process and its descendants. When the ERT GUI receives this information, it can compare CPU time to wall-clock time to detect whether parallelization has been at play, and if so, compare the degree of parallelism to NUM_CPU. Typically, if NUM_CPU is 1 and a process has spent significantly more CPU time than wall-clock time, a warning should be reported back to the user.
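
A minimal sketch of the detection idea, assuming the runner can poll the step's process tree with psutil; the function names and the 10% slack factor are illustrative assumptions, not ERT's actual implementation:

import psutil

def cpu_seconds_with_descendants(proc: psutil.Process) -> float:
    """Sum user+system CPU time over proc and its live descendants."""
    total = 0.0
    for p in [proc] + proc.children(recursive=True):
        try:
            times = p.cpu_times()
            total += times.user + times.system
        except psutil.NoSuchProcess:
            pass  # descendant exited between listing and polling
    return total

def warn_if_cpu_overspent(proc, wall_seconds, num_cpu=1, slack=1.1):
    cpu_seconds = cpu_seconds_with_descendants(proc)
    # CPU time well above NUM_CPU * wall clock implies more parallelism
    # than the user reserved through NUM_CPU.
    if wall_seconds > 0 and cpu_seconds > slack * num_cpu * wall_seconds:
        print(
            f"WARNING: {cpu_seconds:.0f}s CPU spent in {wall_seconds:.0f}s "
            f"wall clock, but NUM_CPU is {num_cpu}"
        )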

Steps to achieve this:

berland commented 1 month ago

Proof of concept:

$ /usr/lib64/openmpi/bin/mpicxx -fopenmp -std=c++17 -o omp_mpi omp_mpi.c -lgomp
$ cat runme.sh 
/usr/lib64/openmpi/bin/mpirun -np 8 ./omp_mpi
$ time bash runme.sh
I'm thread 2 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 2 out of 8, while hardware_concurrency reports 10 processors
[...snip...]
I'm thread 4 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 5 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 6 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors
I'm thread 1 out of 10 on MPI process nr. 0 out of 8, while hardware_concurrency reports 10 processors

real    0m5.723s
user    0m31.502s
sys 0m11.343s
$ cat omp_mpi.c 
#include <iostream>
#include <mpi.h>
#include <omp.h>
#include <sstream>
#include <thread>
#include <time.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each MPI rank spawns an OpenMP thread team; every thread reports
    // itself and then burns CPU, so CPU time far exceeds wall-clock time.
    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        const double ticks_per_sec = (double)CLOCKS_PER_SEC;
        // clock() measures CPU time for the whole process, so this
        // threshold is reached faster the more threads are spinning.
        clock_t start = clock();

        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();

        // Busy-wait to accumulate CPU time.
        volatile double dummy;
        while (1) {
            for (int i = 0; i < 1000; ++i) {
                dummy = i * 3.14159;
            }
            double elapsed = (double)(clock() - start) / ticks_per_sec;
            if (elapsed >= 200)
                break;
        }
    }
    MPI_Finalize();
    return 0;
}

berland commented 1 month ago

Experimenting with OMP_NUM_THREADS and the -np option, it seems we can only detect the over-spending when -np is greater than 1.
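
For reference, what time measures here can be reproduced from Python's standard library alone, assuming every process in the tree waits for its children (as bash and mpirun do). This is a sketch against the runme.sh session above, not ERT code:

import resource
import subprocess
import time

start = time.monotonic()
subprocess.run(["bash", "runme.sh"], check=True)
wall = time.monotonic() - start

# RUSAGE_CHILDREN accumulates CPU time of all waited-for descendants,
# so the whole mpirun process tree is included once it has exited.
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu = usage.ru_utime + usage.ru_stime
print(f"wall={wall:.1f}s cpu={cpu:.1f}s ratio={cpu / wall:.1f}")
# A ratio well above NUM_CPU would indicate over-spending of CPU.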

JHolba commented 1 month ago

Tested some on a RHEL8 node. time seems to always give correct numbers. The weird thing is that mpirun with -np 1 or 2 restricts OpenMP from running on any other physical core; you can see it if you run htop in a different terminal.

time OMP_NUM_THREADS=10 /usr/lib64/openmpi/bin/mpirun -np 1 ./omp_mpi

This will show 1 process at 100% and the others at 10%, but you can also see in htop that only 1 core is utilized. time will show the same user and total (real) time.

If you increase -np to 3 or higher, mpirun will no longer pin the program to specific cores.
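
As a side note, a runner could observe this core binding directly on Linux. A minimal sketch using the standard library; reading the affinity mask for this purpose is an assumption, not something ERT does today:

import os

def allowed_cores(pid: int = 0) -> int:
    """Number of cores the scheduler lets this process run on (pid 0 = self)."""
    return len(os.sched_getaffinity(pid))

# If mpirun has bound the step to a single core, CPU time cannot exceed
# wall-clock time, and ratio-based over-spending detection will not trigger.
print(f"Allowed to run on {allowed_cores()} of {os.cpu_count()} cores")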

JHolba commented 1 month ago

Adding --bind-to core when using -np >= 3 will give the same behavior as -np 1 or 2. Adding --bind-to none will give the same behavior as -np >= 3. You can add --report-bindings to see if processes are bound to cores or not.

--bind-to core -cpus-per-proc 2 will bind 2 cores per process. -np 2 --bind-to core -cpus-per-proc 2 will give 2 processes bound to 2 cores each.