Do I need to specify OMP_NUM_THREADS? If so, what is the recommended maximum number of cores?

geoschem / geos-chem-cloud

Run GEOS-Chem easily on AWS cloud

http://cloud.geos-chem.org

MIT License

Do I need to specify OMP_NUM_THREADS? If so, what is the recommended maximum number of cores? #28

Closed. FeiYao-Edinburgh closed this issue 4 years ago

FeiYao-Edinburgh commented 4 years ago

Hello,

I have gone through your wiki page on Setting Unix environment variables for GEOS-Chem. However, I found that the GEOSChem_env file does not specify OMP_NUM_THREADS. Given that, I wonder whether make -j4 mpbuild will make geos.mp use all the available cores automatically. If I choose to set OMP_NUM_THREADS as the maximum number of cores I have in ~/.bashrc, will it give me the quickest speed? Previously, I tried c5.9xlarge and c5.18xlarge, which have 18 and 36 cores, respectively. However, the latter one did not double the speed of the former one. I hope you can clarify these points for me. Many thanks in advance!

Yours faithfully, Fei

JiaweiZhuang commented 4 years ago

If I choose to set OMP_NUM_THREADS as the maximum number of cores I have in ~/.bashrc, will it give me the quickest speed?

When OMP_NUM_THREADS is unset, the number of OpenMP threads defaults to the number of available cores (at least on the Ubuntu AMI). You can also set it explicitly, but that shouldn't affect performance unless you deliberately set a lower number. Keeping it empty is convenient as it will choose the number of threads according to your EC2 instance size :)

You can use this OpenMP Hello World to print the number of threads used.

the latter one did not double the speed of the former one.

This is expected as most code does not scale perfectly (Amdahl's law). See GEOS-Chem_scalability for more info.
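To make the scaling point concrete, here is a minimal sketch of Amdahl's law, S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the run time that parallelizes and N is the number of cores. The helper name amdahl_speedup and the value p = 0.95 are assumptions chosen for illustration only; they are not measured GEOS-Chem numbers.

```shell
# Amdahl's law: speedup on n cores when a fraction p of the work is parallel.
# amdahl_speedup is a hypothetical helper; p=0.95 below is an assumed value.
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

amdahl_speedup 0.95 18   # prints 9.73
amdahl_speedup 0.95 36   # prints 13.09
```

Under that assumed p, going from 18 to 36 cores gives roughly 1.35 times the speed rather than 2 times, which is the kind of gap described in the question.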

FeiYao-Edinburgh commented 4 years ago

Keeping it empty is convenient as it will choose the number of threads according to your EC2 instance size :)

Does it apply to local servers similarly? I find this confusing because this page says: "IMPORTANT! If you forget to define OMP_NUM_THREADS in your Unix environment and/or run scripts, then GEOS-Chem will only execute using one core. This can cause GEOS-Chem to execute much more slowly than intended." If I set it, would you recommend, at least in theory, setting it to the maximum number of cores that I have so as to achieve the best performance? If not, how should I define it in run scripts? The only possible way that I can think is something like make -j4 mpbuild that mpbuild tells to use multiple processors, but how many will it use?

This is expected as most code does not scale perfectly (Amdahl's law). See GEOS-Chem_scalability for more info.

Thanks. Good to know.

JiaweiZhuang commented 4 years ago

The only possible way that I can think is something like make -j4 mpbuild that mpbuild tells to use multiple processors, but how many will it use?

The number of OpenMP threads is determined at run time, not compile time. make mpbuild is just to add the -fopenmp flag so that OpenMP is enabled.

Does it apply to local servers similarly?

The behavior might depend on the compiler. For example, from IBM XL compiler docs:

If you do not set the OMP_NUM_THREADS environment variable, the number of processors available is the default value to form a new team for the first encountered parallel construct.
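One way to avoid depending on a compiler's default is to set the variable in the run script itself. The following is a minimal sketch, not an official GEOS-Chem script: resolve_omp_threads is a hypothetical helper name, and the commented-out ./geos.mp line assumes the executable name mentioned earlier in this thread.

```shell
# Keep a user-supplied OMP_NUM_THREADS; otherwise fall back to the number
# of online logical CPUs reported by getconf.
# resolve_omp_threads is a hypothetical helper, not part of GEOS-Chem.
resolve_omp_threads() {
    if [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"
    else
        getconf _NPROCESSORS_ONLN
    fi
}

OMP_NUM_THREADS="$(resolve_omp_threads)"
export OMP_NUM_THREADS
echo "Running with OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./geos.mp   # launch the model here (assumed executable name)
```

Note that getconf _NPROCESSORS_ONLN counts logical CPUs, so on a machine with hyper-threading enabled it reports more than the number of physical cores.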

FeiYao-Edinburgh commented 4 years ago

The number of OpenMP threads is determined at run time, not compile time. make mpbuild is just to add the -fopenmp flag so that OpenMP is enabled.

Thanks for your great explanation! This really makes sense.

The behavior might depend on the compiler.

Hmm... I must admit that this is beyond my knowledge. I use the Intel Fortran compiler (ifort), although the GNU Fortran compiler (gfortran) is also installed on my two machines, which have 40 and 32 cores, respectively (see below). Do you have any suggestions for the value of OMP_NUM_THREADS for each machine?

CPU(s):                80
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2

CPU(s):                64
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             1

JiaweiZhuang commented 4 years ago

Do you have any suggestions for the value of OMP_NUM_THREADS for each machine?

You can use this test program, openmp_hello.c:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Every thread in the parallel region reports its ID and the team size. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* this thread's index */
        int nthreads = omp_get_num_threads();  /* threads in the team */
        printf("Hello World from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

which will print how many threads are actually used on your machine:

$ icc -qopenmp  -o openmp_hello.x openmp_hello.c  # Intel compiler
$ # gcc -fopenmp  -o openmp_hello.x openmp_hello.c  # or GNU compiler
$ unset OMP_NUM_THREADS  # use default value
$ ./openmp_hello.x  # on a 4-core machine
Hello World from thread 0 of 4
Hello World from thread 2 of 4
Hello World from thread 1 of 4
Hello World from thread 3 of 4
$ export OMP_NUM_THREADS=1  # force one thread
$ ./openmp_hello.x
Hello World from thread 0 of 1

FeiYao-Edinburgh commented 4 years ago

which will print how many threads are actually used on your machine

Thanks for your further reply. I have run the program you provided and found that the number of threads exactly equalled the number of CPU(s) listed above. Therefore, I only need to set OMP_NUM_THREADS a number less than the number of CPU(s) but the greater the better? Frankly, I almost got lost by cores, threads, CPU(s), and etc. I am sure that cores and CPU(s) are different things. However, this page says "The OMP_NUM_THREADS environment variable sets the number of computational cores (aka threads)", which equates cores and threads. Since the number of threads is identical to the number of CPU(s), are these three the same thing? Or is it just a coincidence on my machines?

I know that I need to read more to understand these things, and I will do so later on my own. Given the output of the code you provided, do you recommend specifying OMP_NUM_THREADS as the number of the threads or CPU(s) that I have or just not specifying it?

JiaweiZhuang commented 4 years ago

I almost got lost by cores, threads, CPU(s), and etc.

Most of the time, "core" is a physical/hardware concept (an attribute of your machine), while "thread" is a software concept (determined by your program). The definition can vary between contexts; sometimes people talk about "hardware threads". In general, though, you can think of "threads" as just a software thing, representing how many tasks the program executes concurrently.

Therefore, I only need to set OMP_NUM_THREADS a number less than the number of CPU(s) but the greater the better?

Most of the time you should set num_threads = num_cores, so that each software thread can run on exactly one hardware core. If num_threads < num_cores, there will be unused cores. If num_threads > num_cores, the physical cores will be oversubscribed, which often slows down the program.

do you recommend specifying OMP_NUM_THREADS as the number of the threads or CPU(s) that I have or just not specifying it?

You can explicitly set it to the number of cores, if you are unsure about the default behavior. On the EC2 instance, this is not necessary.
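The two numbers being compared here can be read off lscpu: physical cores are sockets times cores per socket, and multiplying by threads per core gives the logical CPU count. Below is a minimal sketch, assuming the lscpu field names shown earlier in this thread; count_cores is a hypothetical helper name.

```shell
# Reads `lscpu` output on stdin and prints "<physical cores> <logical CPUs>".
# count_cores is a hypothetical helper name, not part of any tool.
# Leading whitespace is allowed because some lscpu versions indent fields.
count_cores() {
    awk -F: '
        /^[ \t]*Thread\(s\) per core/ { tpc = $2 + 0 }
        /^[ \t]*Core\(s\) per socket/ { cps = $2 + 0 }
        /^[ \t]*Socket\(s\)/          { soc = $2 + 0 }
        END { print cps * soc, cps * soc * tpc }
    '
}

# Sample values copied from the single-socket machine in this thread.
count_cores <<'EOF'
CPU(s):                64
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             1
EOF
# prints: 32 64
```

On a live machine, lscpu | count_cores should report the same pair of numbers for that host.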

FeiYao-Edinburgh commented 4 years ago

Most of the time you should set num_threads = num_cores, so that each software thread can run on exactly one hardware core.

This is more or less the answer I was looking for! Nevertheless, I am still confused about a few things and would appreciate your further help. Given the following server information from lscpu | grep -E '^Thread|^Core|^Socket|^CPU\(', it is clear that num_cores=32x1=32 but num_threads=2x32x1=64. I believe this is because the server uses hyper-threading. Testing with openmp_hello.c also showed that 64 threads are actually used when running the program. In this case, should I export OMP_NUM_THREADS=32 or export OMP_NUM_THREADS=64? I feel it should be the latter, going by the name OMP_NUM_THREADS.

CPU(s):                64
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             1

so that each software thread can run on exactly one hardware core.

This is ideal for the case num_cores=num_threads. For hyper-thread case in which num_threads is certain times of num_cores, would it be great to export OMP_NUM_THREADS=num_threads? If so, export OMP_NUM_THREADS=num_threads is universal. If not, is it because several threads on the same core share common resources, which prevents them from running simultaneously? I feel the former is the answer?

On the EC2 instance, this is not necessary.

Yes. AWS is great in that it removes many technical barriers. Nevertheless, I can only use it as an additional resource because of limited funding.

This is expected as most code does not scale perfectly (Amdahl's law). See GEOS-Chem_scalability for more info.

This might be a very tricky question. Since most code does not scale perfectly, it would be very hard to determine the type of EC2 instances to use for different simulations so as to obtain the minimum price per total running time.

FeiYao-Edinburgh commented 4 years ago

Any further discussion?

JiaweiZhuang commented 4 years ago

For hyper-thread case in which num_threads is certain times of num_cores, would it be great to export OMP_NUM_THREADS=num_threads?

In my tests, hyperthreading does speed up GEOS-Chem OpenMP a bit, by ~10%. So export OMP_NUM_THREADS=64 should be slightly faster than export OMP_NUM_THREADS=32 in your case. This might not be true for other code, though. See Disabling Intel Hyper-Threading Technology on Amazon Linux if you are interested in more details.
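Because the benefit of hyperthreading depends on the code, the simplest check is to time the same run at both settings. The sketch below assumes a hypothetical helper named bench_threads; it uses date +%s, so the resolution is one second, which is coarse but adequate for long model runs.

```shell
# Run the same command once per thread count and print wall-clock seconds.
# bench_threads is a hypothetical helper; its body is a subshell "( ... )"
# so the loop variables do not leak into the calling shell.
bench_threads() (
    counts="$1"
    shift
    for n in $counts; do
        start=$(date +%s)
        # Set OMP_NUM_THREADS for this one command only; discard its stdout.
        OMP_NUM_THREADS="$n" "$@" > /dev/null
        end=$(date +%s)
        echo "OMP_NUM_THREADS=$n elapsed_s=$((end - start))"
    done
)

# Demo with a trivial stand-in command; replace it with the real model run,
# e.g. bench_threads "32 64" ./geos.mp (assumed executable name).
bench_threads "1 2" sleep 1
```

A single timing per setting can be noisy, so repeating each run a few times before comparing is a reasonable precaution.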