somera opened this issue 4 years ago
Here are the results when I run it on both NUCs:
$ mpirun ... --npernode 2 --np 4 ./Wator.out 1000 1000 1 -> 2 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 4 OpenMP threads: 1 execution time: 74.708360
$ mpirun ... --npernode 3 --np 6 ./Wator.out 1000 1000 1 -> 3 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 6 OpenMP threads: 1 execution time: 72.989487
$ mpirun ... --npernode 4 --np 8 ./Wator.out 1000 1000 1 -> 4 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 8 OpenMP threads: 1 execution time: 70.067852
$ mpirun ... --npernode 8 --np 16 ./Wator.out 1000 1000 1 -> 8 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 16 OpenMP threads: 1 execution time: 57.654790
Here are the results when I run it on one NUC only:
$ mpirun ... --np 2 ./Wator.out 1000 1000 1 -> 2 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 2 OpenMP threads: 1 execution time: 35.135446
$ mpirun ... --np 4 ./Wator.out 1000 1000 1 -> 4 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 4 OpenMP threads: 1 execution time: 14.892726
$ mpirun ... --np 6 ./Wator.out 1000 1000 1 -> 6 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 6 OpenMP threads: 1 execution time: 15.576039
$ mpirun ... --np 8 ./Wator.out 1000 1000 1 -> 8 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 8 OpenMP threads: 1 execution time: 13.424243
I don't understand why the runtime is so much slower when I run it on both NUCs.
Hi somera! Thanks for stopping by.
Bear in mind that I haven't touched this code since June because I was too busy. In the future, when I have some spare time, I'll try to work on this, but I can't guarantee anything. Concerning the performance decrease with more OpenMP threads: this is a known issue and one I couldn't solve; possible causes are:
Last, I already know that the implementation suffers from a considerable bottleneck, which may actually get worse on machines with separate memories: the matrix underlying the calculations is communicated in full to each slave, instead of being kept only on the master, with each slave receiving only the section it needs for its calculations, as it should be. If and when I have the time to fix this in the implementation, I'd guess the timings could change considerably.
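Roughly, what I mean is something like the following sketch (hypothetical names, standalone, not the repository's actual code), where each rank receives only its block of rows via MPI_Scatterv instead of the whole grid:

// Hypothetical sketch: scatter only the row block each rank needs,
// instead of sending the complete dim x dim grid to every slave.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int dim = 1000;                      // grid of dim x dim cells (ints here)
    std::vector<int> grid;
    if (rank == 0) grid.assign(dim * dim, 0);  // only the master holds the full grid

    std::vector<int> counts(nprocs), displs(nprocs);
    int offset = 0;
    for (int r = 0; r < nprocs; ++r) {
        int rows = dim / nprocs + (r < dim % nprocs ? 1 : 0);
        counts[r] = rows * dim;                // number of cells going to rank r
        displs[r] = offset;
        offset += counts[r];
    }

    std::vector<int> local(counts[rank]);      // each rank stores only its own slice
    MPI_Scatterv(rank == 0 ? grid.data() : nullptr, counts.data(), displs.data(),
                 MPI_INT, local.data(), counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    // ... compute on the local rows, exchange halo rows with neighbours,
    // then MPI_Gatherv the updated slices back to the master ...

    MPI_Finalize();
    return 0;
}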
I'm sorry I can't be of more help here, but unfortunately I'll be really busy for the next few weeks and can't dedicate time to this.
If you find something useful, or happen to solve some of the problems above, feel free to collaborate.
Regards, Dygwah
Hi Dygwah, thanks for the response.
I'm busy too. But when I have some time I'll try to understand the code mix, or I'll try to rewrite it to use only Open MPI.
Regards, Rafal
I'm trying to understand the code. ;)
I didn't know it was possible to mix Open MPI and OpenMP code. ;) I have two NUC mini PCs, each with an Intel i5-8259U, i.e. 4C/8T. Sounds good. I installed only Ubuntu Server on both NUCs. In this case I'm testing the parallel-no-gui version.
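Just to check my own understanding of how the two models combine, here is a generic hello-world style sketch (not code from this repository): mpirun creates the MPI processes, and each process then opens its own OpenMP parallel region.

// Minimal hybrid MPI + OpenMP example (generic illustration).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        // every MPI rank spawns its own team of OMP_NUM_THREADS threads
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}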
I'm not using your start script. My start parameters are:
mpirun --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file ...
--use-hwthread-cpus -> use every CPU hardware thread
--mca btl tcp,self --mca btl_tcp_if_include eno1 -> because I have other network interfaces, Open MPI should use eno1 with TCP
The host_file contains both NUCs.
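For illustration, the host_file contains something like this (the hostnames here are placeholders, and the slot counts are optional):

# example host_file
nuc1 slots=8
nuc2 slots=8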
For better readability:
start_part = mpirun --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file
The results are shown at the top. You can see that when I start Wator with 16 MPI processes, it is faster with OpenMP threads = 1.
And there is another problem: when I set OpenMP threads = 2, OpenMP starts 8 threads on every NUC, one for every CPU thread.
I can do export OMP_NUM_THREADS=1, but this only takes effect on the master NUC. Here is the run with OpenMP threads = 2:
mpirun ... -np 16 ./Wator.out 1000 300 2
And with too many OpenMP threads the application slows down.
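Maybe forwarding the variable through mpirun's -x option would help here, since -x exports an environment variable to the ranks on the other node as well; something like this (untested with Wator):

export OMP_NUM_THREADS=1   # on its own this only affects ranks started on the local node
mpirun -x OMP_NUM_THREADS=1 --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file -np 16 ./Wator.out 1000 300 2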