somera opened this issue 4 years ago
Here are the results when I run it on both NUCs:
$ mpirun ... --npernode 2 --np 4 ./Wator.out 1000 1000 1 -> 2 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 4 OpenMP threads: 1 execution time: 74.708360
$ mpirun ... --npernode 3 --np 6 ./Wator.out 1000 1000 1 -> 3 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 6 OpenMP threads: 1 execution time: 72.989487
$ mpirun ... --npernode 4 --np 8 ./Wator.out 1000 1000 1 -> 4 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 8 OpenMP threads: 1 execution time: 70.067852
$ mpirun ... --npernode 8 --np 16 ./Wator.out 1000 1000 1 -> 8 MPI processes on each NUC
Dimension: 1000 Iterations: 1000 MPI processes: 16 OpenMP threads: 1 execution time: 57.654790
Here are the results when I run it on one NUC only:
$ mpirun ... --np 2 ./Wator.out 1000 1000 1 -> 2 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 2 OpenMP threads: 1 execution time: 35.135446
$ mpirun ... --np 4 ./Wator.out 1000 1000 1 -> 4 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 4 OpenMP threads: 1 execution time: 14.892726
$ mpirun ... --np 6 ./Wator.out 1000 1000 1 -> 6 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 6 OpenMP threads: 1 execution time: 15.576039
$ mpirun ... --np 8 ./Wator.out 1000 1000 1 -> 8 MPI processes on one NUC
Dimension: 1000 Iterations: 1000 MPI processes: 8 OpenMP threads: 1 execution time: 13.424243
I don't understand why the runtime is so much slower when I run it on both NUCs.
Hi somera! Thanks for stopping by.
Bear in mind that I haven't touched this code since June because I was too busy. In the future, when I have some spare time, I'll try to work on this, but I can't guarantee anything. Concerning the performance decrease with more OpenMP threads: this is a known issue and one I couldn't solve; possible causes are:
Last, I already know that the implementation suffers from a considerable bottleneck, which may actually get worse on machines with separate memories: the matrix underlying the calculations is communicated in full to each slave, instead of being kept only on the master, with each slave receiving only the section it needs for its calculations, as it should be. If and when I have the time to fix this in the implementation, I'd guess the timings could change considerably.
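Roughly, what I mean is something like the following sketch (hypothetical names, standalone, not the repository's actual code), where each rank receives only its block of rows via MPI_Scatterv instead of the whole grid:

// Hypothetical sketch: scatter only the row block each rank needs,
// instead of sending the complete dim x dim grid to every slave.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int dim = 1000;                      // grid of dim x dim cells (ints here)
    std::vector<int> grid;
    if (rank == 0) grid.assign(dim * dim, 0);  // only the master holds the full grid

    std::vector<int> counts(nprocs), displs(nprocs);
    int offset = 0;
    for (int r = 0; r < nprocs; ++r) {
        int rows = dim / nprocs + (r < dim % nprocs ? 1 : 0);
        counts[r] = rows * dim;                // number of cells going to rank r
        displs[r] = offset;
        offset += counts[r];
    }

    std::vector<int> local(counts[rank]);      // each rank stores only its own slice
    MPI_Scatterv(rank == 0 ? grid.data() : nullptr, counts.data(), displs.data(),
                 MPI_INT, local.data(), counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    // ... compute on the local rows, exchange halo rows with neighbours,
    // then MPI_Gatherv the updated slices back to the master ...

    MPI_Finalize();
    return 0;
}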
I'm sorry I can't be of more help here, but unfortunately I'll be really busy for the next few weeks and can't dedicate time to this.
If you find something useful, or happen to solve some of the problems above, feel free to collaborate.
Regards, Dygwah
Hi Dygwah, thanks for the response.
I'm busy too. But when I have some time I'll try to understand the code mix, or I'll try to rewrite it to use only Open MPI.
Regards, Rafal
I'm trying to understand the code. ;)
I didn't know it was possible to mix Open MPI and OpenMP code. ;) I have two NUC mini PCs, each with an Intel i5-8259U, i.e. 4C/8T. Sounds good. I installed only Ubuntu Server on both NUCs. In this case I'm testing the parallel-no-gui version.
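Just to check my own understanding of how the two models combine, here is a generic hello-world style sketch (not code from this repository): mpirun creates the MPI processes, and each process then opens its own OpenMP parallel region.

// Minimal hybrid MPI + OpenMP example (generic illustration).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        // every MPI rank spawns its own team of OMP_NUM_THREADS threads
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}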
I'm not using your start script. My start parameters are:
mpirun --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file ...
--use-hwthread-cpus -> use every CPU hardware thread
--mca btl tcp,self --mca btl_tcp_if_include eno1 -> because I have other network interfaces, Open MPI should use eno1 with TCP
The host_file contains both NUCs.
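For illustration, the host_file contains something like this (the hostnames here are placeholders, and the slot counts are optional):

# example host_file
nuc1 slots=8
nuc2 slots=8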
For better readability:
start_part = mpirun --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file
The results are shown at the top. You can see that when I start Wator with 16 MPI processes, it is faster with OpenMP threads = 1.
And there is another problem: when I set OpenMP threads = 2, OpenMP starts 8 threads on every NUC, one for every CPU thread.
I can do export OMP_NUM_THREADS=1, but this only takes effect on the master NUC. Here is the run with OpenMP threads = 2:
mpirun ... -np 16 ./Wator.out 1000 300 2
And with too many OpenMP threads the application slows down.
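Maybe forwarding the variable through mpirun's -x option would help here, since -x exports an environment variable to the ranks on the other node as well; something like this (untested with Wator):

export OMP_NUM_THREADS=1   # on its own this only affects ranks started on the local node
mpirun -x OMP_NUM_THREADS=1 --use-hwthread-cpus --mca btl tcp,self --mca btl_tcp_if_include eno1 --hostfile host_file -np 16 ./Wator.out 1000 300 2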