Closed kostrzewa closed 5 years ago
@chelmes: we will need to redo the benchmarks on Marconi A2!
Okay, so threading now works, but the scaling is a little poor.
1 thread:
# Time for cloverdetratio3light monomial derivative: 7.078156e+00 s
7 threads:
# Time for cloverdetratio3light monomial derivative: 4.506283e+00 s
But that's something at least. This should allow us to run on much larger machine partitions on SuperMUC and similar machines.
The reason I didn't notice the bug is that DDalphaAMG (or the interface) suppresses the log output
`running with N openmp threads per core`
if
`mg_init.number_openmp_threads == 1`
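That suppression could look something like the following conditional (a hypothetical C sketch; the struct, field layout, and function name are illustrative, not the actual interface code):

```c
#include <stdio.h>

/* Hypothetical sketch of the suppressed log line; mg_init_t and
 * report_threading are illustrative names, not the real interface. */
typedef struct { int number_openmp_threads; } mg_init_t;

/* Returns 1 if the informational line was printed, 0 otherwise. */
static int report_threading(const mg_init_t *mg_init) {
  /* With number_openmp_threads == 1 nothing is printed, so a run
   * that silently falls back to one thread gives no hint at all. */
  if (mg_init->number_openmp_threads == 1)
    return 0;
  printf("running with %d openmp threads per core\n",
         mg_init->number_openmp_threads);
  return 1;
}
```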
@sbacchio : What's your feeling about the observed speed-up (7.1 s down to 4.5 s)? Can this be improved?
Ok, good to know! :) Yes, yesterday when you explained it, it was clear it was some kind of bug. I knew that the OpenMP implementation was not as good as MPI, but at least it was doing something. I'm sorry I used the wrong flag...
So regarding the timings, I guess that's for KNL? Or with a high number of MPI processes? What I remember is that OpenMP was working fine when used as a replacement for MPI tasks. Scaling beyond the maximum number of MPI tasks, however, was quite poor.
No, this was on SuperMUC, a 48^3x96 lattice on 324 nodes, 4 MPI tasks per node, 7 threads per task. The node-local lattice volume is only 8^3x16 in this case, so I think that explains the poor speed-up. On fewer nodes (four times fewer, say), it should be faster. Will try!
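The node-local volume quoted above follows directly from the decomposition; a quick check of the arithmetic (plain C, numbers taken from this run):

```c
/* Sites per MPI task for an even decomposition of the global lattice.
 * For the run above: 48^3 x 96 sites over 324 nodes x 4 tasks = 1296
 * ranks, which gives 8^3 x 16 = 8192 sites per task. */
static long local_volume(long global_sites, long tasks) {
  return global_sites / tasks;
}
```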
Yes, it's better, I get:
7 threads:
# Time for cloverdetratio3light monomial derivative: 8.929407e+00 s
1 thread:
# Time for cloverdetratio3light monomial derivative: 2.515449e+01 s
Looks good enough to get good trajectory times. 48^3x96 lattice on 81 nodes (4 MPI tasks per node, 7 threads per task)!
okay, that looks like very good progress!
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/etmc/tmLQCD/issues/400#issuecomment-337500088
Carsten Urbach e-mail: curbach@gmx.de urbach@hiskp.uni-bonn.de Fon : +49 (0)228 73 2379 skype : carsten.urbach URL: http://www.carsten-urbach.eu
wow, this is very promising! do you think the fix could be used immediately?
Hi, congratulations on this good progress with threading for DDalphaAMG. Cheers, RF
@sunpho84 @robfre21 We shouldn't get ahead of ourselves. We've simply found a bug in the interface which prevented us from even running threaded inversions, but the scaling is still somewhat mediocre, at least in this first test. An improvement by a factor of 2.7 from increasing the number of used cores by a factor of 7 is poor enough to wash out all the improvement you get on the QPhiX side from being able to run with more threads than MPI tasks. Thus, for the case that I tested, a highly threaded tmLQCD+QPhiX+DDalphaAMG job is about as fast (or slow, if you're a pessimist) as a high-MPI tmLQCD+QPhiX+DDalphaAMG job.
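For reference, the parallel efficiency implied by the timings quoted above (roughly 25.15 s on 1 thread vs. 8.93 s on 7 threads) works out to only about 40%; a small C helper to make that concrete:

```c
/* Parallel efficiency: speedup divided by the thread count.
 * With t1 = 25.154 s, tn = 8.929 s, n = 7 (the run quoted above)
 * this gives roughly 0.40, i.e. about 40% efficiency. */
static double parallel_efficiency(double t1, double tn, int n) {
  return (t1 / tn) / (double)n;
}
```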
@sunpho84 If you could correct the single line and recompile in the version of tmLQCD that you're using, you could do a scaling study of DDalphaAMG on Marconi A2 as a function of the MPI task <-> thread balance. This would be very helpful. It would be great to have that in the next few days.
It does mean, though, that one can scale to somewhat higher node counts, leading to a reduction of wall-clock time per trajectory, but at reduced efficiency.
@sbacchio @Finkenrath : TODO in DDalphaAMG interface: store the current number of threads, start the DDalphaAMG solve (which may change the number of active OpenMP threads), and reset it afterwards.
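A minimal sketch of that TODO, with hypothetical names: `get_threads`/`set_threads` stand in for `omp_get_max_threads()`/`omp_set_num_threads()` so the example is self-contained, and `ddalphaamg_solve` is a placeholder for the actual interface call (here it simulates the solver changing the thread count):

```c
/* Hypothetical sketch of the TODO above: save the OpenMP thread
 * count, run the solve (which may change it), then restore it.
 * The two helpers simulate omp_get_max_threads()/omp_set_num_threads();
 * ddalphaamg_solve is a placeholder for the real interface call. */
static int active_threads = 7;  /* simulated OpenMP runtime state */

static int  get_threads(void)  { return active_threads; }
static void set_threads(int n) { active_threads = n; }

static void ddalphaamg_solve(void) {
  set_threads(1);  /* the solver may change the active thread count */
}

static void solve_with_thread_restore(void) {
  const int saved = get_threads();  /* omp_get_max_threads() in tmLQCD */
  ddalphaamg_solve();
  set_threads(saved);               /* omp_set_num_threads(saved)      */
}
```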
This has been resolved I think.
@sbacchio @urbach @Finkenrath
I think I found the reason why the threading doesn't work right now... in the DDalphaAMG interface:
should be