etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

DDalphaAMG threading issue #400

Closed kostrzewa closed 5 years ago

kostrzewa commented 6 years ago

@sbacchio @urbach @Finkenrath

I think I found the reason why the threading doesn't work right now... in the DDalphaAMG interface:

#ifdef OMP
  if(mg_omp_num_threads<=0)
      mg_init.number_openmp_threads=omp_num_threads;
  else
      mg_init.number_openmp_threads=mg_omp_num_threads;
#else
  mg_init.number_openmp_threads=1;
#endif

should be

#ifdef TM_USE_OMP
  if(mg_omp_num_threads<=0)
      mg_init.number_openmp_threads=omp_num_threads;
  else
      mg_init.number_openmp_threads=mg_omp_num_threads;
#else
  mg_init.number_openmp_threads=1;
#endif
kostrzewa commented 6 years ago

@chelmes: we will need to redo the benchmarks on Marconi A2!

kostrzewa commented 6 years ago

Okay, so threading now works, but the scaling is a little poor.

1 thread:

# Time for cloverdetratio3light monomial derivative: 7.078156e+00 s

7 threads:

# Time for cloverdetratio3light monomial derivative: 4.506283e+00 s

But that's something at least. This should allow us to run on much larger machine partitions on SuperMUC and similar machines.
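
For reference, the speed-up implied by the two timings above is quite modest:

7.078156 / 4.506283 ≈ 1.6 on 7 threads, i.e. a threading efficiency of roughly 22%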

kostrzewa commented 6 years ago

The reason I didn't notice the bug is that DDalphaAMG (or the interface) suppresses the log output

running with N openmp threads per core

if

mg_init.number_openmp_threads == 1
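
Presumably there is a guard along these lines somewhere in the library or the interface (a purely hypothetical sketch for illustration, not the actual DDalphaAMG code):

  /* hypothetical: the init message is only printed when more than one
     OpenMP thread was requested, so with the bug above it never appeared */
  if( mg_init.number_openmp_threads > 1 )
    printf("running with %d openmp threads per core\n",
           mg_init.number_openmp_threads);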

@sbacchio : What's your feeling about the observed speed-up (7 to 4.5 seconds)? Can this be improved?

sbacchio commented 6 years ago

Ok, good to know! :) Yes, yesterday when you explained it, it was clear that it was some kind of bug. I knew that the openMP implementation was not as good as the MPI one, but at least it was doing something. I'm sorry I used the wrong flag...

So regarding the timings, I guess that's on KNL? Or with a high number of MPI processes? What I remember is that openMP was working fine when used as a replacement for MPI tasks, but scaling beyond the maximum number of MPI tasks was quite poor.

kostrzewa commented 6 years ago

> So regarding the timings, I guess that's on KNL? Or with a high number of MPI processes? What I remember is that openMP was working fine when used as a replacement for MPI tasks, but scaling beyond the maximum number of MPI tasks was quite poor.

No, this was on SuperMUC, a 48^3x96 lattice on 324 nodes, 4 MPI tasks per node, 7 threads per task. The node-local lattice volume is only 8^3x16 in this case, so I think that explains the poor speed-up. On fewer nodes (four times fewer, say), it should be faster. Will try!
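
For completeness, the arithmetic behind that local volume: 324 nodes x 4 tasks/node = 1296 = 6^4 MPI tasks, so with a 6x6x6x6 process grid each task gets

(48/6)^3 x (96/6) = 8^3x16 sites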

kostrzewa commented 6 years ago

Yes, it's better, I get:

7 threads:

# Time for cloverdetratio3light monomial derivative: 8.929407e+00 s

1 thread:

# Time for cloverdetratio3light monomial derivative: 2.515449e+01 s

Looks good enough to get good trajectory times. 48^3x96 lattice on 81 nodes (4 MPI tasks per node, 7 threads per task)!
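
For comparison with the run above: 81 nodes x 4 tasks/node = 324 MPI tasks, so the local volume per task is four times larger, and the threading speed-up becomes

2.515449e+01 / 8.929407e+00 ≈ 2.8 on 7 threads, i.e. a threading efficiency of roughly 40%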

urbach commented 6 years ago

okay, that looks like very good progress!

sunpho84 commented 6 years ago

wow, this is very promising! do you think the fix could be used immediately?

robfre21 commented 6 years ago

Hi, congratulations for this good progress with threading for DDalphaAMG, cheers, RF

kostrzewa commented 6 years ago

@sunpho84 @robfre21 We shouldn't get ahead of ourselves. We've simply found a bug in the interface which prevented us from even running threaded inversions, but the scaling is still somewhat mediocre, at least in this first test. An improvement by a factor of 2.7 from using 7 times as many cores is poor enough to wash out all the improvement you get on the QPhiX side from being able to run with more threads than MPI tasks. For the case that I tested, then, a highly threaded tmLQCD+QPhiX+DDalphaAMG job is about as fast (or as slow, if you're a pessimist) as an MPI-heavy tmLQCD+QPhiX+DDalphaAMG job.

@sunpho84 If you correct that single line and recompile the version of tmLQCD that you're using, you could do a scaling study of DDalphaAMG on Marconi A2 as a function of the MPI task <-> thread balance. That would be very helpful, and it would be great to have it in the next few days.

kostrzewa commented 6 years ago

It does mean, though, that one can scale to slightly higher node counts, reducing the wall-clock time per trajectory at the cost of reduced efficiency.

kostrzewa commented 6 years ago

@sbacchio @Finkenrath : TODO in the DDalphaAMG interface: store the current number of threads, start the DDalphaAMG solve (which may change the number of active OpenMP threads) and restore it afterwards.
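
A minimal sketch of that save/restore pattern (the wrapper name and the solver call are only placeholders, assuming the TM_USE_OMP guard and the standard OpenMP runtime calls):

#ifdef TM_USE_OMP
#include <omp.h>
#endif

void mg_solve_with_thread_restore(void) {
#ifdef TM_USE_OMP
  /* remember the number of threads tmLQCD is currently using */
  const int tm_threads = omp_get_max_threads();
#endif

  /* ... call the DDalphaAMG solve here; it may call omp_set_num_threads()
     internally and leave a different number of threads active ... */

#ifdef TM_USE_OMP
  /* restore tmLQCD's thread count after the solve returns */
  omp_set_num_threads(tm_threads);
#endif
}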

kostrzewa commented 5 years ago

This has been resolved I think.