etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools for lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

reasons for allowing rank reordering? #555

Open kostrzewa opened 1 year ago

kostrzewa commented 1 year ago

@urbach Do you perhaps remember why rank reordering was allowed at the time? (almost 18 years ago :) )

It seems that the HPE engineers were able to find the culprit for our problems on LUMI-G and I think it might be as simple as switching this to 0.

https://github.com/etmc/tmLQCD/blob/443a08ff341590d8c3509a4ed4e06330418f71fa/mpi_init.c#L216
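For readers following along, here is a minimal sketch of what "no reordering" implies, assuming the flag in question is the `reorder` argument of `MPI_Cart_create` (the thread itself does not spell this out). With `reorder = 0`, each process keeps its original rank in the Cartesian communicator, and MPI numbers Cartesian ranks in row-major order, so the coordinates of a rank follow directly from the grid dimensions. The helpers `rank_to_coords` and `coords_to_rank` are hypothetical names for illustration only; they are not part of tmLQCD or MPI, and the sketch deliberately avoids calling MPI so it can be compiled on its own.

```c
#include <assert.h>

/* Hypothetical helper, not part of tmLQCD or MPI: coordinates that the
 * process with the given rank gets in an ndims-dimensional Cartesian grid
 * when ranks are left untouched (reorder = 0). MPI numbers Cartesian ranks
 * in row-major order, so the last dimension varies fastest. */
void rank_to_coords(int rank, int ndims, const int dims[], int coords[])
{
  int total = 1;
  for (int i = 0; i < ndims; i++) {
    assert(dims[i] > 0);
    total *= dims[i];
  }
  assert(rank >= 0 && rank < total);

  /* Peel off one dimension at a time, starting with the fastest one. */
  for (int i = ndims - 1; i >= 0; i--) {
    coords[i] = rank % dims[i];
    rank /= dims[i];
  }
}

/* Inverse mapping: row-major rank of the process at the given coordinates. */
int coords_to_rank(int ndims, const int dims[], const int coords[])
{
  int rank = 0;
  for (int i = 0; i < ndims; i++) {
    assert(coords[i] >= 0 && coords[i] < dims[i]);
    rank = rank * dims[i] + coords[i];
  }
  return rank;
}
```

Under this assumption, which Cartesian neighbours end up on the same node is decided entirely by how the launcher places ranks, whereas `reorder = 1` lets the MPI library permute ranks to suit the network as it sees fit.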

urbach commented 1 year ago

> @urbach Do you perhaps remember why rank reordering was allowed at the time? (almost 18 years ago :) )
>
> It seems that the HPE engineers were able to find the culprit for our problems on LUMI-G and I think it might be as simple as switching this to 0.
>
> https://github.com/etmc/tmLQCD/blob/443a08ff341590d8c3509a4ed4e06330418f71fa/mpi_init.c#L216

No, I don't remember this anymore. It could be that I introduced this for domain decomposition. I'd say let's try setting this to '0'!

kostrzewa commented 1 year ago

Thanks. Yes, we'll have to do a number of test runs on various machines to make sure that it doesn't break anything elsewhere...

I have a suspicion that it might have been relevant on the torus networks of the BG/L and /P and, in particular, later on for the /Q (we have not changed it since 2005, though). I don't know what kind of network the IBM p690 at JSC was configured with. Maybe it was relevant there already?

urbach commented 1 year ago

> Thanks. Yes, we'll have to do a number of test runs on various machines to make sure that it doesn't break anything elsewhere...

yes, agree!

> I have a suspicion that it might have been relevant on the torus networks on the BG/L and /P and in particular later on for the /Q (we never changed it since 2005 though). I don't know what kind of network the IBM p690 at JSC was configured with. Maybe it was relevant there already?

no, I never programmed for the network of the p690 directly. Blue Gene might be...

kostrzewa commented 1 year ago

Tests