etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0
Juwels DDalphaAMG setup exits with Signal 11 #467

Closed (kostrzewa closed this issue 3 years ago)

kostrzewa commented 5 years ago

@sbacchio @Finkenrath I'm putting the finishing touches on the code to run on Juwels / SuperMUC-NG for new production / continuation. With the current software stage (not sure about other stages in combination with this particular version yet), the DDalphaAMG setup in the HMC (triggered by a light monomial) exits with signal 11. Did you observe anything similar on SuperMUC-NG?

Finkenrath commented 5 years ago

Yes, I see similar problems when using IntelMPI19 on both machines. I switched back to IntelMPI18, which seems to be fine.

kostrzewa commented 5 years ago

I see, thanks! So we get consistent behaviour. I switched back all the way to stage-2018a on Juwels (ICC 2018 / Intel MPI 2018) and this seems to work as expected. Did you do the same or did you use ICC 2019 and IntelMPI 2018?

Finkenrath commented 5 years ago

On SuperMUC I am currently using icc 18.0.5. I checked on JUWELS, and in the last runs I was using the modules "ParaStationMPI/5.2.1-1" with "Intel/2019.0.117-GCC-7.3.0" (which is icc 19). I initially ran with Intel 18 (icc + IntelMPI) due to the scaling problems of IntelMPI19 on JUWELS at the beginning of 2019 (not sure if they are solved). I think on SuperMUC I switched directly to Intel 2018; however, I would like to re-run some benchmarks using the updated environment with the energy optimization options (maybe I can revisit IntelMPI19 at that time).

Finkenrath commented 5 years ago

I recently ran some checks on Juwels with icc/2019.3.199-GCC-8.3.0. When the compiler optimization is reduced to -O2, DDalphaAMG does not exit with signal 11 (with -O3, however, the issue is still there).
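
When comparing builds like this (for example an -O2 and an -O3 binary), it can help to record how each run terminated rather than reading it off the job log. The following is a minimal Python sketch, not part of tmLQCD or DDalphaAMG: the names describe_exit and run_and_describe are hypothetical, and the crashing child process is only a stand-in for a real solver binary. It relies on the POSIX convention that subprocess reports a child killed by signal N with return code -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means the child was killed by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


def run_and_describe(cmd):
    """Run a command (list of arguments) and describe how it terminated."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Stand-in for a real binary (assumption): a child Python process that
    # sends SIGSEGV to itself, so the parent sees a signal termination.
    crash = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print(run_and_describe(crash))
```

Note that when the binary is started through an MPI launcher or a batch scheduler, the launcher's own exit status is what the parent process sees, and how a signal in one rank is reported depends on the launcher, so this sketch applies most directly to a binary started on its own.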

kostrzewa commented 5 years ago

Interesting, thanks for the update! Might be some instruction reordering thing then...

kostrzewa commented 5 years ago

@Finkenrath just a follow-up question: did you compile just DDalphaAMG with -O2 or both the library and tmLQCD?

Finkenrath commented 5 years ago

I tested it only with the DDalphaAMG executable and have not run tmLQCD yet. If I get some results on that, I will let you know.

kostrzewa commented 5 years ago

I see, sounds good, thanks.