etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It consists mainly of an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, and of inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

possible regression in qphix_interface or qphix #374

Closed kostrzewa closed 5 years ago

kostrzewa commented 7 years ago

The final residual of a WILSON inversion is wrong when switching from my test version for JSC to the current HEAD commits of the devel branches.

=== qphix === kostrzewa/qphix@ccc4542613ae0879703b67c3633bc76d2b0b51f0 of https://github.com/kostrzewa/qphix/tree/juelich_qphix-tmf to JeffersonLab/qphix@6811b8b306453e04a319d565bd1108a37d7d2617 of https://github.com/JeffersonLab/qphix/tree/devel

=== tmLQCD === f3e1b38d012335491567783ff286c87d84826bee of https://github.com/kostrzewa/tmLQCD/tree/juelich_qphix_devel to 3f45be5842de0a6347dde7cdd532096397121c01 of https://github.com/kostrzewa/tmLQCD/tree/qphix_devel
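For readers unfamiliar with the check being discussed: the "final residual" of an inversion can be recomputed independently of the solver as |b - A x| / |b|. The following is only a minimal sketch of that idea, assuming a small dense matrix in place of the Wilson operator. The names `apply_operator` and `relative_residual` are hypothetical and are not tmLQCD or QPhiX functions.

```python
import math

def apply_operator(matrix, vector):
    """Apply a dense matrix (list of rows) to a vector; stands in for the Dirac operator."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def norm(vector):
    """Euclidean norm; abs() keeps this valid for complex entries too."""
    return math.sqrt(sum(abs(v) ** 2 for v in vector))

def relative_residual(matrix, solution, source):
    """Return |b - A x| / |b|, recomputed from scratch rather than taken from the solver."""
    applied = apply_operator(matrix, solution)
    residual = [b - ax for b, ax in zip(source, applied)]
    return norm(residual) / norm(source)

# Toy 2x2 system whose exact solution is x = (1, 2).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]

good = relative_residual(A, [1.0, 2.0], b)
# Same numbers in the wrong order, as a stand-in for a layout or reordering bug.
bad = relative_residual(A, [2.0, 1.0], b)
```

A solver may report convergence in its own internal layout while the vector handed back to the caller is wrong, so recomputing the residual on the caller's side is the check that catches this class of problem.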

kostrzewa commented 7 years ago

occurs already with avx512 on the juelich_qphix-tmf branch

we will need some automated test to figure out why
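One possible shape for such an automated test is to compare an optimised kernel against a plain reference implementation on random input fields. The sketch below is a toy under stated assumptions: a periodic 1D nearest-neighbour operator stands in for Dslash, and `hop_reference`, `hop_candidate` and `kernels_agree` are hypothetical names, not part of tmLQCD or QPhiX.

```python
import random

def hop_reference(field):
    """Periodic nearest-neighbour sum on a 1D lattice, written site by site."""
    n = len(field)
    return [field[(i + 1) % n] + field[(i - 1) % n] for i in range(n)]

def hop_candidate(field):
    """Same operator via shifted copies; plays the role of the optimised kernel."""
    forward = field[1:] + field[:1]      # forward[i] == field[i + 1], periodic
    backward = field[-1:] + field[:-1]   # backward[i] == field[i - 1], periodic
    return [f + b for f, b in zip(forward, backward)]

def max_abs_difference(a, b):
    """Largest element-wise deviation between two fields of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))

def kernels_agree(n_sites=16, trials=20, tolerance=1e-12, seed=1234):
    """Return True if both implementations agree on several random fields."""
    rng = random.Random(seed)
    for _ in range(trials):
        field = [rng.uniform(-1.0, 1.0) for _ in range(n_sites)]
        if max_abs_difference(hop_reference(field), hop_candidate(field)) > tolerance:
            return False
    return True
```

Running the same comparison for every build variant (for example AVX2 and AVX512) would show whether a failure is already present at the level of the operator or only appears in the inversion.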

martin-ueding commented 7 years ago

Does Wilson clover work for you? If that works and Wilson stopped working, then it might be caused by the change in default checkerboard, see https://github.com/JeffersonLab/qphix/issues/32.

Otherwise looking through the changes with git diff -w ccc4542613ae0879703b67c3633bc76d2b0b51f0..6811b8b306453e04a319d565bd1108a37d7d2617 include/qphix/*.h does not show anything relevant to the Wilson case. So the problem might be within the kernels.

I am currently compiling QPhiX devel on Marconi A2 and will run the tests against QDP. If the kernels have a problem, this should surface there. Otherwise I do not see how the tests within QPhiX against QDP could succeed but the tests within tmLQCD against QPhiX fail all of a sudden.

kostrzewa commented 7 years ago

I think it might have to do with compilation options, in particular the cross-compilation which is necessary to compile the AVX512 code on the Marconi login node. To be more specific: to compile the test programs in ./configure

kostrzewa commented 7 years ago

I can't test Wilson clover, since we don't have the packers yet...

kostrzewa commented 7 years ago

> Does Wilson clover work for you? If that works and Wilson stopped working, then it might be caused by the change in default checkerboard, see JeffersonLab/qphix#32.

As for this, I already have problems when I compile https://github.com/kostrzewa/tmLQCD/tree/juelich_qphix_devel with AVX512. It might actually be a problem on the tmLQCD side (alignment, perhaps).

martin-ueding commented 7 years ago

But what has changed, then? QPhiX should not have changed in that regard. And also I am trying to figure out what I broke between those commits.

kostrzewa commented 7 years ago

I never tested AVX512, just AVX2 and that still works with the kernels in the juelich version, but not with qphix/devel. AVX512 gives inconsistent results for both. We'll see, probably something silly again.

martin-ueding commented 7 years ago

We are talking about the tests that use tmLQCD+QPhiX, right? I will try to run the AVX512 tests on Marconi A2 once I get something other than Bus Error when trying to run git log or git fetch 🙄. I notified the support team, hopefully that is resolved on Monday.

martin-ueding commented 7 years ago

All those tests work just fine on JURECA with https://github.com/JeffersonLab/qphix/commit/b135600264be7de27376fe2746c93c661c0e9788. Do the tests in tests and tests-gtest work for you? If so, then the QPhiX tests are not strict enough, irrelevant, or the issue is somewhere within the tmLQCD interface.

kostrzewa commented 7 years ago

No, I'm talking about doing an inversion and getting a wrong residual at the end.

kostrzewa commented 7 years ago

Regarding bus error, Cineca don't have any staff on weekends, so the machine "is in full production" although the file-system is down...

martin-ueding commented 7 years ago

Is there something that I can do about this issue on the QPhiX side?

kostrzewa commented 7 years ago

Okay, I couldn't sleep so I've pushed a mildly more streamlined version of the qphix interface to etmc/tmLQCD:qphix_devel

I've done tests using test_Dslash [1], which seems to work fine for me on my laptop. Since the same cannot be said for inversions, the issue must be somewhere in preparation or reconstruction, we'll see.

[1] This now also sets the QPhiX parameters from the input file, rather than having them hard-coded.

From now on, pull-requests should be directed to etmc/tmLQCD:qphix_devel, such that we can review them centrally. I will also make my own modifications that way, of course.

kostrzewa commented 7 years ago

The other thing that could be to blame is some inconsistency in the checkerboarding which has no effect when Dslash is applied but which surfaces during inversions.
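To illustrate what a checkerboarding inconsistency means in practice, here is a minimal sketch that splits a full lattice field into even and odd halves and puts it back together. It is a toy under stated assumptions and does not reproduce the actual QPhiX or tmLQCD layouts. The `first_parity` argument is a hypothetical knob for which parity is stored in slot 0.

```python
from itertools import product

def parity(site):
    """Checkerboard index of a lattice site: 0 for an even coordinate sum, 1 for odd."""
    return sum(site) % 2

def sites(dims):
    """All lattice sites in a fixed lexicographic order."""
    return product(*(range(d) for d in dims))

def pack(field, dims, first_parity=0):
    """Split a full field (dict site -> value) into two checkerboards."""
    halves = ([], [])
    for site in sites(dims):
        slot = (parity(site) - first_parity) % 2
        halves[slot].append(field[site])
    return halves

def unpack(halves, dims, first_parity=0):
    """Rebuild the full field from two checkerboards."""
    counters = [0, 0]
    field = {}
    for site in sites(dims):
        slot = (parity(site) - first_parity) % 2
        field[site] = halves[slot][counters[slot]]
        counters[slot] += 1
    return field

dims = (2, 2, 2, 2)
full = {site: float(i) for i, site in enumerate(sites(dims))}

same = unpack(pack(full, dims, 0), dims, 0)   # consistent convention: round trip
mixed = unpack(pack(full, dims, 0), dims, 1)  # mismatched convention: halves swapped
```

With a consistent convention the round trip reproduces the input, whereas a mismatch between packing and unpacking silently exchanges the even and odd halves without raising any error.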

kostrzewa commented 7 years ago

The residual problem seems to have been resolved in #378, but I still need to do AVX512 checks on KNL. The build for #378 is still having trouble when QPhiX is not available, we'll see.

kostrzewa commented 7 years ago

Okay, so the regression is fixed and AVX2 works on KNL too now. The problem was the checkerboarding in qphix_devel, which is fixed in etmc/qphix_devel.

AVX512 still does not work, but at least I can confirm that test_Dslash fails too, so it's not some obscure problem in the inversion. The AVX2 test_Dslash works fine on KNL.

martin-ueding commented 7 years ago

That sounds good, fixing KNL should then be somewhat less daunting.

kostrzewa commented 7 years ago

We might have something to go on now. AVX512 on KNL seems to work using GCC. It is extremely slow because I didn't appropriately bind threads to hyperthreads, but the inversion checks out.

I will try ICC16 next... let's hope we just have a problem with the compiler at Cineca...

kostrzewa commented 7 years ago

It appears that ICC16 also works, but with the same problem of being extremely slow for whatever reason. Of course, slowness and correctness could be related...

kostrzewa commented 7 years ago

I think we really need the QPhiX tests, but I also think that getting all kernels working can run in parallel.

martin-ueding commented 7 years ago

Bálint's issue of failing QDP++ tests on ICC 17 is probably the same thing, then? The QPhiX tests that compare to the QDP++ implementations then will likely fail because the compiler breaks QDP++ and QPhiX implementations?

From a different angle: If we get everything to work with GCC on AVX512, we only have to wait for ICC to be fixed (or we discover some magic compile flag) and have a working version on KNL?

> I think we really need the QPhiX tests, but I also think that getting all kernels working can run in parallel.

I am not sure what you mean by that. Do you mean that I should get all test cases in QPhiX up and running?

kostrzewa commented 7 years ago

> Bálint's issue of failing QDP++ tests on ICC 17 is probably the same thing, then? The QPhiX tests that compare to the QDP++ implementations then will likely fail because the compiler breaks QDP++ and QPhiX implementations?

Perhaps, we should certainly run tests on Marconi to check this.

> From a different angle: If we get everything to work with GCC on AVX512, we only have to wait for ICC to be fixed (or we discover some magic compile flag) and have a working version on KNL?

Well, it will have to be a bit more active than "wait". We'll have to bug Cineca to check this and install a complete alternative compiler suite... that'll be fun.

> > I think we really need the QPhiX tests, but I also think that getting all kernels working can run in parallel.
>
> I am not sure what you mean by that. Do you mean that I should get all test cases in QPhiX up and running?

No, I meant that we can probably safely continue working on getting tm, tmclover and tm(clover) 1+1 working. It seems that we have correctness tests (using icc16 or gcc 6.1) to also check the avx512 kernels.

martin-ueding commented 7 years ago

The tests run fine (except for the Richardson solver for Wilson, might be checkerboarding) on KNL with ICC 17.0.1 20161005

This is strange because they test against QDP++ which seems to have some issues according to Bálint. And I doubt that ICC breaks QPhiX and QDP++ the exact same way.

kostrzewa commented 7 years ago

> The tests run fine (except for the Richardson solver for Wilson, might be checkerboarding) on KNL with ICC 17.0.1 20161005
>
> This is strange because they test against QDP++ which seems to have some issues according to Bálint. And I doubt that ICC breaks QPhiX and QDP++ the exact same way.

Okay, it could be a problem on the tmLQCD side then...

kostrzewa commented 7 years ago

Thanks, this is good to know.