etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Eigenvalue Computation fails on BG/Q #171

Closed by urbach 11 years ago

urbach commented 11 years ago

When testing NDPOLY or NDCLOVER on BG/Q using the master or NDTwistedClover branch, the eigenvalue solver gives clearly wrong results. For instance:

# Computing eigenvalues for heavy doublet 
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 
JDHER execution statistics
IT_OUTER=61   IT_INNER_TOT=8712   IT_INNER_AVG=  142.82

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
  ---------------------------------------
  1 -9.430094804426073e-07  6.50088e-06

which cannot be... For the maximal eigenvalue the JD solver doesn't even converge. On my local PC, on the other hand, the scalar, MPI and OpenMP versions work just fine.

Could it be that there is again some problem with OpenMP that only appears with a large number of threads? Or is it the lapack usage in jdher?

To me it seems to be a problem which only appears in jdher_bi (and maybe jdher, I didn't try) and might therefore be localised around Q_Qdagger_ND_BI (called Qtm_pm_ndbipsi in NDTwistedClover)... But it could of course also be that the complete polynomial still has a problem on BG/Q.
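One way to narrow this down further would be to apply the suspect operator to a fixed, deterministic input on both the BG/Q and the PC build and print the result to full precision, so the first diverging bit shows up immediately. Below is a minimal, self-contained sketch of the idea; apply_op is only a stand-in for the real operator (it is not a tmLQCD routine) and would be replaced by the routine under suspicion when testing inside the code:

#include <stdio.h>
#include <math.h>

#define N 16

/* stand-in for the operator under suspicion (e.g. the heavy-doublet operator);
   replace by the real routine when testing inside tmLQCD */
static void apply_op(const double *in, double *out) {
  for (int i = 0; i < N; i++)
    out[i] = 2.0 * in[i] - 0.5 * in[(i + 1) % N] - 0.5 * in[(i + N - 1) % N];
}

int main(void) {
  double v[N], w[N], nrm = 0.0;
  /* deterministic input so every build and architecture sees the same data */
  for (int i = 0; i < N; i++) v[i] = sin(0.1 * (i + 1));
  apply_op(v, w);
  for (int i = 0; i < N; i++) nrm += w[i] * w[i];
  /* %a prints the exact bit pattern, so the outputs of two builds can be diffed */
  printf("||Op v||^2 = %.17e (%a)\n", nrm, nrm);
  printf("w[0]       = %.17e (%a)\n", w[0], w[0]);
  return 0;
}

If the operator application already differs at the bit level, the problem sits below the eigensolver; if it agrees, jdher_bi or the lapack calls become the prime suspects.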

kostrzewa commented 11 years ago

First thing to try would be a pure MPI compile to see if the problem persists, excluding a threading error if it does.

urbach commented 11 years ago

yes, that's already on my list...

urbach commented 11 years ago

and also with pure MPI the smallest eigenvalue is zero and the largest doesn't converge... very strange

urbach commented 11 years ago

also on 4^4 with 16 MPI processes, no SPI and no OpenMP, the eigensolver doesn't converge. The same setup works on PCs... I am slightly at a loss...

kostrzewa commented 11 years ago

I've been trying to compile the code with a different lapack but so far I haven't succeeded... The other test is to try with gcc on BG/Q and see what happens (other than it being slow).

urbach commented 11 years ago

The problem is that right now I don't even know what the problem is... It could be lapack, of course, but that's hard to verify. Did you by chance try with icc instead of gcc on Intel? Just to understand whether or not it might be a bug in the code that by chance doesn't show up with gcc on Intel.

kostrzewa commented 11 years ago

Oh yes, I always compile with icc. This is the output a few trajectories into the simulation with ndclover with csw=1.0

JDHER execution statistics

IT_OUTER=23   IT_INNER_TOT=72   IT_INNER_AVG=    3.13

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  1.771822832359480e-02  6.80991e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=28   IT_INNER_TOT=38   IT_INNER_AVG=    1.36

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  8.323493703000371e-01  7.13061e-06

# PHMC: lowest eigenvalue end of trajectory 21 = 1.771823e-02
# PHMC: maximal eigenvalue end of trajectory 21 = 8.323494e-01

urbach commented 11 years ago

ah, very good... Are the eigenvalues the same as with gcc?

kostrzewa commented 11 years ago

Let's see:

# Computing eigenvalues for heavy doublet
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=22   IT_INNER_TOT=75   IT_INNER_AVG=    3.41

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  1.777861712783075e-02  8.70129e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=28   IT_INNER_TOT=41   IT_INNER_AVG=    1.46

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  8.320246302694541e-01  9.99792e-06

# PHMC: lowest eigenvalue end of trajectory 21 = 1.777862e-02
# PHMC: maximal eigenvalue end of trajectory 21 = 8.320246e-01

kostrzewa commented 11 years ago

I don't know whether, at this level of precision, these can be taken to be the same?

urbach commented 11 years ago

currently I'd say yes... ;-) It's the residue that is quoted, which I think cannot be translated one-to-one into the precision of the eigenvalue.
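For a Hermitian operator the standard bounds make this explicit (a generic statement, nothing specific to jdher): if lambda_hat is the computed Ritz value with unit Ritz vector x_hat, r = A x_hat - lambda_hat x_hat is the residual, lambda the closest true eigenvalue and delta the gap to the rest of the spectrum, then

  |\hat\lambda - \lambda| \;\le\; \|r\|_2 ,
  \qquad
  |\hat\lambda - \lambda| \;\le\; \frac{\|r\|_2^{2}}{\delta} .

So a residual of order 1e-6 only pins the eigenvalue down to roughly that level (better when the gap is large), and agreement to 4-5 digits between the icc and gcc runs is consistent with both being converged.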

kostrzewa commented 11 years ago

I think there's something wrong somewhere. Running with gcc and halfspinor I get a segmentation fault when the alignment is set to 32 (as it is by default through configure.in). When I "fix" this by setting the alignment to 16, I don't get a segfault, but the eigenvalues are different:

# Computing eigenvalues for heavy doublet
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=23   IT_INNER_TOT=60   IT_INNER_AVG=    2.61

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  9.371504922882037e-03  4.39667e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=29   IT_INNER_TOT=120   IT_INNER_AVG=    4.14

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.377702864427558e-01  7.74445e-06

# PHMC: lowest eigenvalue end of trajectory 21 = 9.371505e-03
# PHMC: maximal eigenvalue end of trajectory 21 = 7.377703e-01

With icc and halfspinor this is what I get, which is consistent:

# Computing eigenvalues for heavy doublet
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=23   IT_INNER_TOT=60   IT_INNER_AVG=    2.61

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  9.376967669099666e-03  4.38442e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=30   IT_INNER_TOT=89   IT_INNER_AVG=    2.97

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.377609600177162e-01  6.73200e-06

# PHMC: lowest eigenvalue end of trajectory 21 = 9.376968e-03
# PHMC: maximal eigenvalue end of trajectory 21 = 7.377610e-01

urbach commented 11 years ago

the problem on BG/Q is that orthogonality is lost at some point. At that moment I see a large correction in ModifiedGS, which is not there in the PC run that I compare with.

In general the BG/Q and the PC runs can be compared very well one by one. After about 8 iterations one sees the first differences in the 8th digit, which then more or less immediately change the solver history and the residue...
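A cheap way to pin down the iteration where this happens would be to monitor the departure of the search basis from orthonormality, i.e. max |<v_a, v_b> - delta_ab|, after every Gram-Schmidt step: it normally stays at the 1e-15 level and jumps by many orders of magnitude at the point where orthogonality is lost. A self-contained sketch of such a check (generic code, not taken from jdher):

#include <stdio.h>
#include <math.h>

/* maximum deviation of V^T V from the identity for k real vectors of length n,
   stored contiguously as v[a*n + i]; call after each orthogonalisation step */
static double ortho_error(const double *v, int n, int k) {
  double maxdev = 0.0;
  for (int a = 0; a < k; a++)
    for (int b = 0; b <= a; b++) {
      double s = 0.0;
      for (int i = 0; i < n; i++) s += v[a * n + i] * v[b * n + i];
      double dev = fabs(s - (a == b ? 1.0 : 0.0));
      if (dev > maxdev) maxdev = dev;
    }
  return maxdev;
}

int main(void) {
  /* toy example: two exactly orthonormal unit vectors in R^4 */
  double v[8] = { 1, 0, 0, 0,   0, 1, 0, 0 };
  printf("orthogonality error = %e\n", ortho_error(v, 4, 2));
  return 0;
}

In the actual solver the same quantity would be computed with the complex scalar product of the spinor fields; printing it per outer iteration on both machines should show exactly where the two runs part ways.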

kostrzewa commented 11 years ago

I will do another comparison before the first iteration, because I think there's just a difference in the gauge field, since the random numbers are different for halfspinor and fullspinor...

urbach commented 11 years ago

why are they different?

kostrzewa commented 11 years ago

Here is the first eigenvalue computation for various types of executables:

gcc halfspinor without sse:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

gcc with halfspinor and sse:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

gcc fullspinor without sse:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

gcc fullspinor with sse:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

icc with halfspinor:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

icc with fullspinor:

# PHMC: lowest eigenvalue end of trajectory 0 = 2.486690e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 8.911642e-01

So the difference was clearly from the random numbers; the computation itself is fine! (Note: this is all with OpenMP.)

Actually, now I can't even reproduce the segfault... dunno what happened there then!

kostrzewa commented 11 years ago

why are they different?

this was more of an assumption... they shouldn't be, should they? Okay, so this means I need to do some high-statistics runs with more of the different executable versions...

urbach commented 11 years ago

I think it shouldn't... Hmm, we have too many configure options, don't we!?

kostrzewa commented 11 years ago

well, I wouldn't say too many, just many :)

kostrzewa commented 11 years ago

This is a compiler or intrinsics problem, because it works with gcc on BG/Q. Next I will try to disable QPX with xlc. Btw, it also seems that SPI works just fine with gcc. (This was a hybrid compile.)

Hmm, perhaps I spoke too soon about SPI with gcc. This was not a halfspinor build... indeed, there is a cross-dependence on "--enable-qpx", right? Yes indeed. Maybe there should be two flags: one for BGQ and one for QPX?

kostrzewa commented 11 years ago

Btw: is this normal?

# NDPOLY MD Polynomial: relative squared accuracy in components:
# UP=3.067848e-06  DN=0.000000e+00 

kostrzewa commented 11 years ago

Okay, so let's see:

GCC HALFSPINOR

# Computing eigenvalues for heavy doublet
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=27   IT_INNER_TOT=1032   IT_INNER_AVG=   38.22

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  5.697110499481280e-02  3.77313e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=30   IT_INNER_TOT=141   IT_INNER_AVG=    4.70

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.883143239875542e-01  8.07932e-06

# PHMC: lowest eigenvalue end of trajectory 0 = 5.697110e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 7.883143e-01

GCC FULLSPINOR

JDHER execution statistics

IT_OUTER=27   IT_INNER_TOT=1032   IT_INNER_AVG=   38.22

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  5.697110499481280e-02  3.77313e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=30   IT_INNER_TOT=141   IT_INNER_AVG=    4.70

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.883143239875542e-01  8.07932e-06

# PHMC: lowest eigenvalue end of trajectory 0 = 5.697110e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 7.883143e-01

XLC (NO QPX, NO SPI)

JDHER execution statistics

IT_OUTER=25   IT_INNER_TOT=1077   IT_INNER_AVG=   43.08

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  5.695736187909105e-02  4.27992e-06

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=28   IT_INNER_TOT=382   IT_INNER_AVG=   13.64

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.883143193293386e-01  6.53326e-06

# PHMC: lowest eigenvalue end of trajectory 0 = 5.695736e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 7.883143e-01

XLC (QPX, NO SPI)

JDHER execution statistics

IT_OUTER=27   IT_INNER_TOT=1250   IT_INNER_AVG=   46.30

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  5.695734050566436e-02  4.12774e-07

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=29   IT_INNER_TOT=359   IT_INNER_AVG=   12.38

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.883143253778701e-01  9.78104e-06

# PHMC: lowest eigenvalue end of trajectory 0 = 5.695734e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 7.883143e-01

Btw, it's interesting to see that performance in the solver is exactly the same between XLC and GCC! (that is without qpx and spi).

So in conclusion there is something wrong with SPI!

When Stefan was in Zeuthen he mentioned that it is very important to avoid race conditions when using SPI. In particular, he mentioned having to poll the sending of data as well as the reception because the descriptors are not independent.

kostrzewa commented 11 years ago

Hmm.. okay, now this is weird...:

XLC (QPX, SPI)

Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=27   IT_INNER_TOT=1250   IT_INNER_AVG=   46.30

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  5.695734050566436e-02  4.12774e-07

Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 

JDHER execution statistics

IT_OUTER=29   IT_INNER_TOT=359   IT_INNER_AVG=   12.38

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
---------------------------------------
  1  7.883143253778701e-01  9.78104e-06

# PHMC: lowest eigenvalue end of trajectory 0 = 5.695734e-02
# PHMC: maximal eigenvalue end of trajectory 0 = 7.883143e-01

works fine?

kostrzewa commented 11 years ago

I will try again with NDCLOVER to confirm that it also works there now.

kostrzewa commented 11 years ago

Hmm.. did you try this on Fermi? The results above are from JuQueen... also, the failure that I reported before was on Fermi!

kostrzewa commented 11 years ago

Trying again with pure MPI (no SPI!) on Fermi with a40.24, the eigensolver converges. I will now try a few more permutations and then go home :) Okay, the filesystem at CINECA just crashed... I will try again tomorrow; sorry about the inconclusive end to this...

urbach commented 11 years ago

I was trying all the time on Fermi, with and without SPI, and it never converged. (It pretended to be converged, but the eigenvalues are wrong...)

kostrzewa commented 11 years ago

This was also without qpx, fyi. The eigenvalues were correct (i.e. the same as above).

actually no, please ignore that comment... too many things going on at the same time... I'm not a multitasking operating system...

On Fermi I see the same thing you do, in that with MPI the "convergence" is faster (compared to hybrid, spi, qpx...), but the eigenvalue is two orders of magnitude too small.

To be specific this is the small eigenvalue:

1.178051635490256e-04  1.35996e-06

urbach commented 11 years ago

that is not so small... Might be correct?!

kostrzewa commented 11 years ago

I really don't think so, it's the same run as above (a40.24) just on a different machine. Also, the computation of the maximal eigenvalue stalled for a long time until I aborted the job.

urbach commented 11 years ago

okay, good! I've lost the overview a bit, can you briefly summarise what your findings are?

kostrzewa commented 11 years ago

Sure, there are many posts! On JuQueen with GCC or XLC and all permutations of options (spi, mpi, qpx, halfspinor, fullspinor - all hybrid!) the eigensolver converges and produces the same first eigenvalue. The first trajectory takes almost 20 minutes, so I can't comment on how the eigenvalues develop. On JuQueen I only tested a40.24, starting from a thermalized configuration (as that was the only one I had already set up)!

On Fermi I only tested with XLC (mpi, hybrid, halfspinor), and here the eigensolver for the maximal eigenvalue converges very badly (or does not converge; I aborted the job after waiting for 10 minutes) and produces a different eigenvalue for a40.24. This also applies to sample-hmc2 (tested hybrid only). I have not tested with GCC here, so I can't say whether it will work or not (and hence confirm that it is a compiler bug; it could also be a subtle network bug). For the small-volume test with hmc2 I get negative eigenvalues, as you saw with ndclover.

On Intel (and this is the only place where I computed O(20) eigenvalues) the initial eigenvalues (before iteration 1) are the same for a large set of option permutations (hybrid, mpi, serial, gcc, icc, sse, nosse, halfspinor, fullspinor, etc.). However, the eigenvalues diverge between halfspinor and fullspinor as the simulation progresses. I'm at a loss as to why exactly this is.

One further comment: I think SPI works with gcc (there are no linking errors), so it would be sensible (for testing purposes, really) to decouple the QPX and SPI configure options completely. The main reason I think this would be useful is to be able to exclude interactions between the compiler and SPI when analyzing bugs in the future.

I'm in favour of introducing another configure option "--enable-bgq" which fixes the architecture, or alternatively of figuring out whether it is possible to autodetect the architecture as on BG/L and BG/P. This is for the reason above, but also to ensure the correct optimization options. My current solution relies on "BGQ" (set via --enable-qpx) being defined to select the correct OpenMP and XLC options, which means that by default a compile without QPX does not get the options intended.
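Just to illustrate the intent (the macro names below are placeholders, not what configure currently defines): the architecture, the QPX kernels and the SPI communication would each hang on their own symbol, so any combination can be built and tested.

#include <stdio.h>

/* placeholder feature macros, illustration only:
     BGQ : Blue Gene/Q architecture (alignment, OpenMP/XLC options)
     QPX : use QPX intrinsics in the kernels
     SPI : use SPI instead of MPI for the boundary exchange */
int main(void) {
#ifdef BGQ
  puts("arch:    BG/Q");
#else
  puts("arch:    generic");
#endif
#ifdef QPX
  puts("kernels: QPX intrinsics");
#else
  puts("kernels: plain C");
#endif
#ifdef SPI
  puts("comms:   SPI");
#else
  puts("comms:   MPI");
#endif
  return 0;
}

Compiled with -DBGQ -DSPI but without -DQPX, this would correspond to the gcc-plus-SPI combination that is awkward to build at the moment.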

urbach commented 11 years ago

thanks Bartek!

okay, so next I'll try to run an executable from JuQueen on Fermi to see whether this fixes the problem.

concerning decoupling SPI and QPX: I thought that it was already decoupled (at least this was my intention). So, where is that coupling?

urbach commented 11 years ago

okay, JuQueen is down today, so I cannot test... My suspicion is that it is a lapack/blas/essl problem, but...

Concerning the difference in eigenvalues between the halfspinor and fullspinor versions: do the plaquette values also differ?

urbach commented 11 years ago

ah, maybe I know what's going on... Maybe it's linking against the frontend essl and not against the compute-node essl version...

kostrzewa commented 11 years ago

I need to leave soon and won't be available until Frankfurt (well, maybe tonight for a few minutes and on Wednesday the 23rd)... But here's a summary of my latest high-statistics runs (now comparing fullspinor and halfspinor too, in addition to different parallelizations):

With openmp a difference between plaquettes develops after ~10 trajectories
With hybrid a difference develops after ~10 trajectories
With serial no difference develops, even after thousands of trajectories
With mpi no difference develops, even after thousands of trajectories

However: Between NDPOLY and NDCLOVER without csw, the plaquettes are consistent. The expectation values are also consistent between parallelizations.

The plaquettes for TMCLOVERDET with csw=1.0, 2kappamu=0.1105 and kappa=0.17 are consistent with NDCLOVER without splitting but with csw=1.0.

I'm not sure where the differences between halfspinor and fullspinor come from with OpenMP. They are mild and the expectation values seem consistent, but I will only be able to tell once I have time to print out the 50+ plots and compare them carefully... In the last preview plot that I made, the openmp expectation values did look a bit odd... Maybe some new bug snuck in?
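One mundane mechanism that can produce this kind of drift (not an argument that nothing is broken, just something to keep in mind): floating-point addition is not associative, so a threaded reduction that sums the plaquette contributions in a different order than the serial loop changes the result in the last bits, and the HMC then amplifies that into visibly different trajectories. Whether that fully explains the OpenMP/hybrid cases here, or whether there is a genuine threading bug, is exactly the open question. A minimal illustration:

#include <stdio.h>
#include <math.h>

/* compile with -fopenmp: the threaded reduction sums the same terms in a
   different order than the serial loop, which usually changes the last bits */
int main(void) {
  const int n = 1000000;
  double serial = 0.0, threaded = 0.0;

  for (int i = 0; i < n; i++) serial += sin(0.001 * i);

#pragma omp parallel for reduction(+:threaded)
  for (int i = 0; i < n; i++) threaded += sin(0.001 * i);

  printf("serial   = %.17e\n", serial);
  printf("threaded = %.17e\n", threaded);
  printf("diff     = %.3e\n", threaded - serial);
  return 0;
}

If the expectation values agree within statistics, such bit-level drift is harmless; if they don't, as the odd-looking OpenMP plot might suggest, something real is going on.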

urbach commented 11 years ago

so, it's not the frontend library. So I'm going to try the lapack from JuQueen...

kostrzewa commented 11 years ago

Oh, more importantly: I just noticed that the current NDPOLY produces different results from 5.1.6 and our previous versions (which were mutually consistent). (This is for the hmc2 sample file.)

See for instance this comparison between serial (fullspinor) runs, old on the left:

0.534013791141 0.024662571356 9.756391e-01 77 667 1 8.800000e | 00003162 0.445523400850 0.616917036587 5.396055e-01 107 901 1

The plaquette is completely different and the number of iterations in the solver is huge. (this might be why you felt that the polynomial is so slow..)

urbach commented 11 years ago

well, the random numbers have changed, so it's not clear where the difference comes from. I always made my changes by explicitly checking after every change that I still get the same result...

kostrzewa commented 11 years ago

But that kind of difference should not be due to random numbers, otherwise where's the predictability? This is a 10% difference in the plaquette at a thermalized point.

kostrzewa commented 11 years ago

While

Oh, more importantly. I just noticed that the current NDPOLY produces different results from 5.1.6 and our previous versions (which were mutually consistent). (this is for the hmc2 sample file)

also

The plaquettes for TMCLOVERDET with csw=1.0, 2kappamu=0.1105 and kappa=0.17 are consistent with NDCLOVER without splitting but with csw=1.0.

I can only explain this dichotomy to myself by concluding that something broke in NDPOLY while everything else is fine. Also, there is some mild problem with OpenMP that needs to be investigated further.

urbach commented 11 years ago

well, the difference after one trajectory from a random start is not really predictable, is it?

kostrzewa commented 11 years ago

that's after 3162 trajectories; it should be in the same ballpark

urbach commented 11 years ago

Ah, sorry... that shouldn't be the case!

Maybe because the input file changed? You now need to specify 2KappaMuBar, 2KappaEpsBar and kappa in the monomial, and it's no longer enough to do it only in the global section (I have to change the sample input file!)

kostrzewa commented 11 years ago

The acceptance is also off for hmc2: over 100000 trajectories only 74% acceptance (it used to be 87.16%).

kostrzewa commented 11 years ago

Hmm, I added LocNormConst but I didn't in fact add epsbar and mubar. I will add them and try again. Perhaps read_input should bail out on missing values there as well... (Results at the earliest on Friday evening because I'm going away... and I can't guarantee that.)

urbach commented 11 years ago

yes, I need to fix the sample input file...

urbach commented 11 years ago

sorry for this! my fault!

urbach commented 11 years ago

I've updated sample-hmc2.input

urbach commented 11 years ago

please also add kappa!