etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Eigenvalue Computation fails on BG/Q #171

Closed urbach closed 11 years ago

urbach commented 11 years ago

When testing NDPOLY or NDCLOVER on BG/Q using the master or NDTwistedClover branch, the eigenvalue solver gives clearly wrong results. For instance:

# Computing eigenvalues for heavy doublet 
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 
JDHER execution statistics
IT_OUTER=61   IT_INNER_TOT=8712   IT_INNER_AVG=  142.82

Converged eigensolutions in order of convergence:

  I              LAMBDA(I)      RES(I)
  ---------------------------------------
  1 -9.430094804426073e-07  6.50088e-06

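which cannot be right: the smallest eigenvalue comes out negative, and its magnitude is below the quoted residual. For the maximal eigenvalue the JD solver doesn't even converge. On my local PC, on the other hand, the scalar, MPI and OpenMP builds work just fine.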
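Why a negative minimal eigenvalue is suspicious for an operator of the form Q Q^dagger can be sketched in a few lines. For any vector v, the Rayleigh quotient <v, Q Q^dagger v> / <v, v> equals ||Q^dagger v||^2 / ||v||^2, which is never negative, so every eigenvalue must be >= 0. The snippet below is only an illustration with a small random complex matrix standing in for Q; the function names and the matrix are hypothetical and have nothing to do with the actual tmLQCD routines.

```python
import random

def apply_Q_dagger(Q, v):
    """Return Q^dagger v for a square complex matrix Q stored as a list of rows."""
    n = len(Q)
    return [sum(Q[i][j].conjugate() * v[i] for i in range(n)) for j in range(n)]

def rayleigh_quotient_QQdag(Q, v):
    """Rayleigh quotient of Q Q^dagger, computed as ||Q^dagger v||^2 / ||v||^2."""
    w = apply_Q_dagger(Q, v)
    numerator = sum(abs(x) ** 2 for x in w)
    denominator = sum(abs(x) ** 2 for x in v)
    return numerator / denominator

# Toy stand-in for the operator: a random 6x6 complex matrix (assumption, not tmLQCD data).
rng = random.Random(171)
n = 6
Q = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)] for _ in range(n)]

# Sample many random vectors; the quotient bounds the spectrum and is never negative.
quotients = []
for _ in range(1000):
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    quotients.append(rayleigh_quotient_QQdag(Q, v))

min_quotient = min(quotients)
print("smallest sampled Rayleigh quotient is non-negative:", min_quotient >= 0.0)
```

A converged eigenvalue that is negative by more than rounding noise therefore points at the numerics (solver, library or compiler), not at the physics.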

Could it be that there is again some problem with OpenMP that only appears with a large number of threads? Or with the LAPACK usage in jdher?

To me it seems to be a problem which only appears in jdher_bi (and maybe jd_her, I didn't try) and might therefore be localised around Q_Qdagger_ND_BI (called Qtm_pm_ndbipsi in NDTwistedClover)... But it could of course also be that the complete polynomial still has a problem on BG/Q.

kostrzewa commented 11 years ago

Yes, I added all three. It's not your fault, I could have taken a closer look at the changes introduced and guessed that if you did so in NDCLOVER you'd have done it in NDPOLY too. I don't use the sample files from the repo anyway, I have my own "templates" for high statistics runs which are derived from the sample files and then parsed by my job generator.

At first sight it seems like everything is OK, the plaquette is in the same ballpark. We'll know for sure in a few days. I will try to compile a summary of the tests for the talk in Frankfurt.

urbach commented 11 years ago

also, copying the executable from juqueen to fermi seems to solve the problem... :( (I am not 100% sure that I took the right one, because I had to copy from JUDGE and couldn't recompile...)

kostrzewa commented 11 years ago

If you want you can try one of mine, hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/hmc_tm. I'm just not sure whether this one will have spi and qpx enabled, but it was the last test I did so maybe it will. It is definitely built from NDTwistedClover.

urbach commented 11 years ago

no permissions for your directories...

urbach commented 11 years ago

juqueen back online, and the juqueen executable works on fermi

urbach commented 11 years ago

copying /usr/local/bg_soft/lapack/3.3.0/lib/liblapack.a from juqueen to fermi and compiling with this library instead of the one installed on fermi also solves the problem. The LAPACK version is different (3.3.0 on juqueen versus 3.4.1 on fermi). Other than that I cannot judge... I'll email fermi support...

Therefore, I think it's rather unlikely that this is a problem with our code.

kostrzewa commented 11 years ago

Yes, but there's still the subtle difference between halfspinor and fullspinor with OpenMP which shouldn't be there... I don't have a good guess right now as to what is causing it though!

urbach commented 11 years ago

the problem was caused by a LAPACK library compiled with -O3. After adding -qstrict everything works now. So I close this issue...
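For readers wondering how an optimisation level alone can change numerical results: one well-known mechanism is that a compiler that is not required to preserve strict floating-point semantics may reorder arithmetic, and floating-point addition is not associative. The toy example below only demonstrates that effect in isolation with made-up numbers; it is not a claim that reassociation was the specific transformation that broke the LAPACK build discussed here.

```python
# Three doubles chosen so that the grouping of the additions matters (illustrative values).
a, b, c = 1.0e16, -1.0e16, 1.0

# Evaluation order as written in the source: the large terms cancel first.
left_to_right = (a + b) + c

# A reassociated order: c is absorbed into b by rounding before the cancellation.
reassociated = a + (b + c)

print(left_to_right, reassociated)  # prints "1.0 0.0"
```

In an iterative eigensolver, differences of this kind inside orthogonalisation or small dense eigenproblems can be enough to spoil convergence, which is consistent with a strict-semantics build behaving differently from a plain -O3 one.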