etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

eigenvalue computation fails to converge with MPI #110

Closed: kostrzewa closed this issue 12 years ago

kostrzewa commented 12 years ago

When MPI is enabled the eigenvalue computation does not converge. Why is that?

kostrzewa commented 12 years ago

Actually this is probably not true! But please correct me if I'm wrong.

urbach commented 12 years ago

well, the eigenvalue computation is fully parallelised. There is a part that concerns only very small matrices, and these problems are solved identically on every MPI process. But of course one could take advantage of threading!?
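
For illustration, a minimal sketch of the structure described here (hypothetical names, not the actual jdher_bi code): the volume-sized operations are distributed over MPI, while the small projected problem is repeated identically on every rank, and that redundant part is where OpenMP threads could help.

    #include <complex.h>
    #include <mpi.h>

    #define JMAX 32   /* hypothetical dimension of the small projected problem */

    /* MPI-distributed part: dot product over the local lattice volume,
     * reduced over all processes (a double _Complex is passed to MPI as
     * two contiguous doubles). */
    static double _Complex global_dot(const double _Complex *a,
                                      const double _Complex *b, int n_local)
    {
      double _Complex loc = 0.0, glob = 0.0;
      for (int i = 0; i < n_local; i++)
        loc += conj(a[i]) * b[i];
      MPI_Allreduce(&loc, &glob, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return glob;
    }

    /* Redundant part: every rank transforms the same small JMAX x JMAX matrix.
     * No communication happens here, so OpenMP threading is safe and cheap. */
    static void small_dense_update(double _Complex M[JMAX][JMAX],
                                   double _Complex Q[JMAX][JMAX])
    {
      double _Complex T[JMAX][JMAX];
    #pragma omp parallel for collapse(2)
      for (int i = 0; i < JMAX; i++)
        for (int j = 0; j < JMAX; j++) {
          T[i][j] = 0.0;
          for (int k = 0; k < JMAX; k++)
            T[i][j] += M[i][k] * Q[k][j];
        }
      for (int i = 0; i < JMAX; i++)
        for (int j = 0; j < JMAX; j++)
          M[i][j] = T[i][j];
    }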

kostrzewa commented 12 years ago

I've investigated this a bit more and I can see that as soon as MPI is used, the eigenvalue computation becomes extremely slow. I'm currently running a Scalasca run on this to see what is going on. The computation is very fast when run serially or using OpenMP (since all of the underlying functions are OpenMP parallelised, this works).

kostrzewa commented 12 years ago

Example: 8 MPI processes on a 4^4 lattice:

# Starting trajectory no 2
# Trajectory is not accepted.
# Writing gauge field to .conf.tmp.
# Write completed successfully.
# Renaming .conf.tmp to conf.save.
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method! 
Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method! 
# PHMC: time/s for eigenvalue computation 9.569840e+01

95 seconds for computing eigenvalues?!

urbach commented 12 years ago

well, on a 4^4 lattice I can imagine that there is no big speedup. But as soon as the main time is spent in Dirac operator times vector operations, one should observe almost the same speedup as for the Dirac operator itself. On a small lattice other methods might outperform JD, since JD has a part that is not parallelised (and not so easily parallelisable over many CPUs, but easily with OpenMP over different threads).

It also has overhead, which might slow it down. I don't know about the 95 secs, though...

kostrzewa commented 12 years ago

It seems as though it is a regression from C99 complex or one of the other changes to etmc/master since 5.1.6. I just checked with 5.1.6 and the computation there is very fast. My first test with Scalasca seems to indicate that something is happening in bicgstab_bi which should not be happening: 80% of total runtime is spent in communication from there. I will get back with more details in an hour or so.

kostrzewa commented 12 years ago

Ok, I hope I can describe this sufficiently clearly without some sort of screensharing.

What I see is that jdher_bi seems to have trouble converging in the current etmc/master branch. Both runs call jdher_bi 40 times in a run with 10 trajectories. However, 5.1.6 calls bicgstab_complex_bi ~1e3 times while etmc/master calls it ~4e4 times, easily explaining the runtime factor. (This is also true for all the other functions called directly from within jdher_bi.)

Now the remaining question is why we have this convergence problem. I compiled both executables in exactly the same way and ran 4^4 lattices with 4 MPI processes in 2-dimensional parallelisation.
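
Purely as a cross-check of the Scalasca numbers, a per-process call counter around the solver call site would already show the factor. A hypothetical sketch (only the name bicgstab_complex_bi comes from the discussion above; the counter and helper are made up):

    #include <stdio.h>

    /* incremented at the top of (or just before) each bicgstab_complex_bi call */
    static unsigned long bicgstab_bi_calls = 0;

    static inline void count_bicgstab_bi_call(void)
    {
      bicgstab_bi_calls++;
    }

    /* printed e.g. once per trajectory, on rank 0 only */
    static void report_bicgstab_bi_calls(int my_rank, int trajectory)
    {
      if (my_rank == 0)
        printf("# trajectory %d: bicgstab_complex_bi called %lu times so far\n",
               trajectory, bicgstab_bi_calls);
    }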

kostrzewa commented 12 years ago

The other weird thing is that in serial and OpenMP runs I don't witness this problem. I will investigate further and will get back here.

kostrzewa commented 12 years ago

Indeed, in serial runs there is no problem and all the subfunctions of jdher_bi are called approximately the same number of times. Also, the computation takes roughly the same number of seconds (with the current branch a tiny bit slower).

kostrzewa commented 12 years ago

I've found one possible regression in that update_backward_gauge was removed from jdher_bi for some reason. However, this does not seem to affect the actual functioning (as update_backward_gauge is also called in other places, I presume).

However, let's take a look at phmc.data. Here's phmc.data for a 10 trajectory run computing eigenvalues every 2 trajectories.

ETMC/MASTER:

2 0.376519965572 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
4 0.440668511914 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
6 0.445760991942 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
8 0.457479606007 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
10 0.466260671791 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00

5.1.6:

2 0.376519965572 1.18095e-02 8.72607e-01 1.35770e-02 3.09694e+00
4 0.440668511914 9.13598e-03 8.53129e-01 1.35770e-02 3.09694e+00
6 0.445760991942 9.61807e-03 8.52335e-01 1.35770e-02 3.09694e+00
8 0.457479606003 8.18784e-03 8.47115e-01 1.35770e-02 3.09694e+00
10 0.466260671775 6.26787e-03 8.50065e-01 1.35770e-02 3.09694e+00

Aha! If I remember correctly the second and third numbers are the smallest and largest eigenvalue respectively, yes? This then explains why convergence is so slow, but not why the numbers are the way they are!

kostrzewa commented 12 years ago

Comparing the output.data for these two runs, they are effectively the same (with respect to plaquette and deltaH as well as the rectangle part). Could there be an issue in how the Fortran function is called? Hmm, but then the serial version would also be having trouble!

urbach commented 12 years ago

hmm, e-310 is a bit too small, I suppose... isn't there a lot of complex arithmetic needed in jdher? maybe a problem there? is the dimension of the arrays still computed correctly?
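
For illustration, the kind of dimension bug hinted at here usually comes down to a factor of two kept or lost when going from explicit (re, im) double pairs to C99 complex. A hypothetical sketch (alloc_workspace and its arguments are invented, not tmLQCD code):

    #include <complex.h>
    #include <stdlib.h>

    double _Complex *alloc_workspace(int n_ev, int blocksize)
    {
      /* pre-C99 code often allocated 2 * n doubles per complex vector;
       * after the port the factor of two must live in the element type,
       * not in the element count */
      size_t n = (size_t)n_ev * blocksize;

      /* correct: n complex numbers */
      return malloc(n * sizeof(double _Complex));

      /* wrong after the port (half the needed storage):
       *   return malloc(n * sizeof(double));
       * also wrong (twice the storage, which can hide indexing bugs):
       *   return malloc(2 * n * sizeof(double _Complex));
       */
    }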

urbach commented 12 years ago

update_backward_gauge is called in the Dirac operator when needed.

kostrzewa commented 12 years ago

> update_backward_gauge is called in the Dirac operator when needed.

well, yes

> hmm, e-310 is a bit too small, I suppose... isn't there a lot of complex arithmetic needed in jdher? maybe a problem there? is the dimension of the arrays still computed correctly?

the serial code is fine, so it can't be related to c99 [1]; there must be something more subtle going on, such as the array dimensions or so...

[1] and I checked the *_bi scalar products and norms, they are fine

kostrzewa commented 12 years ago

I looked at the debugging output in jdher_bi (print_status) and when MPI is enabled it simply doesn't converge. Sometimes the residual even blows up.

kostrzewa commented 12 years ago

So far I've managed to trace the problem to somewhere in the exchange routines. When doing an MPI run with just one MPI process, the algorithm converges fine. Further, it seems to be independent of which field exchange routine is used, as both the halfspinor and the full spinor versions are affected, unless the bug is in both and subtle enough not to manifest in any other spinor exchange.

Given that we've been seeing some weird behaviour (null pointers etc.), maybe we have introduced some general problem in the exchange routines? Of course, this would have to be quite subtle to account for the fact that so many things work well.
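
One way to rule the exchange routines in or out would be a standalone coordinate-tagging check: fill every local site with a value derived from its global site index, exchange, and verify each ghost site against the value its neighbour must have written. A rough sketch along those lines (exchange_boundaries, local_to_global and the size constants are placeholders, not tmLQCD functions):

    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 256   /* interior sites owned by this rank (example value) */
    #define N_GHOST  64   /* boundary copies received from the neighbours      */

    extern void exchange_boundaries(double *field);   /* routine under test */
    extern long local_to_global(int local_index);     /* global index of a site,
                                                         also for ghost slots */

    int check_exchange(void)
    {
      double field[N_LOCAL + N_GHOST];
      int errors = 0, total = 0;

      for (int i = 0; i < N_LOCAL; i++)
        field[i] = (double)local_to_global(i);   /* unique, rank-independent tag */

      exchange_boundaries(field);

      for (int i = N_LOCAL; i < N_LOCAL + N_GHOST; i++)
        if (field[i] != (double)local_to_global(i))
          errors++;

      MPI_Allreduce(&errors, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (errors)
        printf("exchange check: %d bad ghost sites on this rank\n", errors);
      return total;   /* 0 means all ghost sites are consistent */
    }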

kostrzewa commented 12 years ago

Serial runs of 5.1.6 and HEAD find the same maximal and minimal eigenvalues after one trajectory. This is also true for the MPI run of 5.1.6. Once MPI is enabled with more than one process in HEAD, the eigenvalues are different. However, and this is a consistent trend, the maximal and minimal eigenvalues come out the same as each other!

Also, the eigenvalues in my current run are not of order e-310. I did an MPI run with 4 trajectories, computing eigenvalues at every trajectory and this results in the following output.data: http://www-zeuthen.desy.de/~kostrzew/quad.output

I'm starting to worry that we broke something about the polynomial itself when MPI is on, but sufficiently subtly that the plaquette evolution remains compatible with what it was like before. Help?!

deuzeman commented 12 years ago

I haven't done any tracing of my own, so perhaps this remark will just show off my ignorance. But are we sure the raw MPI send/receive pairs are working properly within the C99 setup? I think we're now sending MPI_Complex types, but perhaps there is some subtle issue there?
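
A minimal round trip would answer this independently of tmLQCD, assuming the buffers are plain arrays of C99 double _Complex. Since a double _Complex is layout-compatible with two doubles, sending 2*N elements of MPI_DOUBLE works with any MPI version; MPI_C_DOUBLE_COMPLEX is only guaranteed from MPI 2.2 on. A sketch (run with at least two processes):

    #include <complex.h>
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv)
    {
      double _Complex buf[N];
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
        for (int i = 0; i < N; i++)
          buf[i] = i + 0.5 * I * i;                  /* known reference values */
        MPI_Send(buf, 2 * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, 2 * N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int bad = 0;
        for (int i = 0; i < N; i++)
          if (buf[i] != i + 0.5 * I * i) bad++;
        printf("complex round trip: %d mismatches\n", bad);
      } else if (rank == 1) {
        MPI_Recv(buf, 2 * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 2 * N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
    }

If this round trip is clean, the raw send/receive of complex data itself is not the culprit and the problem sits higher up (e.g. in how the buffers are packed or sized).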

kostrzewa commented 12 years ago

I don't know actually. The following are just some notes to keep track of my tests:

serial

For sample-hmc[0, 1 and 4], 100 iterations produce compatible plaquette histories, with rounding differences creeping in after about 50 iterations. The initial plaquette value is the same. Convergence times and iteration times are also compatible. This is not true for hmc[2 and 3], where differences are apparent from the 30th and the second iteration respectively.

mpi

to be continued tomorrow

kostrzewa commented 12 years ago

Forget what I said about the initial plaquette measurement. Both MPI runs agree there. I confused some output (grrr...). However, there is something weirder going on (see above).

kostrzewa commented 12 years ago

Some preliminary results of high statistics runs are here: http://www-zeuthen.desy.de/~kostrzew/expvals.pdf

I will add the same runs (except for the OpenMP/hybrid ones, of course) with 5.1.6. The MPI and serial runs were done with etmc/master, the OpenMP/hybrid runs with my omp-final branch.

I will keep that PDF updated as my jobs finish. Note that some points do not have full statistics yet; I will fix this.

"noreduct" means "hybrid without OpenMP reductions"

kostrzewa commented 12 years ago

Ok, I think this can actually be put to rest, as I've looked at it in detail now and it is once again a case of me being silly. Weirdly enough, though, with 5.1.6 the eigenvalue computation converges when MPI is enabled and the polynomial is not good, even though the CG doesn't. However, I now have a nice setup for running long statistics runs and a nice set of reference data.

kostrzewa commented 12 years ago

I'm trying to compute the correct polynomial now, following the instructions in the 2+1+1 howto, but I am consistently failing... I will need to do this "for real" soon anyway, so I might just as well learn it now and get some good reference results out of it to boot.

kostrzewa commented 12 years ago

No, in fact something is still fishy, and this brings us back to why I noticed this to begin with. (phew... I thought I had wasted a lot of time there...)

In 5.1.6, with the .dat files in the sample-input directory, the eigenvalues are quite stable with respect to the local volume. One can run sample-hmc2.input with 1, 2 or 4 MPI processes without recomputing the polynomial. The NDPOLY force stays in the region of about 2e0. The min and max eigenvalues computed are reasonable, around 0.01 and 0.9, and the computation converges quickly (i.e., faster than serial, as expected).

Doing the same thing with etmc/master, the molecular dynamics works similarly with 1, 2 or 4 MPI processes, but the eigenvalue computation fails because JD does not converge. Despite this, the trajectories are accepted (albeit very slowly) and the evolution progresses as usual. I'm doing a high-statistics run with 4 MPI processes to see what's going on with both 5.1.6 and etmc/master (disabling the eigenvalue computation in the process).

I find this highly unusual.

urbach commented 12 years ago

Hmm, after some searching, I don't understand what's going on. Actually, right now I am doubting that the original 5.1.6 code is correct in eigenvalues_bi...

urbach commented 12 years ago

okay, should work, now I remember...

Did you try before and after c99complex!?

urbach commented 12 years ago

I would guess it is 'Q_Qdagger_ND_BI' in 'Nondegenerate_Matrix.c', which is only used for the eigenvalue computation...

urbach commented 12 years ago

but that function didn't change at all...

kostrzewa commented 12 years ago

This can be closed now.