Closed: kostrzewa closed this issue 12 years ago
Actually this is probably not true! But please correct me if I'm wrong.
well, the eigenvalue computation is fully parallelised. There is a part that concerns only very small matrices, and these problems are solved identically on every MPI process. But of course one could take advantage of threading!?
I've investigated this a bit more and I can see that as soon as MPI is used, the eigenvalue computation becomes extremely slow. I'm currently running a scalasca run on this to see what is going on. The computation is very fast when run serially or using OpenMP (this works since all of the underlying functions are OpenMP parallelised).
Example: 8 mpi processes on a 4^4 lattice:
# Starting trajectory no 2
# Trajectory is not accepted.
# Writing gauge field to .conf.tmp.
# Write completed successfully.
# Renaming .conf.tmp to conf.save.
Number of minimal eigenvalues to compute = 1
Using Jacobi-Davidson method!
Number of maximal eigenvalues to compute = 1
Using Jacobi-Davidson method!
# PHMC: time/s for eigenvalue computation 9.569840e+01
95 seconds for computing eigenvalues?!
well, on a 4^4 lattice I can imagine that there is no big speedup. But as soon as the main time is spent in Dirac operator times vector operations, one should observe almost the same speedup as for the Dirac operator itself. On a small lattice other methods might outperform JD, since JD has a part that is not parallelised (and not so easily parallelisable over many CPUs, but easily with OpenMP over different threads).
It also has overhead, which might slow it down. I don't know about the 95 secs, though...
It seems as though it is a regression from c99 complex or one of the other changes to etmc/master since 5.1.6. I just checked with 5.1.6 and the computation there is very fast. My first test with scalasca seems to indicate that something is happening in bicgstab_bi which should not be happening: 80% of total runtime is spent in communication from there. I will get back with more details in an hour or so.
Ok, I hope I can describe this sufficiently clearly without some sort of screensharing.
What I see is that jdher_bi seems to have trouble converging in the current etmc/master branch. Both runs call jdher_bi 40 times in a run with 10 trajectories. However, 5.1.6 calls bicgstab_complex_bi ~ 1e3 times while etmc/master calls it ~ 4e4 times, easily explaining the runtime factor. (this is also true for all the other functions called directly from within jdher_bi)
Now the remaining question is why we have this convergence problem. I compiled both executables in exactly the same way and ran 4^4 lattices with 4 MPI processes in 2-dim parallelisation.
The other weird thing is that in serial and OpenMP runs I don't witness this problem. I will investigate further and will get back here.
Indeed, in serial runs there is no problem and all the subfunctions of jdher_bi are called approximately the same number of times. Also, the computation takes roughly the same number of seconds (with the current branch a tiny bit slower).
I've found one possible regression in that update_backward_gauge was removed from jdher_bi for some reason. However, this does not seem to affect the actual functioning (as update_backward_gauge is also called in other places, I presume).
However, let's take a look at phmc.data. Here's phmc.data for a 10 trajectory run computing eigenvalues every 2 trajectories.
ETMC/MASTER:
2 0.376519965572 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
4 0.440668511914 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
6 0.445760991942 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
8 0.457479606007 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
10 0.466260671791 6.92661e-310 6.92661e-310 1.35770e-02 3.09694e+00
5.1.6:
2 0.376519965572 1.18095e-02 8.72607e-01 1.35770e-02 3.09694e+00
4 0.440668511914 9.13598e-03 8.53129e-01 1.35770e-02 3.09694e+00
6 0.445760991942 9.61807e-03 8.52335e-01 1.35770e-02 3.09694e+00
8 0.457479606003 8.18784e-03 8.47115e-01 1.35770e-02 3.09694e+00
10 0.466260671775 6.26787e-03 8.50065e-01 1.35770e-02 3.09694e+00
Aha! If I remember correctly the second and third number are the smallest and largest eigenvalue respectively, yes? This explains then why convergence is so slow, but not why the numbers are the way they are!
Comparing the output.data for these two runs, they are effectively the same (with respect to plaquette and deltaH as well as the rectangle part). Could there be an issue in how the Fortran function is called? Hmm, but then the serial version would also be having trouble!
hmm, e-310 is a bit too small, I suppose... isn't there a lot of complex arithmetic needed in jdher? maybe a problem there? is the dimension of the arrays still computed correctly?
update_backward_gauge is called in the Dirac operator when needed.
well, yes
the serial code is fine so it can't be related to c99[1], there must be something more subtle going on such as the array dimension or so...
[1] and I checked the *_bi scalar products and norms, they are fine
I looked at the debugging output in jdher_bi (print_status) and when MPI is enabled it simply doesn't converge. Sometimes the residual even blows up.
So far I've managed to trace the problem to somewhere in the exchange routines. When doing an MPI run with just one MPI process, the algorithm converges fine. Further, it seems to be independent of which field exchange routine is used, as both the halfspinor and the full spinor version are affected, unless the bug is in both and subtle enough not to manifest in any other spinor exchange.
Given that we've been seeing some weird behaviour (null pointers etc.), maybe we have introduced some general problem in the exchange routines? Of course, this would have to be quite subtle to account for the fact that so many things work well.
Serial runs of 5.1.6 and HEAD find the same maximal and minimal eigenvalues after one trajectory. This is also true for the MPI run of 5.1.6. Once MPI is on with more than one process in HEAD, the eigenvalues are different. However, and this is a trend, the maximal and minimal eigenvalues are the same!
Also, the eigenvalues in my current run are not of order e-310. I did an MPI run with 4 trajectories, computing eigenvalues at every trajectory and this results in the following output.data: http://www-zeuthen.desy.de/~kostrzew/quad.output
I'm starting to worry that we broke something about the polynomial itself when MPI is on, but sufficiently subtly for the plaquette evolution to be compatible with what it was like. Help?!
I haven't done any tracing of my own, so perhaps this remark will just show off my ignorance. But are we sure the raw MPI send/receive pairs are working properly within the C99 setup? I think we're now sending MPI_Complex types, but perhaps there is some subtle issue there?
I don't know actually. The following are just some notes to keep track of my tests:
For sample-hmc[0,1 and 4] 100 iterations produce compatible plaquette histories with rounding differences creeping in after about 50 iterations. The initial plaquette value is the same. Convergence times and iteration times are also compatible. This is not true for hmc[2 and 3] where differences are apparent from the 30th and second iteration respectively.
to be continued tomorrow
Forget what I said about the initial plaquette measurement. Both MPI runs agree there. I confused some output (grrr...) However, there is something weirder going on. (see above)
Some preliminary results of high statistics runs are here: http://www-zeuthen.desy.de/~kostrzew/expvals.pdf
I will add the same runs (except for the openmp/hybrid ones, of course) with 5.1.6. The MPI and serial runs were done with etmc/master, the OpenMP/hybrid runs with my omp-final branch.
I will keep that PDF updated as my jobs finish. Note that some points do not have full statistics, I will fix this.
"noreduct" means "hybrid without openmp reductions"
Ok, I think this can actually be put to rest as I've looked at it in detail now and it is once again a case of me being silly. Weirdly enough though, with 5.1.6 the eigenvalue computation converges even though the CG doesn't when MPI is enabled and the polynomial is not good. However, I now have a nice setup for running long statistics runs and a nice set of reference data.
I'm trying to compute the correct polynomial now, following the instructions in the 2+1+1 howto but I am consistently failing... I will need to do this "for real" soon anyway so I might just as well learn it now and get some good reference results out of it to boot.
No, in fact something is still fishy and this brings us back to why I noticed this to begin with. (phew... i thought I had wasted a lot of time there...)
In 5.1.6 with the .dat's in the sample-input directory the eigenvalues are quite stable with respect to the local volume. One can run sample-hmc2.input with 1, 2 or 4 MPI processes without recomputing the polynomial. The NDPOLY force stays in the region of about 2e0. The min and max eigenvalues computed are reasonable around 0.01 and 0.9 and the computation converges quickly (ie, faster than serial, as expected).
Doing the same thing with etmc/master the molecular dynamics works similarly with 1, 2 or 4 MPI processes but the eigenvalue computation fails because JD does not converge. Despite this the trajectories are accepted (albeit very slowly) and the evolution progresses as usual. I'm doing a high stat run with 4 MPI processes to see what's going on with both 5.1.6 and etmc/master (disabling the eigenvalue computation in the process).
I find this highly unusual.
Hmm, after some searching, I don't understand what's going on. Actually, right now I am doubting that the original 5.1.6 code is correct in eigenvalues_bi...
okay, should work, now I remember...
Did you try before and after c99complex!?
I would guess it is 'Q_Qdagger_ND_BI' in 'Nondegenerate_Matrix.c', which is only used for the eigenvalue computation...
but that function didn't change at all...
This can be closed now.
When MPI is enabled the eigenvalue computation does not converge. Why is that?