etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

reproducerandomnumbers consequences #179

Closed kostrzewa closed 11 years ago

kostrzewa commented 11 years ago

I'd like to use this issue to discuss the consequences of fixing the reproducerandomnumbers input parameter.

urbach commented 11 years ago

is there a comment missing here, or did you delete it?

kostrzewa commented 11 years ago

I deleted it because what it said was wrong. I have some good news that I will share as soon as I get to work.

urbach commented 11 years ago

I'm curious...!

kostrzewa commented 11 years ago

I started two sets of high statistics runs using the hmc2 sample file and the updated InterleavedNDTwistedClover branch yesterday. In one run I set reproducerandomnumbers to yes and in the other to no.

They are currently at between 120k and 200k trajectories and the relevant plots are here:

http://www-zeuthen.desy.de/~kostrzew/random_test/

Now not everything is roses yet, because the hybrid code crashes with an MPI problem (which I will report separately), but this is indeed looking much better. I will do an 8^4 run with the new code as well because it seems like there is no deviation anymore even though the run should still suffer from the "local volume" problem as Karl called it. Notice also the tiny errors on the plaquette compared to the reference. I hope this is not an underestimate for some silly reason.

I will also add another hmc2 run with a different seed, just to make sure!

kostrzewa commented 11 years ago

> I will do an 8^4 run with the new code as well because it seems like there is no deviation anymore even though the run should still suffer from the "local volume" problem as Karl called it.

Before wasting many core-hours of computing time I will do a hmc1 run though.

kostrzewa commented 11 years ago

Okay, with a different seed the situation for hmc2 is unchanged. However, for hmc1 the discrepancy still arises. Maybe there's another point in the code where the random number flag is used incorrectly (and this is fixed by the polynomial call)? See the plots in this second run:

http://www-zeuthen.desy.de/%7Ekostrzew/random_test_2/

urbach commented 11 years ago

for hmc1 one should see the problem after exactly one trajectory, if there is a problem. Up to rounding any parallelisation must give exactly the same plaquette value after one trajectory now.

urbach commented 11 years ago

and for hmc1 I'm not sure that there is still a problem....

kostrzewa commented 11 years ago

when reproducerandomnumbers=yes you are right (I'm running a "repro" job now for hmc1), but for the situation shown above [1] the runs were quite short two hours ago; the effect is now more visible (web not updated yet)

[1] reproducerandomnumbers=no for runs without "repro" qualifier in name

kostrzewa commented 11 years ago

Okay, I just updated the plots for random_test_2

kostrzewa commented 11 years ago

reproducerandomnumbers=yes seems to make the first 10 to 20 trajectories or so reproducible between parallelizations

urbach commented 11 years ago

okay, then rounding errors hit, right?

kostrzewa commented 11 years ago

> okay, then rounding errors hit, right?

I would assume so, yes. It's a good test nonetheless, makes it much easier to spot issues!

Now to the current situation in the code: some "random_*" functions are still called with mnl->rngrepro which is set to 0 by default, i don't know whether it is set correctly anywhere though.. a bit hard to find out, what do you think?

kostrzewa commented 11 years ago

also, index_jd calls with repro=0 hardcoded

urbach commented 11 years ago

I know, but this is never called...! Right now I don't see a proper way to update everything.

kostrzewa commented 11 years ago

it seems to me like mnl->rngrepro should always be set, as it is for the gauge monomials (in monomial/monomial.c)

kostrzewa commented 11 years ago

that still doesn't explain though why there is still a discrepancy with reproducerandomnumbers=no

urbach commented 11 years ago

I am puzzled by this hmc1 "problem"...

Must be a problem of DETRATIO, mustn't it?

kostrzewa commented 11 years ago

oh wait, I misread the conditional... it's working correctly then, sorry

kostrzewa commented 11 years ago

i can do a hmc0 run to confirm, doesn't take very long

urbach commented 11 years ago

ah, maybe I just didn't read your plots correctly: what's the situation now:

1) hmc1 has still problems? 2) repro=no has still problems?

urbach commented 11 years ago

yeah, hmc0 would be important to know...

kostrzewa commented 11 years ago

> 1) hmc1 has still problems?

correct

> 2) repro=no has still problems?

not for the hmc2 parameter set; for hmc1 there is a bias. For hmc1 with repro=yes there also seems to be a bias, but the run is too short to say for now.

urbach commented 11 years ago

but still with repro = 1 the first 10 to 20 trajectories are identical for hmc1?

kostrzewa commented 11 years ago

openmp left, 3D MPI on right

00000001 0.305967319585 -1.633352120878    |    00000001 0.305967319585 -1.633352120881
00000002 0.406734913825 -0.964550040854    |    00000002 0.406734913825 -0.964550040857
00000003 0.453413283225 0.041172976406     |    00000003 0.453413283225 0.041172976401 
00000004 0.489977234298 -0.507684198855    |    00000004 0.489977234298 -0.507684198861
00000005 0.516303270974 -0.394777057898    |    00000005 0.516303270974 -0.394777057903
00000006 0.548971168958 0.211143684605     |    00000006 0.548971168958 0.211143684603 
00000007 0.563896158605 -0.311332349235    |    00000007 0.563896158605 -0.311332349243
00000008 0.581482575507 -0.486136112204    |    00000008 0.581482575507 -0.486136112209
00000009 0.595358373600 -0.273741605512    |    00000009 0.595358373600 -0.273741605519
00000010 0.606009779709 -0.553262722763    |    00000010 0.606009779709 -0.553262722769
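
A minimal sketch of how such a side-by-side comparison could be automated. It is not part of tmLQCD: `close_enough` and `first_mismatch` are hypothetical helper names, the tolerance is an arbitrary choice, and the hard-coded numbers are simply the third-column values of the first three trajectories in the table above.

```c
#include <stddef.h>

/* Return 1 if a and b agree to within an absolute tolerance tol, else 0. */
static int close_enough(double a, double b, double tol) {
  double d = a - b;
  if (d < 0.0) d = -d;
  return d <= tol;
}

/* Index of the first entry where the two runs differ by more than tol,
 * or n if they agree everywhere. */
static size_t first_mismatch(const double *a, const double *b, size_t n,
                             double tol) {
  for (size_t i = 0; i < n; ++i) {
    if (!close_enough(a[i], b[i], tol)) return i;
  }
  return n;
}

/* Third column of trajectories 1 to 3, copied from the table above. */
static const double openmp_col3[3] = {-1.633352120878, -0.964550040854,
                                      0.041172976406};
static const double mpi3d_col3[3] = {-1.633352120881, -0.964550040857,
                                     0.041172976401};
```

With a tolerance of `1e-10` the call `first_mismatch(openmp_col3, mpi3d_col3, 3, 1e-10)` reports no mismatch, whereas a tolerance of `1e-12` already flags the first trajectory, which is the rounding-level difference visible in the last digits of the table.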

urbach commented 11 years ago

and those two differ by a bias at the end? That would be very strange...!?

kostrzewa commented 11 years ago

I don't know yet, we will see in an hour or so

urbach commented 11 years ago

so, definitely hmc1 has a problem with repro=0...?

kostrzewa commented 11 years ago

if you look at hmc1_expvals.pdf in random_test_2 it certainly looks like it used to before

kostrzewa commented 11 years ago

err, i mean hmc1_expvals!

kostrzewa commented 11 years ago

Although the ultimate test will be one of the 8^4 runs with 8 fermions I think. There are no resources now but I might be able to start one tomorrow with two or three parallelizations and both repro=[yes,no]

kostrzewa commented 11 years ago

although I don't really follow why the changes done to the polynomial initialisation should have an effect on all the other monomials (of course, the change to repro does have an effect everywhere!)

kostrzewa commented 11 years ago

Morning, I've just updated the plots with the current results. For hmc2 for both seeds everything is working well for repro=0 and repro=1. (random_test and random_test_2)

For hmc1 and hmc0 only the repro=1 situation is satisfactory. (random_test_2)

I will run a repro=1,V=8^4,Nf=8 run with and without clover term and Wilson action now to confirm that the bug (or at least its effect on the plaquette expectation value) is really gone in this situation.

urbach commented 11 years ago

concerning the issue with hmc1 or hmc0 and repro=0, did someone have a look at my code? This afternoon I should find some time to work on this again.

kostrzewa commented 11 years ago

I'm planning to take a look as well after lunch.

kostrzewa commented 11 years ago

Okay, I'm seeing very good results in the 8^4 runs with repro=1 which made me look a bit more carefully at the code.

Here's a list of remaining issues that need to be fixed to get rid of this bug:

z2_random_spinor_field needs a repro parameter
random_jacobi_field needs a repro parameter
P_M_eta.c calls random_spinor_field_lexic with repro=0 hardcoded
index_jd.c calls random_spinor_field with repro=0 hardcoded

and that really seems to be it!
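
To illustrate what threading a repro flag through such a routine could look like, here is a toy sketch. Everything in it is an assumption made for illustration: the names `toy_rng`, `z2_fill_local` and `z2_fill_global` are hypothetical, the 64-bit LCG is only a stand-in for ranlux, the lattice is one-dimensional and must divide evenly among the ranks, and the ranks are simulated by a loop rather than by MPI. It is not the tmLQCD implementation.

```c
#include <stdint.h>

/* Stand-in generator (64-bit LCG). NOT ranlux; for illustration only. */
typedef struct { uint64_t state; } toy_rng;

static void toy_seed(toy_rng *r, uint64_t seed) { r->state = seed; }

static uint32_t toy_next(toy_rng *r) {
  r->state = r->state * 6364136223846793005ULL + 1442695040888963407ULL;
  return (uint32_t)(r->state >> 32);
}

/* Z2 noise: +1 or -1, taken from the top bit of the next draw. */
static int toy_z2(toy_rng *r) { return (toy_next(r) >> 31) ? 1 : -1; }

/* Fill the local part of a 1D field of global_n sites split evenly over
 * nproc ranks (global_n must be divisible by nproc).
 * repro != 0: every rank walks the same global stream and keeps only the
 *             values belonging to its own sites.
 * repro == 0: every rank draws from its own stream seeded with seed + rank. */
static void z2_fill_local(int *local, int global_n, int nproc, int rank,
                          uint64_t seed, int repro) {
  int local_n = global_n / nproc;
  toy_rng r;
  if (repro) {
    toy_seed(&r, seed);
    for (int x = 0; x < global_n; ++x) {
      int v = toy_z2(&r);
      if (x / local_n == rank) local[x % local_n] = v;
    }
  } else {
    toy_seed(&r, seed + (uint64_t)rank);
    for (int x = 0; x < local_n; ++x) local[x] = toy_z2(&r);
  }
}

/* Assemble the global field by looping over the simulated ranks. */
static void z2_fill_global(int *global, int global_n, int nproc,
                           uint64_t seed, int repro) {
  int local_n = global_n / nproc;
  for (int rank = 0; rank < nproc; ++rank) {
    z2_fill_local(global + rank * local_n, global_n, nproc, rank, seed, repro);
  }
}
```

In this sketch, calling `z2_fill_global` with `repro` set to 1 gives the same global field for 1 rank and for 4 ranks, while with `repro` set to 0 the field depends on the decomposition. The price of the reproducible branch here is that each rank steps through the whole global stream.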

kostrzewa commented 11 years ago

Okay, after going through the code a little less carefully than I had hoped and annoying the heck out of Carsten, I think that my comment above is the conclusion. To be adjusted:

z2_random_spinor_field, random_jacobi_field, P_M_eta.c, index_jd.c

especially the first could be problematic if z2 noise is used for inversions because it seems like the repro=1 mode is necessary for the random number generator to be working correctly... Interestingly this was also the original state of the code around 2005. Only in 2006-2007 was the repro flag introduced.

What I mean to say is that perhaps it is not correct to run several ranluxes with different seeds in parallel, maybe the correlation between different chains is bigger than would be expected? Or maybe this correlation is to be expected? What do you think?

kostrzewa commented 11 years ago

B.t.w, here's a work in progress of the 8^4 run without (mpihmc4) and with (mpihmc8) the clover term and repro=1.

http://www-zeuthen.desy.de/%7Ekostrzew/random_test_8

urbach commented 11 years ago

z2_random_spinor_field is only used in invert for the mode number computation. For the other functions you are also correct, and they are also not used during the HMC. But they need to be adjusted to work properly as well.

The ranlux thing could be tested very easily. But I'd be very surprised...
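
One way such a test could be set up is to draw equally long sequences from two differently seeded generators and look at their sample correlation. The helper below is a hypothetical sketch, not code from tmLQCD or from the ranlux package: it returns the squared Pearson correlation coefficient of two arrays, and filling those arrays from the two generators is left out. For independent streams of length n one would expect a value of order 1/n.

```c
#include <stddef.h>

/* Squared Pearson correlation coefficient of x[0..n-1] and y[0..n-1].
 * Returns 0 if n is 0 or if either sequence is constant. */
static double corr_squared(const double *x, const double *y, size_t n) {
  if (n == 0) return 0.0;

  /* Sample means. */
  double mx = 0.0, my = 0.0;
  for (size_t i = 0; i < n; ++i) {
    mx += x[i];
    my += y[i];
  }
  mx /= (double)n;
  my /= (double)n;

  /* Sums of squared and mixed deviations from the means. */
  double sxx = 0.0, syy = 0.0, sxy = 0.0;
  for (size_t i = 0; i < n; ++i) {
    double dx = x[i] - mx;
    double dy = y[i] - my;
    sxx += dx * dx;
    syy += dy * dy;
    sxy += dx * dy;
  }

  if (sxx == 0.0 || syy == 0.0) return 0.0;
  return (sxy * sxy) / (sxx * syy);
}
```

The squared coefficient is used so that the sketch needs no square root; values far above 1/n for long sequences would point to a correlation between the two streams.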

urbach commented 11 years ago

What surprises me so much is that hmc0 and hmc2 both have a det monomial, and in the one case it's working, but not in the other??

kostrzewa commented 11 years ago

I agree, I was also very surprised about that. Will think about it some more tomorrow...

urbach commented 11 years ago

so maybe for the time being repro=1 should become default anyhow...

kostrzewa commented 11 years ago

I think so because even hmc2 is showing differences without it now with full statistics. (all plots updated btw)

urbach commented 11 years ago

Could you try once with changing the way the local seed is computed

loc_seed = seed + g_proc_id;

and please increase the rlxd_level to 2:

RanluxdLevel = 2

kostrzewa commented 11 years ago

just "+ g_proc_id" on purpose or some more elaborate scheme? (but not based on coords)

urbach commented 11 years ago

could also be g_cart_id, it's just to get an idea. We could also drop the nstore dependence right now...

kostrzewa commented 11 years ago

well i launch with nstore=0 anyway so i think that's fine, dropping the dependence would be problematic in my opinion because we would explicitly reproduce chains on continues or restarts
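
The trade-off in the last few comments can be made concrete with a small sketch. The helper `local_seed` and its exact formula are hypothetical and are not how tmLQCD computes the seed: the rank offset follows the `seed + g_proc_id` suggestion above, and the `nstore * nproc` term is just one assumed way to keep an nstore dependence without two (rank, nstore) pairs sharing a seed.

```c
/* Hypothetical per-process seed, not tmLQCD code.
 * seed    : global seed from the input file
 * proc_id : rank of this process, 0 <= proc_id < nproc
 * nproc   : total number of processes
 * nstore  : counter of the stored configuration the run continues from
 *
 * With nstore == 0 this reduces to seed + proc_id. For nstore > 0 every
 * continuation gets a block of nproc fresh seeds, so a restarted run does
 * not replay the random numbers of the run it continues. */
static int local_seed(int seed, int proc_id, int nproc, int nstore) {
  return seed + proc_id + nstore * nproc;
}
```

For example, with `seed = 100` and `nproc = 4`, rank 3 gets 103 for a fresh start and 111 when continuing from `nstore = 2`. Dropping the last term would give 103 in both cases, which is the reproduced chain on continues or restarts that the comment above warns about.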

kostrzewa commented 11 years ago

I can't even begin to explain how glad I am that I've invested the time to script everything from rebuilding 12 different versions of the software over making those jobscripts and submitting them to doing the analysis...

urbach commented 11 years ago

you'll get a medal!

kostrzewa commented 11 years ago

hehe, I don't mean it like that... I was just referring to the humongous pain it would be to either do this manually or scramble to script it all now!