kostrzewa closed this issue 11 years ago
Is a comment missing here, or did you delete it?
I deleted it because what it said was wrong. I have some good news that I will share as soon as I get to work.
I'm curious...!
I started two sets of high statistics runs using the hmc2 sample file and the updated InterleavedNDTwistedClover branch yesterday. In one run I set reproducerandomnumbers to yes and in the other to no.
They are currently at between 120k and 200k trajectories and the relevant plots are here:
http://www-zeuthen.desy.de/~kostrzew/random_test/
Now not everything is roses yet, because the hybrid code crashes with an MPI problem (which I will report separately), but this is indeed looking much better. I will do an 8^4 run with the new code as well, because it seems like there is no deviation anymore, even though the run should still suffer from the "local volume" problem, as Karl called it. Notice also the tiny errors on the plaquette compared to the reference. I hope this is not an underestimate for some silly reason.
I will also add another hmc2 run with a different seed, just to make sure!
Before wasting many core-hours of computing time, I will do an hmc1 run first, though.
Okay, with a different seed the situation for hmc2 is unchanged. However, for hmc1 the discrepancy still arises. Maybe there's another point in the code where the random number flag is used incorrectly (and this is fixed by the polynomial call)? See the plots in this second run:
for hmc1 one should see the problem after exactly one trajectory, if there is a problem: up to rounding, any parallelisation must now give exactly the same plaquette value after one trajectory.
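To spell out why that must hold, here is a toy sketch (a toy LCG stands in for the real ranlux generator, and the rank-by-rank distribution scheme is only illustrative, not the actual tmLQCD implementation): in repro mode every process runs the same global sequence and keeps only its own sites, so the random fields cannot depend on the parallelisation.

```c
#include <stdint.h>

/* Toy stand-in for the RNG (the real code uses Luescher's ranlux);
 * a 64-bit LCG is enough to illustrate the scheme. */
static uint64_t lcg_state;
static void toy_seed(uint64_t s) { lcg_state = s; }
static double toy_rand(void) {
  lcg_state = lcg_state * 6364136223846793005ULL + 1442695040888963407ULL;
  return (double)(lcg_state >> 11) / 9007199254740992.0;
}

/* "repro" mode: every rank runs the SAME global sequence over the full
 * volume and keeps only its own sites, so the random fields (and hence
 * the plaquette after one trajectory) are bitwise independent of how
 * the lattice is split up, apart from rounding in the reductions. */
static void fill_repro(double *local, int rank, int local_vol, int nranks) {
  toy_seed(12345ULL); /* identical global seed on every rank */
  for (int r = 0; r < nranks; r++)
    for (int i = 0; i < local_vol; i++) {
      double x = toy_rand();
      if (r == rank) local[i] = x; /* keep only this rank's sites */
    }
}

/* returns 1 if every "rank" reproduces its slice of the serial field */
int repro_matches_serial(void) {
  enum { LVOL = 8, NRANKS = 4 };
  double serial[LVOL * NRANKS], local[LVOL];
  fill_repro(serial, 0, LVOL * NRANKS, 1); /* serial reference run */
  for (int rank = 0; rank < NRANKS; rank++) {
    fill_repro(local, rank, LVOL, NRANKS);
    for (int i = 0; i < LVOL; i++)
      if (local[i] != serial[rank * LVOL + i]) return 0;
  }
  return 1;
}
```

The price is that every rank generates the full-volume sequence, which is why repro mode is slower but parallelisation-independent.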
and for hmc1 I'm not sure that there is still a problem....
when reproducerandomnumbers=yes you are right (I'm running a "repro" job now for hmc1), but for the situation shown above [1] the runs were quite short two hours ago; the effect is now more visible (web not updated yet)
[1] reproducerandomnumbers=no for runs without "repro" qualifier in name
Okay, I just updated the plots for random_test_2
reproducerandomnumbers=yes seems to make the first 10 to 20 trajectories or so reproducible between parallelizations
okay, then rounding errors hit, right?
I would assume so, yes. It's a good test nonetheless, makes it much easier to spot issues!
Now to the current situation in the code: some "random_*" functions are still called with mnl->rngrepro, which is set to 0 by default. I don't know whether it is set correctly anywhere, though... a bit hard to find out. What do you think?
also, index_jd is called with repro=0 hardcoded
I know, but this is never called...! I don't see a proper way to update everything right now.
it seems to me like mnl->rngrepro should always be set, as it is for the gauge monomials (in monomial/monomial.c)
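A minimal sketch of what I mean, assuming the struct and flag names from tmLQCD (`monomial`, `rngrepro`, the global input-file flag) but otherwise toy code, not the real implementation:

```c
/* Sketch only: every monomial type should inherit the global
 * reproducerandomnumbers flag at init time, the way the gauge monomials
 * already do in monomial/monomial.c, instead of keeping the default 0. */
typedef struct {
  int type;
  int rngrepro; /* per-monomial copy of the global repro flag */
} monomial;

static int reproducerandomnumbers = 1; /* global flag from the input file */

void init_monomial(monomial *mnl, int type) {
  mnl->type = type;
  /* set unconditionally, for every monomial type */
  mnl->rngrepro = reproducerandomnumbers;
}
```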
that still doesn't explain though why there is still a discrepancy with reproducerandomnumbers=no
I am puzzled by this hmc1 "problem"...
It must be a problem with DETRATIO, mustn't it?
oh wait, I misread the conditional... it's working correctly then, sorry
i can do an hmc0 run to confirm; it doesn't take very long
ah, maybe I just didn't read your plots correctly. What's the situation now:
1) does hmc1 still have problems? 2) does repro=no still have problems?
yeah, hmc0 would be important to know...
1) does hmc1 still have problems?
correct
2) does repro=no still have problems?
not for the hmc2 parameter set. For hmc1 there is a bias; for hmc1 with repro=yes there also seems to be a bias, but the run is too short to say for now
but still with repro = 1 the first 10 to 20 trajectories are identical for hmc1?
OpenMP on the left, 3D MPI on the right:
00000001 0.305967319585 -1.633352120878 | 00000001 0.305967319585 -1.633352120881
00000002 0.406734913825 -0.964550040854 | 00000002 0.406734913825 -0.964550040857
00000003 0.453413283225 0.041172976406 | 00000003 0.453413283225 0.041172976401
00000004 0.489977234298 -0.507684198855 | 00000004 0.489977234298 -0.507684198861
00000005 0.516303270974 -0.394777057898 | 00000005 0.516303270974 -0.394777057903
00000006 0.548971168958 0.211143684605 | 00000006 0.548971168958 0.211143684603
00000007 0.563896158605 -0.311332349235 | 00000007 0.563896158605 -0.311332349243
00000008 0.581482575507 -0.486136112204 | 00000008 0.581482575507 -0.486136112209
00000009 0.595358373600 -0.273741605512 | 00000009 0.595358373600 -0.273741605519
00000010 0.606009779709 -0.553262722763 | 00000010 0.606009779709 -0.553262722769
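The first column (plaquette) is bitwise identical while the second drifts in the last digits: the classic signature of a non-associative parallel reduction, where the summation order changes between parallelisations. A self-contained illustration (nothing here is tmLQCD code):

```c
/* Summing the same numbers in a different order can change the result
 * in the last digits, because floating-point addition is not
 * associative. A parallel reduction (OpenMP threads or MPI_Allreduce)
 * effectively reorders the sum between parallelisations. */
double sum_serial(const double *x, int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++) s += x[i];
  return s;
}

double sum_chunked(const double *x, int n, int nchunks) {
  double partial[64], s = 0.0; /* assumes nchunks <= 64 and nchunks divides n */
  int len = n / nchunks;
  for (int c = 0; c < nchunks; c++) {
    partial[c] = 0.0; /* each "rank" sums its own chunk... */
    for (int i = 0; i < len; i++) partial[c] += x[c * len + i];
  }
  for (int c = 0; c < nchunks; c++) s += partial[c]; /* ...then reduce */
  return s;
}
```

With a deliberately extreme input like {1.0, 1e100, -1e100, 1.0} the serial sum gives 1.0 but the two-chunk sum gives 0.0; in the HMC the effect is only at the level of the last few digits, as in the table above, but it compounds trajectory by trajectory.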
and those two differ by a bias at the end? That would be very strange...!?
I don't know yet, we will see in an hour or so
so, definitely hmc1 has a problem with repro=0...?
if you look at hmc1_expvals.pdf in random_test_2 it certainly looks like it used to before
err, i mean hmc1_expvals!
Although the ultimate test will be one of the 8^4 runs with 8 fermions, I think. There are no resources right now, but I might be able to start one tomorrow with two or three parallelizations and both repro=yes and repro=no.
although I don't really follow why the changes done to the polynomial initialisation should have an effect on all the other monomials (of course, the change to repro does have an effect everywhere!)
Morning, I've just updated the plots with the current results. For hmc2, for both seeds, everything is working well for repro=0 and repro=1 (random_test and random_test_2).
For hmc1 and hmc0 only the repro=1 situation is satisfactory. (random_test_2)
I will run a repro=1,V=8^4,Nf=8 run with and without clover term and Wilson action now to confirm that the bug (or at least its effect on the plaquette expectation value) is really gone in this situation.
concerning this issue with hmc1 or hmc0 and repro=0, did someone have a look at my code? I should find some time to work on this again this afternoon.
I'm planning to take a look as well after lunch.
Okay, I'm seeing very good results in the 8^4 runs with repro=1 which made me look a bit more carefully at the code.
Here's a list of remaining issues that need to be fixed to get rid of this bug:
| to fix | issue |
| --- | --- |
| z2_random_spinor_field | needs a repro parameter |
| random_jacobi_field | needs a repro parameter |
| P_M_eta.c | calls random_spinor_field_lexic with repro=0 hardcoded |
| index_jd.c | calls random_spinor_field with repro=0 hardcoded |
and that really seems to be it!
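For the first two entries the fix is a signature change. A hypothetical sketch of its shape (toy LCG and a plain double array instead of the real ranlux and spinor fields; the function name only mirrors the list above):

```c
#include <stdint.h>

static uint64_t z2_state = 1;
static uint64_t z2_next(void) { /* toy LCG, stand-in for ranlux */
  return z2_state = z2_state * 6364136223846793005ULL + 1442695040888963407ULL;
}

/* The proposed change: thread a repro parameter through instead of
 * hardcoding 0. In repro mode the noise comes from the global sequence,
 * so it is identical for every parallelisation; here that is modelled
 * by resetting to a fixed seed (illustration only). */
void z2_random_field(double *s, int vol, int repro) {
  if (repro) z2_state = 42; /* fixed global seed, illustration only */
  for (int i = 0; i < vol; i++)
    s[i] = (z2_next() >> 63) ? 1.0 : -1.0; /* Z2 noise: +1 or -1 */
}
```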
Okay, after going through the code a little less carefully than I had hoped and annoying the heck out of Carsten, I think that my comment above is the conclusion. To be adjusted:
z2_random_spinor_field, random_jacobi_field, P_M_eta.c, index_jd.c
especially the first could be problematic if z2 noise is used for inversions, because it seems like the repro=1 mode is necessary for the random number generator to work correctly... Interestingly, this was also the original state of the code around 2005. Only in 2006-2007 was the repro flag introduced.
What I mean to say is that perhaps it is not correct to run several ranluxes with different seeds in parallel, maybe the correlation between different chains is bigger than would be expected? Or maybe this correlation is to be expected? What do you think?
By the way, here's a work in progress of the 8^4 runs without (mpihmc4) and with (mpihmc8) the clover term, at repro=1.
z2_random_spinor_field is only used in invert for the mode number computation. For the other functions you are also correct, and they are also not used during the HMC. But they need to be adjusted to work properly as well.
The ranlux thing could be tested very easily. But I'd be very surprised...
What surprises me so much is that hmc0 and hmc2 both have a det monomial, and in one case it's working, but not in the other??
I agree, I was also very surprised about that. Will think about it some more tomorrow...
so maybe for the time being repro=1 should become the default anyhow...
I think so because even hmc2 is showing differences without it now with full statistics. (all plots updated btw)
Could you try once with changing the way the local seed is computed
loc_seed = seed + g_proc_id;
and please increase the rlxd_level to 2:
RanluxdLevel = 2
just "+ g_proc_id" on purpose or some more elaborate scheme? (but not based on coords)
could also be g_cart_id; it's just to get an idea. We could also drop the nstore dependence right now...
well, i launch with nstore=0 anyway, so i think that's fine. Dropping the dependence would be problematic in my opinion, because we would explicitly reproduce chains on continuation runs or restarts.
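The suggested seeding scheme, as I understand it (illustrative function; `seed`, `g_proc_id`, and `nstore` correspond to tmLQCD names, the rest is only a sketch of the open question):

```c
/* Per-process seed derived from the MPI process id, as suggested above.
 * Whether to keep the nstore (continuation counter) dependence is the
 * open question: dropping it would reproduce chains on restarts. The
 * additive combination here is an assumption, not the actual code. */
int local_seed(int seed, int g_proc_id, int nstore, int keep_nstore_dep) {
  int loc_seed = seed + g_proc_id;
  if (keep_nstore_dep)
    loc_seed += nstore; /* shift the seed on every continuation run */
  return loc_seed;
}
```

Together with RanluxdLevel = 2 in the input file, as proposed.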
I can't even begin to explain how glad I am that I invested the time to script everything, from rebuilding 12 different versions of the software, through making those jobscripts and submitting them, to doing the analysis...
you'll get a medal!
hehe, I don't mean it like that... I was just referring to the humongous pain it would be to either do this manually or scramble to script it all now!
I'd like to use this issue to discuss the consequences of fixing the reproducerandomnumbers input parameter.