Closed: kostrzewa closed this issue 11 years ago.
I'm doing another 50 trajectories with pure OpenMP to confirm the behaviour of the m_PCAC measurement. Then I will switch back to pure MPI with reproducerandomnumbers=no to see whether that setting is what makes the pure MPI part so "good" in terms of low variance.
Finally, I will add another 100 or so trajectories using a hybrid version of the code.
I DON'T think this is due to the way that the RNG is reinitialized in source_generation_pion_only. The only way to confirm would be to run serially... that's very impractical for this lattice size, though (although I could let it run for a day or so and get around 30 trajectories).
I don't really know where to look at this point; the obvious culprits are all the functions using omp atomic, although I would expect that to show up in the plaquette too (I guess it does, in a way).
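To illustrate the kind of effect I mean, a purely schematic sketch (not the actual tmLQCD code): with omp atomic the order in which contributions are added depends on thread scheduling, so the result can differ from the serial sum at the level of round-off.

#include <omp.h>

/* schematic only: accumulate contributions with omp atomic; the addition
 * order depends on thread scheduling, so the result may differ from the
 * serial sum at the level of floating-point round-off */
double accumulate(const double *val, int n)
{
  double sum = 0.0;
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
#pragma omp atomic
    sum += val[i];
  }
  return sum;
}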
OK, so it's clearly not due to reproducerandomnumbers. From trajectory 494 onwards we're back to pure MPI (with reproducerandomnumbers=no) and this is what it looks like. I will try hybrid next.
Hi Bartek!
The first thing we should try to exclude is that something in this plot of m_pcac versus t_hmc goes wrong. That's unlikely, I agree.
Second is to check the part doing the contraction for the online measurements. Anything that could go wrong there in the openMP and/or MPI case?
Is there maybe some precision issue in the pure OpenMP inversions for the measurements? Does this fluctuation show up in the PP and the PA, or in only one of them? How can I reproduce this? Does it also show up on 4^4 lattices?
Using the reproduce rng feature, do you still get identical plaquette values after one trajectory?
I would find it highly surprising if there was a deeper problem that escaped all your high-statistics tests. Moreover, the online measurements have been cross-checked with offline measurements, confirming them. But Petros/Roberto observed somewhat too large fluctuations, as they wrote in their emails, which they attributed to too loose precision in the inversions.
I've tried 4^3x8 serial versus openMP, just 10 trajectories, with online measurements after each trajectory. Here is the analysis
> res$dpaopp
t mass dmass ddmass tauint dtauint
1 1 -0.18901415 0.1410771 0.05463891 0.6676308 0.3852338
2 2 0.11720897 0.0578029 0.02890145 0.4258077 0.3878537
3 3 0.05189274 0.1036935 0.05184675 1.1022011 0.8241631
> res1$dpaopp
t mass dmass ddmass tauint dtauint
1 1 -0.18901415 0.1410771 0.05463891 0.6676308 0.3852338
2 2 0.11720897 0.0578029 0.02890145 0.4258077 0.3878537
3 3 0.05189274 0.1036935 0.05184675 1.1022011 0.8241631
> res$MChist.dpaopp
[1] 0.288916018 0.190761577 -0.108701890 0.052108648 -0.057439508
[6] -0.248466599 -0.001712946 0.381267512 -0.185376527 -0.719809641
> res1$MChist.dpaopp
[1] 0.288916018 0.190761577 -0.108701890 0.052108648 -0.057439508
[6] -0.248466599 -0.001712946 0.381267512 -0.185376527 -0.719809641
res is the version with OMP_NUM_THREADS=1 and res1 the one with OMP_NUM_THREADS=4. They agree perfectly, I'd say. Also the plaquette agrees perfectly. The same if I compile without OMP support. MPI I'll try tomorrow...
With or without SSE and with or without halfspinor the results also agree. All with gcc, though...
Thanks for the cross-check, that's certainly very encouraging; it might turn out to be just a compiler problem. My hybrid run shows the same issue, although it is somewhat milder. All my executables are compiled with icc. I will continue testing today.
Second is to check the part doing the contraction for the online measurements. Anything that could go wrong there in the openMP and/or MPI case?
Could it just be round-off? There is a very large accumulation done in online_measurement.c
When MPI is used, the local volume is much smaller and hence these sums are relatively short. As a consequence, when the MPI sum is done, numbers of similar size are added together and the result will be more stable with respect to round-off.
This would also adequately explain why there is no difference between serial and OpenMP (which I just checked and can confirm, albeit only for 4^4 and a mass that is 10 times heavier than in my run). It would also help explain why the hybrid run is less affected.
When my current OpenMP run is done I will add a Kahan summation to the online measurement and we'll try again. I even have a model for this in my PP correlator test code (that I told you about), which parallelizes the computation of Cxx[t] over t with OpenMP and includes a Kahan summation. But one step at a time.
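For reference, this is the kind of compensated summation I have in mind (a schematic sketch, not the actual online_measurement.c code; contrib and nsites are placeholders for the per-site contributions and the local site count):

/* schematic Kahan (compensated) summation for one timeslice of the correlator */
double kahan_sum(const double *contrib, int nsites)
{
  double sum = 0.0, c = 0.0;        /* c carries the accumulated low-order error */
  for (int i = 0; i < nsites; ++i) {
    double y = contrib[i] - c;      /* corrected next term */
    double t = sum + y;             /* low-order bits of y may be lost here */
    c = (t - sum) - y;              /* recover the lost part */
    sum = t;                        /* the next iteration corrects for it */
  }
  return sum;
}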
Does this fluctuation show up in the PP and the PA, or in only one of them?
I have to admit I haven't checked explicitly on a per-correlator basis. I was only looking at the pcac mass history.
How can I reproduce this?
Run the same thing with pure MPI and pure OpenMP and compare the m_pcac histories; alternatively, do half of the run with one parallelization and half with the other.
Does it also show up on 4^4 lattices?
Not sure, I haven't tried yet; given my argument above it's unlikely, but possible.
If there is a problem you should see it by comparing after only one trajectory, shouldn't you?
If there is a problem you should see it by comparing after only one trajectory, shouldn't you?
Yes, absolutely, see below.
When my current openmp run is done I will add a Kahan summation into the online measurement and we'll try again.
I added a Kahan summation here but I see NO difference at all in onlinemeas.* for a test-run at 8^3x16 between Kahan and no Kahan:
11:54 kostrzew@blade8b ~/code/tmLQCD.kost/build_openmp_wgs (etmcmaster|✚3…) $ diff -y -W 100 onlinemeas.000000 onlinemeas.000000.omp16_kahan
1 1 0 2.493487e+01 0.000000e+00 1 1 0 2.493487e+01 0.000000e+00
1 1 1 3.632671e+00 2.436011e+00 1 1 1 3.632671e+00 2.436011e+00
1 1 2 5.727120e-01 2.451563e-01 1 1 2 5.727120e-01 2.451563e-01
1 1 3 7.388951e-02 4.784140e-02 1 1 3 7.388951e-02 4.784140e-02
1 1 4 9.862494e-03 5.357782e-03 1 1 4 9.862494e-03 5.357782e-03
1 1 5 1.953851e-03 5.278829e-04 1 1 5 1.953851e-03 5.278829e-04
1 1 6 1.696005e-04 9.031012e-05 1 1 6 1.696005e-04 9.031012e-05
1 1 7 3.336515e-05 1.689661e-05 1 1 7 3.336515e-05 1.689661e-05
1 1 8 6.778363e-06 0.000000e+00 1 1 8 6.778363e-06 0.000000e+00
2 1 0 7.781387e+00 0.000000e+00 2 1 0 7.781387e+00 0.000000e+00
2 1 1 1.845803e-01 7.916336e-01 2 1 1 1.845803e-01 7.916336e-01
2 1 2 -1.197738e-01 -8.355770e-03 2 1 2 -1.197738e-01 -8.355770e-03
2 1 3 1.815930e-02 4.052258e-03 2 1 3 1.815930e-02 4.052258e-03
2 1 4 3.664388e-03 -1.248370e-03 2 1 4 3.664388e-03 -1.248370e-03
2 1 5 7.197363e-04 -1.158727e-04 2 1 5 7.197363e-04 -1.158727e-04
2 1 6 6.479225e-05 -2.354105e-05 2 1 6 6.479225e-05 -2.354105e-05
2 1 7 1.337388e-06 -9.214752e-06 2 1 7 1.337388e-06 -9.214752e-06
2 1 8 -1.081962e-06 0.000000e+00 2 1 8 -1.081962e-06 0.000000e+00
6 1 0 6.358501e+00 0.000000e+00 6 1 0 6.358501e+00 0.000000e+00
6 1 1 2.536977e+00 -5.387693e-01 6 1 1 2.536977e+00 -5.387693e-01
6 1 2 4.336681e-01 -4.918158e-02 6 1 2 4.336681e-01 -4.918158e-02
6 1 3 5.356017e-02 -2.574008e-02 6 1 3 5.356017e-02 -2.574008e-02
6 1 4 6.028301e-03 -1.937373e-03 6 1 4 6.028301e-03 -1.937373e-03
6 1 5 8.539012e-04 -1.310331e-04 6 1 5 8.539012e-04 -1.310331e-04
6 1 6 8.683758e-05 -1.604958e-05 6 1 6 8.683758e-05 -1.604958e-05
6 1 7 2.742399e-05 -9.454013e-06 6 1 7 2.742399e-05 -9.454013e-06
6 1 8 3.934656e-07 0.000000e+00 6 1 8 3.934656e-07 0.000000e+00
Even with reproducerandomnumbers=yes, I do see a very significant difference between MPI / no MPI on the level of the correlators though:
11:51 kostrzew@blade8b ~/code/tmLQCD.kost (etmcmaster|✚3…) $ diff -y -W 100 build_3D_MPI_hs_pax/onlinemeas.000000 build_openmp_wgs/onlinemeas.000000.omp16_kahan
1 1 0 3.583142e+01 0.000000e+00 | 1 1 0 2.493487e+01 0.000000e+00
1 1 1 3.296734e+00 2.969451e+00 | 1 1 1 3.632671e+00 2.436011e+00
1 1 2 4.057276e-01 3.517819e-01 | 1 1 2 5.727120e-01 2.451563e-01
1 1 3 4.517120e-02 5.054174e-02 | 1 1 3 7.388951e-02 4.784140e-02
1 1 4 5.965195e-03 6.431571e-03 | 1 1 4 9.862494e-03 5.357782e-03
1 1 5 8.324182e-04 9.961091e-04 | 1 1 5 1.953851e-03 5.278829e-04
1 1 6 1.292595e-04 1.470315e-04 | 1 1 6 1.696005e-04 9.031012e-05
1 1 7 1.502888e-05 2.649542e-05 | 1 1 7 3.336515e-05 1.689661e-05
1 1 8 5.902200e-06 0.000000e+00 | 1 1 8 6.778363e-06 0.000000e+00
2 1 0 1.855315e+00 0.000000e+00 | 2 1 0 7.781387e+00 0.000000e+00
2 1 1 1.041653e+00 -1.095293e+00 | 2 1 1 1.845803e-01 7.916336e-01
2 1 2 1.065586e-01 -1.244458e-01 | 2 1 2 -1.197738e-01 -8.355770e-03
2 1 3 1.143322e-02 -1.645176e-02 | 2 1 3 1.815930e-02 4.052258e-03
2 1 4 3.802521e-04 -2.712032e-03 | 2 1 4 3.664388e-03 -1.248370e-03
2 1 5 1.804312e-04 -2.805560e-04 | 2 1 5 7.197363e-04 -1.158727e-04
2 1 6 3.510223e-05 -6.367013e-05 | 2 1 6 6.479225e-05 -2.354105e-05
2 1 7 5.462901e-06 -3.206055e-06 | 2 1 7 1.337388e-06 -9.214752e-06
2 1 8 1.426100e-06 0.000000e+00 | 2 1 8 -1.081962e-06 0.000000e+00
6 1 0 -2.878380e+00 0.000000e+00 | 6 1 0 6.358501e+00 0.000000e+00
6 1 1 2.249188e+00 -1.297361e+00 | 6 1 1 2.536977e+00 -5.387693e-01
6 1 2 2.453926e-01 -1.254185e-01 | 6 1 2 4.336681e-01 -4.918158e-02
6 1 3 1.056075e-02 -1.683040e-02 | 6 1 3 5.356017e-02 -2.574008e-02
6 1 4 3.291887e-03 -2.817410e-03 | 6 1 4 6.028301e-03 -1.937373e-03
6 1 5 4.616233e-04 -5.540282e-04 | 6 1 5 8.539012e-04 -1.310331e-04
6 1 6 7.452629e-05 -8.447372e-05 | 6 1 6 8.683758e-05 -1.604958e-05
6 1 7 8.840458e-06 -1.403312e-05 | 6 1 7 2.742399e-05 -9.454013e-06
6 1 8 -2.264645e-07 0.000000e+00 | 6 1 8 3.934656e-07 0.000000e+00
On the level of the plaquette these runs seem to be completely compatible:
12:51 kostrzew@blade8f ~/code/tmLQCD.kost (etmcmaster|✚3…) $ diff -y -W 118 build_openmp_wgs/omp16_kahan.data build_3D_MPI_hs_pax/mpi_8_3D.data
00000000 0.297755975545 -5.772757032268 3.214227e+02 2 | 00000000 0.297755975545 -5.772757032333 3.214227e+02 2
00000001 0.383681731184 -3.657660451776 3.877053e+01 2 | 00000001 0.383681731184 -3.657660451849 3.877053e+01 2
00000002 0.429726941943 -1.414715887855 4.115317e+00 2 | 00000002 0.429726941943 -1.414715887789 4.115317e+00 2
00000003 0.456816154200 -1.238757090774 3.451321e+00 2 | 00000003 0.456816154200 -1.238757090730 3.451321e+00 2
00000004 0.472229062306 -0.686094532699 1.985944e+00 2 | 00000004 0.472229062306 -0.686094532699 1.985944e+00 2
00000005 0.487670077287 -0.604563556219 1.830453e+00 2 | 00000005 0.487670077287 -0.604563556211 1.830453e+00 2
00000006 0.493795474300 0.128804694687 8.791457e-01 27 | 00000006 0.493795474300 0.128804694752 8.791457e-01 27
00000007 0.493795474300 0.136298800651 8.725819e-01 27 | 00000007 0.493795474300 0.136298800579 8.725819e-01 27
00000008 0.498644966981 -0.231961259276 1.261071e+00 2 | 00000008 0.498644966981 -0.231961259371 1.261071e+00 2
00000009 0.498644966981 0.234397840482 7.910470e-01 27 | 00000009 0.498644966981 0.234397840381 7.910470e-01 27
00000010 0.503439824602 0.233070208727 7.920980e-01 27 | 00000010 0.503439824602 0.233070208807 7.920980e-01 27
00000011 0.505482558884 0.069847415791 9.325361e-01 27 | 00000011 0.505482558884 0.069847415849 9.325361e-01 27
00000012 0.508547235495 0.112196140770 8.938689e-01 27 | 00000012 0.508547235495 0.112196140682 8.938689e-01 27
00000013 0.510466965313 -0.388601017701 1.474916e+00 2 | 00000013 0.510466965313 -0.388601017796 1.474916e+00 2
00000014 0.511048831401 0.168259862352 8.451342e-01 27 | 00000014 0.511048831401 0.168259862337 8.451342e-01 27
00000015 0.511049193730 0.180563802802 8.347994e-01 27 | 00000015 0.511049193730 0.180563802882 8.347994e-01 27
00000016 0.511152568268 0.078752239344 9.242689e-01 27 | 00000016 0.511152568267 0.078752239511 9.242689e-01 27
00000017 0.511152568268 0.323074745895 7.239197e-01 27 | 00000017 0.511152568267 0.323074745771 7.239197e-01 27
00000018 0.513700434563 -0.177689988130 1.194455e+00 2 | 00000018 0.513700434561 -0.177689987970 1.194455e+00 2
00000019 0.513642212879 -0.379668667214 1.461800e+00 2 | 00000019 0.513642212881 -0.379668667163 1.461800e+00 2
And just for good measure, another comparison after one trajectory of the MPI and non-MPI correlators, but with 16 MPI processes this time:
12:52 kostrzew@blade8f ~/code/tmLQCD.kost (etmcmaster|✚3…) $ diff -y -W 118 build_openmp_wgs/onlinemeas.000000 build_3D_MPI_hs_wgs/onlinemeas.000000
1 1 0 2.493487e+01 0.000000e+00 | 1 1 0 3.284327e+01 0.000000e+00
1 1 1 3.632671e+00 2.436011e+00 | 1 1 1 4.224729e+00 3.589291e+00
1 1 2 5.727120e-01 2.451563e-01 | 1 1 2 5.302579e-01 5.559800e-01
1 1 3 7.388951e-02 4.784140e-02 | 1 1 3 7.702127e-02 6.436265e-02
1 1 4 9.862494e-03 5.357782e-03 | 1 1 4 1.164042e-02 8.727955e-03
1 1 5 1.953851e-03 5.278829e-04 | 1 1 5 1.282810e-03 1.408820e-03
1 1 6 1.696005e-04 9.031012e-05 | 1 1 6 1.867620e-04 1.945576e-04
1 1 7 3.336515e-05 1.689661e-05 | 1 1 7 2.810947e-05 2.461970e-05
1 1 8 6.778363e-06 0.000000e+00 | 1 1 8 8.242407e-06 0.000000e+00
2 1 0 7.781387e+00 0.000000e+00 | 2 1 0 2.599950e+00 0.000000e+00
2 1 1 1.845803e-01 7.916336e-01 | 2 1 1 6.436937e-01 -9.233037e-01
2 1 2 -1.197738e-01 -8.355770e-03 | 2 1 2 5.892140e-02 -1.203017e-01
2 1 3 1.815930e-02 4.052258e-03 | 2 1 3 1.946809e-02 -1.400210e-02
2 1 4 3.664388e-03 -1.248370e-03 | 2 1 4 1.817279e-03 -3.063235e-03
2 1 5 7.197363e-04 -1.158727e-04 | 2 1 5 4.982813e-04 -3.622151e-04
2 1 6 6.479225e-05 -2.354105e-05 | 2 1 6 3.976636e-05 -5.470423e-05
2 1 7 1.337388e-06 -9.214752e-06 | 2 1 7 3.289593e-06 -5.885289e-06
2 1 8 -1.081962e-06 0.000000e+00 | 2 1 8 5.087794e-08 0.000000e+00
6 1 0 6.358501e+00 0.000000e+00 | 6 1 0 -9.992991e-01 0.000000e+00
6 1 1 2.536977e+00 -5.387693e-01 | 6 1 1 2.641958e+00 -2.032417e+00
6 1 2 4.336681e-01 -4.918158e-02 | 6 1 2 3.272099e-01 -3.789808e-01
6 1 3 5.356017e-02 -2.574008e-02 | 6 1 3 4.743740e-02 -4.163792e-02
6 1 4 6.028301e-03 -1.937373e-03 | 6 1 4 8.175692e-03 -5.811491e-03
6 1 5 8.539012e-04 -1.310331e-04 | 6 1 5 7.177018e-04 -7.013163e-04
6 1 6 8.683758e-05 -1.604958e-05 | 6 1 6 1.079254e-04 -1.139296e-04
6 1 7 2.742399e-05 -9.454013e-06 | 6 1 7 2.037146e-05 -1.326496e-05
6 1 8 3.934656e-07 0.000000e+00 | 6 1 8 -3.225093e-07 0.000000e+00
And a comparison between two different MPI parallelizations, which also don't match:
12:54 kostrzew@blade8f ~/code/tmLQCD.kost (etmcmaster|✚3…) $ diff -y -W 118 build_3D_MPI_hs_pax/onlinemeas.000000.3D_MPI_hs_pax build_3D_MPI_hs_wgs/onlinemeas.000000
1 1 0 3.583142e+01 0.000000e+00 | 1 1 0 3.284327e+01 0.000000e+00
1 1 1 3.296734e+00 2.969451e+00 | 1 1 1 4.224729e+00 3.589291e+00
1 1 2 4.057276e-01 3.517819e-01 | 1 1 2 5.302579e-01 5.559800e-01
1 1 3 4.517120e-02 5.054174e-02 | 1 1 3 7.702127e-02 6.436265e-02
1 1 4 5.965195e-03 6.431571e-03 | 1 1 4 1.164042e-02 8.727955e-03
1 1 5 8.324182e-04 9.961091e-04 | 1 1 5 1.282810e-03 1.408820e-03
1 1 6 1.292595e-04 1.470315e-04 | 1 1 6 1.867620e-04 1.945576e-04
1 1 7 1.502888e-05 2.649542e-05 | 1 1 7 2.810947e-05 2.461970e-05
1 1 8 5.902200e-06 0.000000e+00 | 1 1 8 8.242407e-06 0.000000e+00
2 1 0 1.855315e+00 0.000000e+00 | 2 1 0 2.599950e+00 0.000000e+00
2 1 1 1.041653e+00 -1.095293e+00 | 2 1 1 6.436937e-01 -9.233037e-01
2 1 2 1.065586e-01 -1.244458e-01 | 2 1 2 5.892140e-02 -1.203017e-01
2 1 3 1.143322e-02 -1.645176e-02 | 2 1 3 1.946809e-02 -1.400210e-02
2 1 4 3.802521e-04 -2.712032e-03 | 2 1 4 1.817279e-03 -3.063235e-03
2 1 5 1.804312e-04 -2.805560e-04 | 2 1 5 4.982813e-04 -3.622151e-04
2 1 6 3.510223e-05 -6.367013e-05 | 2 1 6 3.976636e-05 -5.470423e-05
2 1 7 5.462901e-06 -3.206055e-06 | 2 1 7 3.289593e-06 -5.885289e-06
2 1 8 1.426100e-06 0.000000e+00 | 2 1 8 5.087794e-08 0.000000e+00
6 1 0 -2.878380e+00 0.000000e+00 | 6 1 0 -9.992991e-01 0.000000e+00
6 1 1 2.249188e+00 -1.297361e+00 | 6 1 1 2.641958e+00 -2.032417e+00
6 1 2 2.453926e-01 -1.254185e-01 | 6 1 2 3.272099e-01 -3.789808e-01
6 1 3 1.056075e-02 -1.683040e-02 | 6 1 3 4.743740e-02 -4.163792e-02
6 1 4 3.291887e-03 -2.817410e-03 | 6 1 4 8.175692e-03 -5.811491e-03
6 1 5 4.616233e-04 -5.540282e-04 | 6 1 5 7.177018e-04 -7.013163e-04
6 1 6 7.452629e-05 -8.447372e-05 | 6 1 6 1.079254e-04 -1.139296e-04
6 1 7 8.840458e-06 -1.403312e-05 | 6 1 7 2.037146e-05 -1.326496e-05
6 1 8 -2.264645e-07 0.000000e+00 | 6 1 8 -3.225093e-07 0.000000e+00
I think, therefore (if someone could please cross-check), that we can conclude that the problem originates from the MPI sum and gather operations in the correlator computation.
I will attempt to reproduce this with icc and OpenMPI.
The differences seem too big to be a rounding issue to me. Since this summation is done at the very end of the calculation, there's nothing to enhance the differences.
I tried the same just now with gcc on my laptop and I get a clear difference between 4 and 2 processes... I even used tau=0.000001 and reproducerandomnumbers=yes.
13:14 bartek@artemis ~/code/tmLQCD.kost/build_mpi (etmcmaster|✚1…) $ diff -y -W 100 onlinemeas.000000.mpi_4_1D onlinemeas.000000.mpi_2_1D
1 1 0 2.395947e+01 0.000000e+00 | 1 1 0 2.525002e+01 0.000000e+00
1 1 1 3.709687e+00 2.424652e+00 | 1 1 1 2.803885e+00 2.309774e+00
1 1 2 4.172649e-01 3.800205e-01 | 1 1 2 2.411146e-01 2.188681e-01
1 1 3 5.927897e-02 3.897985e-02 | 1 1 3 2.163117e-02 2.591199e-02
1 1 4 5.108701e-03 4.516274e-03 | 1 1 4 2.383348e-03 3.268691e-03
1 1 5 1.403971e-03 8.718435e-04 | 1 1 5 4.598607e-04 5.547285e-04
1 1 6 2.138396e-04 8.323216e-05 | 1 1 6 5.640558e-05 5.988951e-05
1 1 7 2.243546e-05 9.381115e-06 | 1 1 7 7.864775e-06 8.542043e-06
1 1 8 3.694871e-06 0.000000e+00 | 1 1 8 2.528695e-06 0.000000e+00
2 1 0 -2.712960e+00 0.000000e+00 | 2 1 0 -6.881594e+00 0.000000e+00
2 1 1 9.025073e-01 -9.609200e-02 | 2 1 1 -9.285053e-01 -5.379556e-01
2 1 2 1.301342e-03 2.410228e-02 | 2 1 2 1.556907e-02 -7.776891e-02
2 1 3 2.316897e-03 -7.541878e-03 | 2 1 3 7.572548e-03 -7.184931e-03
2 1 4 8.655893e-04 -3.117261e-04 | 2 1 4 -5.365922e-04 -2.508335e-05
2 1 5 2.192285e-04 -1.870955e-04 | 2 1 5 -1.207693e-04 -1.630122e-05
2 1 6 5.546197e-05 -1.978795e-05 | 2 1 6 5.634186e-06 -8.677705e-07
2 1 7 3.643263e-06 -2.283807e-06 | 2 1 7 2.856764e-06 -1.348469e-06
2 1 8 1.599540e-07 0.000000e+00 | 2 1 8 -6.402346e-07 0.000000e+00
6 1 0 -5.872514e+00 0.000000e+00 | 6 1 0 1.228379e+00 0.000000e+00
6 1 1 1.127542e+00 -3.607082e-01 | 6 1 1 2.861621e-01 -1.159990e+00
6 1 2 1.450568e-01 -2.212542e-01 | 6 1 2 1.274718e-01 -1.419306e-01
6 1 3 2.650210e-02 -1.763487e-02 | 6 1 3 1.316996e-02 -3.497891e-03
6 1 4 4.032474e-03 -3.384757e-03 | 6 1 4 2.194532e-04 -1.582169e-03
6 1 5 1.016316e-03 -4.139272e-04 | 6 1 5 3.260346e-04 -1.254990e-04
6 1 6 1.000620e-04 -2.953921e-05 | 6 1 6 3.996471e-05 -2.192821e-05
6 1 7 4.957542e-06 -4.314978e-06 | 6 1 7 3.662338e-06 -4.128499e-06
6 1 8 7.337025e-07 0.000000e+00 | 6 1 8 -2.604950e-07 0.000000e+00
13:15 bartek@artemis ~/code/tmLQCD.kost/build_mpi (etmcmaster|✚1…) $ diff -y -W 118 mpi_4_1D.data mpi_2_1D.data
00000000 0.121247194262 0.000000000007 1.000000e+00 31 | 00000000 0.121247194262 0.000000000189 1.000000e+00 31
Is it worrying that dH is so vastly different (at this level of precision) even though I'm using reproducerandomnumbers=yes and tau=0.000001?
For these two runs the checksums for the gauge configuration at the end of the trajectory are even exactly the same.
@urbach Could this be the culprit? mpi_time_rank = 0 for all processes?
# Process 1 of 4 on artemis: cart_id 1, coordinates (1 0 0 0)
# Process 2 of 4 on artemis: cart_id 2, coordinates (2 0 0 0)
# Process 0 of 4 on artemis: cart_id 0, coordinates (0 0 0 0)
# Process 3 of 4 on artemis: cart_id 3, coordinates (3 0 0 0)
# My mpi_time_rank = 0, g_proc_coords = (2,0,0,0), g_cart_id = 2
# My mpi_time_rank = 0, g_proc_coords = (1,0,0,0), g_cart_id = 1
# My mpi_time_rank = 0, g_proc_coords = (0,0,0,0), g_cart_id = 0
# My mpi_time_rank = 0, g_proc_coords = (3,0,0,0), g_cart_id = 3
# My mpi_z_rank = 0, g_proc_coords = (0,0,0,0), g_cart_id = 0
# My mpi_z_rank = 2, g_proc_coords = (2,0,0,0), g_cart_id = 2
# My mpi_z_rank = 1, g_proc_coords = (1,0,0,0), g_cart_id = 1
# My mpi_z_rank = 3, g_proc_coords = (3,0,0,0), g_cart_id = 3
# My mpi_SV_rank = 1, g_proc_coords = (1,0,0,0), g_cart_id = 1
# My mpi_SV_rank = 0, g_proc_coords = (0,0,0,0), g_cart_id = 0
# My mpi_SV_rank = 2, g_proc_coords = (2,0,0,0), g_cart_id = 2
# My mpi_SV_rank = 3, g_proc_coords = (3,0,0,0), g_cart_id = 3
# My mpi_ST_rank = 0, g_proc_coords = (2,0,0,0), g_cart_id = 2
Well, I can reproduce the results with a parallelization just in the X and Y directions. Same plaquette value, completely different correlators for runs with 1, 2 and 4 MPI processes. But in that case, it couldn't be mpi_time_rank, could it?
Well, I can reproduce the results with a parallelization just in the X and Y directions. Same plaquette value, completely different correlators for runs with 1, 2 and 4 MPI processes. But in that case, it couldn't be mpi_time_rank, could it?
Can the computation, as it is currently written, even work with XY parallelization?
Can the computation, as it is currently written, even work with XY parallelization?
I may be missing the point, but wouldn't the reduction over g_mpi_time_slices take care of this? In this case, g_mpi_time_slices should just be the whole Cartesian communicator.
Hmm. If I just parallelize in the T direction, however, it seems I get consistent results.
Hmm. If I just parallelize in the T direction, however, it seems I get consistent results.
Interesting, I don't... can you run at debuglevel=5 and see the initial messages?
Just checking to make sure I didn't confuse myself... :)
Alright, this was already checked I guess, but I get agreement for a scalar build and an MPI build run with a single process.
diff -y -W 100 onlinemeas.000006.p1 onlinemeas.000006.s1
1 1 0 2.557430e+01 0.000000e+00 1 1 0 2.557430e+01 0.000000e+00
1 1 1 1.829296e+00 3.758436e+00 1 1 1 1.829296e+00 3.758436e+00
1 1 2 4.322338e-01 0.000000e+00 1 1 2 4.322338e-01 0.000000e+00
2 1 0 -1.204301e+01 0.000000e+00 2 1 0 -1.204301e+01 0.000000e+00
2 1 1 -6.435530e-01 -5.468982e-01 2 1 1 -6.435530e-01 -5.468982e-01
2 1 2 -3.410654e-02 0.000000e+00 2 1 2 -3.410654e-02 0.000000e+00
6 1 0 -1.249877e+01 0.000000e+00 6 1 0 -1.249877e+01 0.000000e+00
6 1 1 6.198682e-01 -1.978413e+00 6 1 1 6.198682e-01 -1.978413e+00
6 1 2 -1.097072e-01 0.000000e+00 6 1 2 -1.097072e-01 0.000000e+00
The same for a parallelization in the T direction.
diff -y -W 100 onlinemeas.000006.p1 onlinemeas.000006.p2t
1 1 0 2.557430e+01 0.000000e+00 1 1 0 2.557430e+01 0.000000e+00
1 1 1 1.829296e+00 3.758436e+00 1 1 1 1.829296e+00 3.758436e+00
1 1 2 4.322338e-01 0.000000e+00 1 1 2 4.322338e-01 0.000000e+00
2 1 0 -1.204301e+01 0.000000e+00 2 1 0 -1.204301e+01 0.000000e+00
2 1 1 -6.435530e-01 -5.468982e-01 2 1 1 -6.435530e-01 -5.468982e-01
2 1 2 -3.410654e-02 0.000000e+00 2 1 2 -3.410654e-02 0.000000e+00
6 1 0 -1.249877e+01 0.000000e+00 6 1 0 -1.249877e+01 0.000000e+00
6 1 1 6.198682e-01 -1.978413e+00 6 1 1 6.198682e-01 -1.978413e+00
6 1 2 -1.097072e-01 0.000000e+00 6 1 2 -1.097072e-01 0.000000e+00
But not for parallelization in the X direction!
diff -y -W 100 onlinemeas.000006.p2t onlinemeas.000006.p2x
1 1 0 2.557430e+01 0.000000e+00 | 1 1 0 2.026491e+01 0.000000e+00
1 1 1 1.829296e+00 3.758436e+00 | 1 1 1 3.489072e+00 3.597169e+00
1 1 2 4.322338e-01 0.000000e+00 | 1 1 2 8.587605e-01 0.000000e+00
2 1 0 -1.204301e+01 0.000000e+00 | 2 1 0 -1.032211e+01 0.000000e+00
2 1 1 -6.435530e-01 -5.468982e-01 | 2 1 1 -2.461695e+00 -1.852010e-01
2 1 2 -3.410654e-02 0.000000e+00 | 2 1 2 -2.805854e-01 0.000000e+00
6 1 0 -1.249877e+01 0.000000e+00 | 6 1 0 -1.925853e-01 0.000000e+00
6 1 1 6.198682e-01 -1.978413e+00 | 6 1 1 8.507125e-01 -1.482761e+00
6 1 2 -1.097072e-01 0.000000e+00 | 6 1 2 9.406330e-02 0.000000e+00
Perhaps the communicators aren't set up properly? At any rate, it's weird that time parallelization seems to cause me no trouble, but it trips you up... Did you compile your code with 4D parallelization?
Perhaps the communicators aren't set up properly? At any rate, it's weird that time parallelization seems to cause me no trouble, but it trips you up... Did you compile your code with 4D parallelization?
no, with 1D parallelization in T (the default)
Could you try slightly larger volumes? Say 8^3x16; since it's only one trajectory and tau can be very small, it is fast.
My guess is that there is an issue in the construction of the timeslice communicators. g_mpi_time_slices and g_mpi_SV_slices are used in two places: online_measurement.c and polyakov_loop.c. So nothing should be affected in the HMC, just these two measurements. And we just got a bug report on the Polyakov loop, too (#251).
Could you try slightly larger volumes?
Sure.
Interesting -- the larger volume matters. Now I also see issues with parallelization in the T direction.
diff -y -W 100 onlinemeas.000000.p1 onlinemeas.000000.p2
1 1 0 3.198275e+01 0.000000e+00 | 1 1 0 3.115061e+01 0.000000e+00
1 1 1 4.723553e+00 2.504199e+00 | 1 1 1 1.450164e+00 3.815449e+00
1 1 2 5.636580e-01 4.603009e-01 | 1 1 2 3.065834e-01 5.968957e-01
1 1 3 6.607852e-02 6.614074e-02 | 1 1 3 5.894926e-02 6.846152e-02
1 1 4 1.067107e-02 4.709285e-03 | 1 1 4 5.882510e-03 7.031022e-03
1 1 5 1.378402e-03 6.325933e-04 | 1 1 5 1.007201e-03 1.206607e-03
1 1 6 1.849708e-04 1.058444e-04 | 1 1 6 1.473279e-04 1.805708e-04
1 1 7 2.447191e-05 1.709129e-05 | 1 1 7 3.426484e-05 2.774201e-05
1 1 8 4.931717e-06 0.000000e+00 | 1 1 8 1.056044e-05 0.000000e+00
[snip]
Perhaps this is redundant, but I still get identical results for the scalar build and the MPI one run with a single process.
hmm, interesting...
One last test, since you already have those numbers: can you try an 8^4 volume (rather than T=16)?
I will do this, but I'm attending a seminar now. Will get back to you in an hour.
Oh okay, don't worry then, I just ran the test and also don't get agreement. (I was thinking that perhaps something was going wrong in the calculation of some side length.)
Perhaps this is redundant, but I still get identical results for the scalar build and the MPI one run with a single process.
I can confirm having tested this too.
From what I understand of the code, I cannot really see why all processes should get the same mpi_time_rank. In MPI_Comm_split they all have different "colors" (g_proc_coords[0]).
However, as a consequence of being assigned the same mpi_time_rank, the SV_slices will be wrongly attributed. Correct?
Hmm, okay, so going a bit further: with TX parallelization the logic seems to work correctly, but something in the reduction is a bit strange. See res and mp_res below:
t:0 res: 1.219950 mp_res: 0.000000 coords: 0 1 0 0
t:1 res: 0.228919 mp_res: 0.000000 coords: 0 1 0 0
t:2 res: 0.553401 mp_res: 0.000000 coords: 0 1 0 0
t:3 res: 5.602785 mp_res: 0.000000 coords: 0 1 0 0
t:0 res: 50.466172 mp_res: 0.000000 coords: 1 1 0 0
t:1 res: 531.968764 mp_res: 0.000000 coords: 1 1 0 0
t:2 res: 74.702616 mp_res: 0.000000 coords: 1 1 0 0
t:3 res: 13.107516 mp_res: 0.000000 coords: 1 1 0 0
t:0 res: 45.302721 mp_res: 95.768892 coords: 1 0 0 0
t:1 res: 469.846250 mp_res: 1001.815014 coords: 1 0 0 0
t:2 res: 54.647303 mp_res: 129.349919 coords: 1 0 0 0
t:3 res: 4.973098 mp_res: 18.080614 coords: 1 0 0 0
t:0 res: 0.732196 mp_res: 1.952146 coords: 0 0 0 0
t:1 res: 0.180954 mp_res: 0.409872 coords: 0 0 0 0
t:2 res: 0.899016 mp_res: 1.452417 coords: 0 0 0 0
t:3 res: 7.200939 mp_res: 12.803724 coords: 0 0 0 0
It seems like the processes at coordinates 1 1 and 0 1 end up with vanishing Cpp[t]? The processes at 1 0 and 0 0 seem to end up with the correct value after the reduction. I also checked the 1-dimensional parallelization and the local value matches the MPI-reduced value, as expected.
Now after the gather operation (-1 means "outside of the local timeslice"):
t:0 sCpp[t]: 2.985231 Cpp[t]: 0.000000 coords: 1 0 0 0
t:1 sCpp[t]: 31.227775 Cpp[t]: 0.000000 coords: 1 0 0 0
t:2 sCpp[t]: 4.031992 Cpp[t]: 0.000000 coords: 1 0 0 0
t:3 sCpp[t]: 0.563594 Cpp[t]: 0.000000 coords: 1 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:0 sCpp[t]: 0.060851 Cpp[t]: 0.060851 coords: 0 0 0 0
t:1 sCpp[t]: 0.012776 Cpp[t]: 0.012776 coords: 0 0 0 0
t:2 sCpp[t]: 0.045274 Cpp[t]: 0.045274 coords: 0 0 0 0
t:3 sCpp[t]: 0.399107 Cpp[t]: 0.399107 coords: 0 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 2.985231 coords: 0 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 31.227775 coords: 0 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 4.031992 coords: 0 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.563594 coords: 0 0 0 0
It certainly seems like g_cart_id=0 has all the correct information gathered together... Perhaps the problem is not with the communicators after all?
For 1D parallelization everything seems to work fine too:
t:0 res: 1.119806 mp_res: 1.119806 coords: 0 0 0 0
t:1 res: 0.456280 mp_res: 0.456280 coords: 0 0 0 0
t:0 res: 2.067068 mp_res: 2.067068 coords: 1 0 0 0
t:1 res: 20.173998 mp_res: 20.173998 coords: 1 0 0 0
t:0 res: 101.452179 mp_res: 101.452179 coords: 2 0 0 0
t:1 res: 918.865188 mp_res: 918.865188 coords: 2 0 0 0
t:0 res: 116.694084 mp_res: 116.694084 coords: 3 0 0 0
t:1 res: 12.207416 mp_res: 12.207416 coords: 3 0 0 0
t:0 sCpp[t]: 0.034906 Cpp[t]: 0.034906 coords: 0 0 0 0
t:1 sCpp[t]: 0.014223 Cpp[t]: 0.014223 coords: 0 0 0 0
t:2 sCpp[t]: -1.000000 Cpp[t]: 0.064433 coords: 0 0 0 0
t:3 sCpp[t]: -1.000000 Cpp[t]: 0.628848 coords: 0 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 3.162386 coords: 0 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 28.642130 coords: 0 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 3.637495 coords: 0 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.380520 coords: 0 0 0 0
t:0 sCpp[t]: 0.064433 Cpp[t]: 0.000000 coords: 1 0 0 0
t:1 sCpp[t]: 0.628848 Cpp[t]: 0.000000 coords: 1 0 0 0
t:2 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:3 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 1 0 0 0
t:0 sCpp[t]: 3.162386 Cpp[t]: 0.000000 coords: 2 0 0 0
t:1 sCpp[t]: 28.642130 Cpp[t]: 0.000000 coords: 2 0 0 0
t:2 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:3 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 2 0 0 0
t:0 sCpp[t]: 3.637495 Cpp[t]: 0.000000 coords: 3 0 0 0
t:1 sCpp[t]: 0.380520 Cpp[t]: 0.000000 coords: 3 0 0 0
t:2 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
t:3 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
t:4 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
t:5 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
t:6 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
t:7 sCpp[t]: -1.000000 Cpp[t]: 0.000000 coords: 3 0 0 0
From what I understand of the code, I cannot really see why all processes should get the same mpi_time_rank. In MPI_Comm_split they all have different "colors" (g_proc_coords[0]).
I also believe I now understand why this is the case and why this should be so.
Are we sure that the node with g_cart_id = 0 will always end up having rank 0 in one of the sub-communicators?
Just found the answer: since key is set to g_cart_id in the splitting, g_cart_id = 0 will always be the 0th process of some subgroup.
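For reference, a schematic of the splitting as I read it (color = g_proc_coords[0] and key = g_cart_id, as discussed above; everything else here is illustrative rather than the actual tmLQCD code):

#include <mpi.h>

/* schematic: split the Cartesian communicator on the time coordinate; with
 * key = cart_id, the process with g_cart_id = 0 is always rank 0 of its subgroup */
void make_time_slice_comm(MPI_Comm cart_comm, const int proc_coords[4], int cart_id,
                          MPI_Comm *time_slices, int *time_rank)
{
  int color = proc_coords[0];   /* group processes sharing the same time coordinate */
  int key   = cart_id;          /* ordering within each subgroup */
  MPI_Comm_split(cart_comm, color, key, time_slices);
  MPI_Comm_rank(*time_slices, time_rank);
}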
It seems there's no subtle issue with the rank assignments. I replaced the MPI_Reduce and MPI_Gather calls by MPI_Allreduce and MPI_Allgather. That should take care of any rank-assignment weirdness, but it doesn't matter: it doesn't change the (parallelization-dependent) value of the correlator whatsoever.
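For concreteness, roughly what that replacement looks like (a sketch only: the buffer names, the local extent T, and which communicator carries which operation reflect my reading of the code and the debug output, so treat them as assumptions):

#include <stdlib.h>
#include <mpi.h>

/* schematic: sum the local per-timeslice contributions over the spatial
 * subgroup, then collect the T local values from every time rank; Cpp must
 * hold T values per time rank */
void reduce_and_gather(double *sCpp, double *Cpp, int T,
                       MPI_Comm time_slices, MPI_Comm SV_slices)
{
  double *res = malloc(T * sizeof(double));

  /* spatial sum inside each timeslice subgroup (was MPI_Reduce) */
  MPI_Allreduce(sCpp, res, T, MPI_DOUBLE, MPI_SUM, time_slices);

  /* collect every time rank's T values into the full correlator (was MPI_Gather) */
  MPI_Allgather(res, T, MPI_DOUBLE, Cpp, T, MPI_DOUBLE, SV_slices);

  free(res);
}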
Actually, the correlator must be parallelization dependent... I just remembered that the RNG is reset in source_generation... It shouldn't have an effect on the variance though, unless in the case of MPI the Z4 wall-source is of better quality because we're dealing with many different RNGs?
Actually, the correlator must be parallelization dependent... I just remembered that the RNG is reset in source_generation... It shouldn't have an effect on the variance though, unless in the case of MPI the Z4 wall-source is of better quality because we're dealing with many different RNGs?
Speaking of which, that's wrong, isn't it? All those RNGs should be started with the same seed, as they are effectively in "reproducerandomnumbers" mode. All RNGs generate the same number of random numbers, but a number is only used when the corresponding global coordinate lies on the node.
All RNGs generate the same number of random numbers, but a number is only used when the corresponding global coordinate lies on the node.
I was just thinking the same thing. If we don't want that behavior in general, we should at least implement it for testing purposes...
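A minimal sketch of that behaviour, assuming a ranlxd-style generator (is_local(), local_index() and the explicit extent arguments are illustrative placeholders, not the actual source_generation code):

#include <complex.h>

/* schematic: identical seed on every process, every process draws the full
 * global stream, but a drawn number is only used if the corresponding global
 * site lives on this process */
extern void rlxd_init(int level, int seed);
extern void ranlxd(double *r, int n);
extern int  is_local(int t, int x, int y, int z);      /* placeholder */
extern int  local_index(int t, int x, int y, int z);   /* placeholder */

void z4_wall_source(double _Complex *src, int t0, int seed,
                    int LX, int LY, int LZ)             /* global spatial extents */
{
  double r[2];
  rlxd_init(2, seed);                                   /* same seed everywhere */
  for (int x = 0; x < LX; ++x)
    for (int y = 0; y < LY; ++y)
      for (int z = 0; z < LZ; ++z) {
        ranlxd(r, 2);                                   /* drawn on all processes ... */
        if (is_local(t0, x, y, z))                      /* ... used only on the owner */
          src[local_index(t0, x, y, z)] =
            (r[0] < 0.5 ? 1.0 : -1.0) + (r[1] < 0.5 ? 1.0 : -1.0) * I;
      }
}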
There we go, I'm glad that's dealt with :) :
16:37 bartek@artemis ~/code/tmLQCD.kost/build_mpi (etmcmaster|✚4…) $ diff -y -W 100 onlinemeas.000000.mpi_4_1D_repro onlinemeas.000000.mpi_2_1D_repro
1 1 0 3.461198e+01 0.000000e+00 1 1 0 3.461198e+01 0.000000e+00
1 1 1 2.365384e+00 2.891736e+00 1 1 1 2.365384e+00 2.891736e+00
1 1 2 3.562455e-01 4.134450e-01 1 1 2 3.562455e-01 4.134450e-01
1 1 3 6.168106e-02 4.594548e-02 1 1 3 6.168106e-02 4.594548e-02
1 1 4 1.191513e-02 0.000000e+00 1 1 4 1.191513e-02 0.000000e+00
2 1 0 1.431370e+00 0.000000e+00 2 1 0 1.431370e+00 0.000000e+00
2 1 1 1.462085e+00 -1.246998e+00 2 1 1 1.462085e+00 -1.246998e+00
2 1 2 1.237051e-01 -2.307649e-01 2 1 2 1.237051e-01 -2.307649e-01
2 1 3 5.317663e-03 -8.341291e-03 2 1 3 5.317663e-03 -8.341291e-03
2 1 4 8.496635e-03 0.000000e+00 2 1 4 8.496635e-03 0.000000e+00
6 1 0 4.289269e+00 0.000000e+00 6 1 0 4.289269e+00 0.000000e+00
6 1 1 5.386116e-01 -1.821203e+00 6 1 1 5.386116e-01 -1.821203e+00
6 1 2 1.647572e-01 -2.876740e-01 6 1 2 1.647572e-01 -2.876740e-01
6 1 3 4.147460e-02 -2.877868e-02 6 1 3 4.147460e-02 -2.877868e-02
6 1 4 1.312262e-03 0.000000e+00 6 1 4 1.312262e-03 0.000000e+00
I was just thinking the same thing. If we don't want that behavior in general, we should at least implement it for testing purposes...
We have that already, except that source_generation uses its own "repro" mode, but the seeds were initialized wrongly! I just think that the seed used for this purpose should also depend on the seed in the input file, so that from run to run we avoid using the same random numbers for the measurements in particular (which is the case right now) and for source generation in general, as this will have implications for correlated fits which use samples from various ensembles.
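Something along these lines is what I have in mind (purely a sketch; random_seed and nstore stand for the input-file seed and the trajectory counter, and the mixing constant is arbitrary):

/* schematic: derive the measurement seed from the input-file seed and the
 * trajectory counter, and use the SAME value on every process (no g_cart_id) */
unsigned int measurement_seed(unsigned int random_seed, unsigned int nstore)
{
  return random_seed + nstore * 0x9e3779b9u;   /* any fixed mixing would do */
}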
We have that already, except that source_generation uses its own "repro" mode, but the seeds were initialized wrongly!
Yes, I formulated that badly -- I meant we should just use a single seed on all nodes at least for now, even if for some obscure reason the different seeds had to be there.
There we go, I'm glad that's dealt with :) :
Awesome! It shouldn't really matter for the variance, as you said. But random numbers do weird things, as we've seen. Perhaps it's worth checking again?
Yes, I formulated that badly -- I meant we should just use a single seed on all nodes at least for now, even if for some obscure reason the different seeds had to be there.
I don't think there is a reason, I think it's a simple oversight.
Awesome! It shouldn't really matter for the variance, as you said. But random numbers do weird things, as we've seen. Perhaps it's worth checking again?
Yes, will do, absolutely.
Btw, all the source generators are affected in this way except for the nucleon one and the extended pion source (the former doesn't use g_cart_id in the seed computation, while the latter doesn't use random numbers at all).
And as a final note, serial and MPI agree now. I will fix this tomorrow.
I think we have another problem of increased variance due to parallelization; a possible culprit could be OpenMP, but I'm not 100% sure yet. It is not visible in the plaquette expectation value, but in the measurement of m_{PCAC} it is very evident that something fishy is going on...
I'm currently doing an Nf=2 run at 12^3x20 to learn about basic measurements from beginning to end, and I noticed that the online measurement of m_{PCAC} became much more stable (with respect to variance) when I started running with pure MPI on many nodes rather than doing pure OpenMP on one machine.
I had reproducerandomnumbers=yes during the MPI run, which I'm going to change later this afternoon before adding some trajectories. For now, see the plot below, where the first 31 trajectories are done using OpenMP only on one node. The calm section after that was run with pure MPI, while the end was run with pure OpenMP again. I haven't tested the hybrid code yet. Note that this is a tuning run and was restarted from a run with a different kappa, so the first few trajectories don't necessarily say much.
As you can see, nothing of the sort shows up in the plaquette, although perhaps the OpenMP tail is a bit too smooth, with some long-range (10-trajectory) oscillation?
This is very worrying. Could it be due to the way the RNG is initialized during source generation?