etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Nf=8, clover, 32 or 128 MPI processes failure #181

Closed: kostrzewa closed this issue 11 years ago

kostrzewa commented 11 years ago

This is the weirdest bug I've encountered so far. For my 8^4 Nf=8 tests I wrote two clover-improved versions (mpihmc[7,8]) of the mpihmc[3,4] input files. (they were in the tar.gz sent with my short write-up)

The input files work with high acceptance with all parallelizations. However, as soon as I try running with 32 or 128 MPI processes I get no acceptance. It does not matter in which direction I increase the number of MPI processes from 16. (4x2x2x2 gives the same result as 2x2x4x2)
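
For orientation, here is the per-process local lattice size implied by these decompositions (a standalone toy, not tmLQCD code; the 128-process grid shown is just one possibility, since the actual grid was not stated). On 8^4, any 32- or 128-process grid forces at least one local extent down to 2, while the 16-process 2x2x2x2 grid keeps all local extents at 4; whether that is relevant to the failure is of course only a guess at this point.

#include <stdio.h>

int main(void) {
  const int L = 8;  /* global extent in each direction of the 8^4 tests */
  const int grids[][4] = {
    {2, 2, 2, 2},                /* 16 processes: works                  */
    {4, 2, 2, 2}, {2, 2, 4, 2},  /* two of the failing 32-process grids  */
    {4, 4, 4, 2}                 /* one possible 128-process grid        */
  };
  const int ngrids = (int)(sizeof(grids) / sizeof(grids[0]));
  for (int i = 0; i < ngrids; ++i) {
    const int np = grids[i][0] * grids[i][1] * grids[i][2] * grids[i][3];
    printf("%dx%dx%dx%d (%3d procs) -> local extents %d %d %d %d\n",
           grids[i][0], grids[i][1], grids[i][2], grids[i][3], np,
           L / grids[i][0], L / grids[i][1], L / grids[i][2], L / grids[i][3]);
  }
  return 0;
}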

I tried running with the new "reproducerandomnumbers=yes" implementation but this does not work either. Here's an output.data:

00000001 0.122628291572 20.663801733710 1.061268e-09 32 574 32 573 32 575 32 573 0 6.242303e+00 5.104959e-02
00000001 0.122628291572 2972.858771419138 0.000000e+00 32 575 32 575 32 575 32 578 0 5.936613e+00 5.104959e-02
00000001 0.122628291572 91.595371940770 1.662017e-40 32 575 32 574 32 575 32 576 0 6.138474e+00 5.104959e-02

With 16 MPI processes it works just fine:

00000001 0.323921541997 -18.309050612515 8.943750e+07 33 574 33 575 33 575 33 575 1 4.111015e+00 9.095448e-02
00000002 0.426302397742 -9.813193841299 1.827326e+04 32 576 32 574 32 576 32 576 1 4.143917e+00 1.636234e-01
kostrzewa commented 11 years ago

I think this is another indication that in addition to the problem in the polynomial something else is still going wrong!

kostrzewa commented 11 years ago

Note that the other input files without the clover term work well with 32 and 128 MPI processes, so perhaps this is a peculiarity in clovertrlog? Maybe with other parallelizations certain cancellations take place which don't occur with 32 and 128 processes?

urbach commented 11 years ago

to be honest, currently I am mainly concerned about possible problems with our old set-up. But I'll try to have a look at your clover bug. This is clearly not related to the random numbers, as far as I can see...

kostrzewa commented 11 years ago

Indeed, but I think this is related and I wanted to document it in any case.

Also, I've just had a failure with 8 MPI processes, both with and without the clover term, while OpenMP and 16 MPI processes work fine... I really don't know what to make of this!

kostrzewa commented 11 years ago

Also, 3D with 16 MPI processes failed for both input files...

kostrzewa commented 11 years ago

Okay, I selected a different seed and it works for 8 and 16 MPI processes. The failing seed was 999, the working seed is 5454. With this seed the run with 32 MPI processes still fails though!

urbach commented 11 years ago

then just check what happens at the initialisation of the random number generator?!
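
A minimal, generic way to check that (a sketch, not the tmLQCD initialisation; the derivation "base_seed + rank" is only a placeholder for whatever start.c actually does): have every rank report the seed it would hand to its RNG and print them all from rank 0, so two runs with different process grids can be diffed directly.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nproc;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  const int base_seed = (argc > 1) ? atoi(argv[1]) : 999;
  int my_seed = base_seed + rank;   /* placeholder seed derivation */

  int *all = NULL;
  if (rank == 0) all = malloc(nproc * sizeof(int));
  MPI_Gather(&my_seed, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (int r = 0; r < nproc; ++r) printf("rank %4d seed %d\n", r, all[r]);
    free(all);
  }
  MPI_Finalize();
  return 0;
}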

kostrzewa commented 11 years ago

Note that this still happens when the old (g_proc_id based) start.c behaviour has been restored and with the corrected step computation. (tested with 32 and 128 processes)

It seems like there might be something fishy in the clover implementation?

kostrzewa commented 11 years ago

If I remember correctly something was done in the deri exchange? I don't really understand how the edges are computed explicitly (I understand what they are, but not how exactly they are computed in the code)... Would you mind taking a look at whether something is the matter there?

kostrzewa commented 11 years ago

In particular I'm a bit worried about the comment "send to neigbour to the right is not needed": is this still true with the clover term?
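
For what it's worth, here is the generic shape of a derivative halo accumulation in one direction (a sketch only, not the actual xchange_deri): contributions a rank accumulated in its ghost slices are sent back to the owning rank and added onto the owner's values. A comment like "send to neighbour to the right is not needed" can only hold if no term ever writes into that ghost slice; if the clover force does, both directions below are required.

#include <mpi.h>

/* Accumulate ghost-slice contributions back onto the owners, in both
 * directions of one lattice dimension.  Field layout, slice packing and
 * neighbour ranks are assumed to be prepared by the caller. */
void accumulate_deri_halo(double *interior_lo, double *interior_hi,
                          const double *ghost_lo, const double *ghost_hi,
                          double *recvbuf, int slice_len,
                          int rank_lo, int rank_hi, MPI_Comm comm) {
  /* my low ghost slice belongs to the lower neighbour; in exchange I
   * receive what the upper neighbour accumulated for my high slice */
  MPI_Sendrecv(ghost_lo, slice_len, MPI_DOUBLE, rank_lo, 0,
               recvbuf,  slice_len, MPI_DOUBLE, rank_hi, 0,
               comm, MPI_STATUS_IGNORE);
  for (int i = 0; i < slice_len; ++i) interior_hi[i] += recvbuf[i];

  /* and the same in the opposite direction */
  MPI_Sendrecv(ghost_hi, slice_len, MPI_DOUBLE, rank_hi, 1,
               recvbuf,  slice_len, MPI_DOUBLE, rank_lo, 1,
               comm, MPI_STATUS_IGNORE);
  for (int i = 0; i < slice_len; ++i) interior_lo[i] += recvbuf[i];
}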

kostrzewa commented 11 years ago

Also, in the repro=1 mpihmc8 run (Wilson, Nf=8, clover) I still see an effect, whereas in the run without the clover term I see no effect and perfect agreement so far.

kostrzewa commented 11 years ago

Hmm... now the trlog monomials are created with repro=0, but they never call random numbers, correct? It would be prudent to set the parameter anyway.

kostrzewa commented 11 years ago

or make that 'consistent' rather than prudent

urbach commented 11 years ago

okay, so there might still be something in the clover term, but that should be findable now.

I thought about the xchange_deri, and there is the exchange in the positive and the negative direction. Just in the _INDEX_INDEP_GEOM part I didn't change it, because I don't know what's going on. I think I wrote this also in some emails...

urbach commented 11 years ago

yes, trlog should be set consistently! Also for the GAUGE monomial... I'll just do that

kostrzewa commented 11 years ago

I thought about the xchange_deri, and there is the exchange in the positive and the negative direction. Just in the _INDEX_INDEP_GEOM part I didn't change it, because I don't know what's going on. I think I wrote this also in some emails...

Yes I remember, but as far as I can tell the comment itself is still there, even in the "NON-_INDEX_INDEP_GEOM" part. By glancing at the code I see that the negative exchange is implemented though.

urbach commented 11 years ago

I'll also change the default to repro=1 for the moment.

urbach commented 11 years ago

Yes I remember, but as far as I can tell the comment itself is still there, even in the "NON-_INDEX_INDEP_GEOM" part. By glancing at the code I see that the negative exchange is implemented though.

not in the non-_INDEX_INDEP_GEOM part, is it? But I could remove it there too and make an explicit statement that it needs to be changed...

urbach commented 11 years ago

I'll just extend my little tests for clover to understand this better...

urbach commented 11 years ago

sample-hmc-cloverdet.input, 4^4:

serial 00000001 0.312009372442 -0.679240156719 1.972378e+00 99 1526 1 1.509166e+00
1dim   00000001 0.312009372442 -0.679240156710 1.972378e+00 99 1526 1 7.729852e-01
2dim   00000001 0.312009372442 -0.679240156711 1.972378e+00 99 1526 1 7.999549e-01
3dim   00000001 0.312009372442 -0.679240156712 1.972378e+00 99 1526 1 2.005243e+00
4dim   00000001 0.312009372442 -0.679240156713 1.972378e+00 99 1526 1 2.600710e+00

urbach commented 11 years ago

6^4 (2 processes per direction):

serial 00000001 0.307784617489 -4.620146754281 1.015089e+02 113 1826 1 7.874935e+00
1dim   00000001 0.307784617489 -4.620146754292 1.015089e+02 113 1826 1 4.642096e+00
2dim   00000001 0.307784617489 -4.620146754363 1.015089e+02 113 1826 1 3.312812e+00
3dim   00000001 0.307784617489 -4.620146754252 1.015089e+02 113 1826 1 4.479035e+00

6^4 (3 processes in time, 2 in other directions)

serial 00000001 0.307784617489 -4.620146754281 1.015089e+02 113 1826 1 8.133085e+00
1dim   00000001 0.307784617489 -4.620146754287 1.015089e+02 113 1826 1 3.515717e+00
2dim   00000001 0.307784617489 -4.620146754210 1.015089e+02 113 1826 1 5.114528e+00
3dim   00000001 0.307784617489 -4.620146754402 1.015089e+02 113 1826 1 5.348816e+00

urbach commented 11 years ago

8^4 (2 processes per direction):

serial 00000001 0.306241343550 -13.919685134359 1.109794e+06 115 2041 1 2.937812e+01
1dim   00000001 0.306241343550 -13.919685134395 1.109794e+06 115 2041 1 1.715056e+01
2dim   00000001 0.306241343550 -13.919685134304 1.109794e+06 115 2041 1 1.143113e+01
3dim   00000001 0.306241343550 -13.919685134522 1.109794e+06 115 2041 1 1.543494e+01
4dim   00000001 0.306241343550 -13.919685134282 1.109794e+06 115 2041 1 1.599460e+01

urbach commented 11 years ago

I'll also try to test your set-up...

urbach commented 11 years ago

okay, indeed I can reproduce the problem: here again 8^4, but with 4 MPI processes in the time direction, so the 4dim one runs with 32 processes:

serial 00000001 0.306241343550 -13.919685134359 1.109794e+06 115 2041 1 3.066790e+01
1dim   00000001 0.306241343550 -13.919685134395 1.109794e+06 115 2041 1 1.094335e+01
2dim   00000001 0.306241343550 -13.919685134224 1.109794e+06 115 2041 1 1.456445e+01
3dim   00000001 0.306241343550 -13.919685134257 1.109794e+06 115 2041 1 1.722603e+01
4dim   00000001 0.126440482773 41.572861262986 8.813243e-19 115 2039 0 2.116307e+01

urbach commented 11 years ago

this is likely to be something related to the molecular dynamics...

kostrzewa commented 11 years ago

Yes, and I get this failure both with and without the clover term.

urbach commented 11 years ago

Aha, I don't get it without clover... Let me re-check.

urbach commented 11 years ago

with 32 processes I don't see it without the clover term.

urbach commented 11 years ago

actually, setting CSW = 0 makes the problem disappear...

kostrzewa commented 11 years ago

Hmm, that's interesting. Are you using 8 fermions?

kostrzewa commented 11 years ago

Okay, I can't reproduce it anymore without the clover term. What might have happened is that I accidentally ran a clover input file even though I wanted to run a standard one. (I was using my terminal history to run these in quick succession and I might simply have misread the command line in the moment.)

In any case, I agree with you that setting csw=0 in all monomials makes the problem go away. Setting it in all but one monomial delays the problem to the 7th or 8th trajectory.

kostrzewa commented 11 years ago

Okay, I'm a bit stumped. I also checked xchange_deri and everything seems correct (memory locations, slices, edges, communicator IDs, ordering, etc.).

What is your hunch about the MD?

urbach commented 11 years ago

okay, good that we agree on only clover now...

I see the problem with one CLOVERDET monomial. Immediately after one trajectory one does not get the same results anymore.

I am trying to test by writing e.g. the deri field to stdout. That is quite difficult, as the order of stdout is not preserved. But it seems that after the first xchange_deri, certain elements of the derivative show a difference much larger than expected from rounding. I just don't understand yet whether this is a problem related to an exchange routine, to the derivatives, or maybe to something else. In principle we need to find the point in the code where the difference shows up for the first time. As I said, this is ugly, because the order in stdout is not preserved. Otherwise, with repro=1 this is now really easy to check.

Any idea on how to automatise the test, i.e. how to use stdout?
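
One possible way to sidestep the stdout ordering (a sketch, not existing code; global_t/global_x/global_y/global_z and deri_component are hypothetical accessors standing in for whatever the code provides): let every rank dump its part of the deri field into its own file, keyed by global coordinates, so that after "cat deri_dump.* | sort > runA.txt" for each run a plain diff compares the two parallelisations independently of output order.

#include <mpi.h>
#include <stdio.h>

/* hypothetical accessors: global coordinates of local site ix and the
 * real/imaginary parts of the derivative in direction mu */
extern int global_t(int ix), global_x(int ix), global_y(int ix), global_z(int ix);
extern double deri_component(int ix, int mu, int reim);

void dump_deri(int local_volume) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char fname[64];
  snprintf(fname, sizeof(fname), "deri_dump.%04d", rank);
  FILE *fp = fopen(fname, "w");
  if (fp == NULL) return;
  for (int ix = 0; ix < local_volume; ++ix)
    for (int mu = 0; mu < 4; ++mu)
      fprintf(fp, "%02d %02d %02d %02d mu=%d % .16e % .16e\n",
              global_t(ix), global_x(ix), global_y(ix), global_z(ix), mu,
              deri_component(ix, mu, 0), deri_component(ix, mu, 1));
  fclose(fp);
}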

urbach commented 11 years ago

here is the pattern of a diff of two log files, one from a run with 16 and one from a run with 32 MPI processes on 8^4. The first set t x y z is the global coordinate, the second x y z is the MPI-process-local coordinate of the point. In the latter t is missing, because it changes with the parallelisation and then the diff gives a lot of lines that differ only in the local t. mu is the direction.

The lines that differ always have y=3 and either z=0 or z=3. For y=3, z=3 there are always two differing lines, for z=0 only one. For z=0 it's always mu=2, for z=3 it's mu=2 and 3.

you see the difference is in the second significant digit.

id  t  x y z, x y z: mu df.d1                df.d2
99c99
<  24 0 0 3 0, 0 3 0: 2 4.474621e-02 1.217123e+00
---
>  24 0 0 3 0, 0 3 0: 2 -1.059355e-01 1.141170e+00
111,112c111,112
<  27 0 0 3 3, 0 3 3: 2 2.595024e+00 1.394392e+00
<  27 0 0 3 3, 0 3 3: 3 -3.258106e+00 3.484070e-01
---
>  27 0 0 3 3, 0 3 3: 2 2.688018e+00 1.465742e+00
>  27 0 0 3 3, 0 3 3: 3 -3.351100e+00 2.770562e-01
[...]
urbach commented 11 years ago

it's not sw_deriv alone and it's not sw_spinor alone...

urbach commented 11 years ago

it's caused by sw_all; however, it could also be something with the indexing in sw_deriv or sw_spinor

urbach commented 11 years ago

swm and swp appear to be identical, therefore I think it is not in sw_deriv or sw_spinor.

So, it's either in xchange_deri or in sw_all.

urbach commented 11 years ago

so, I have checked every vv1 and vv2 in sw_all (not every element of these, though), and they all appear to be fine. If that is correct, then it must be a bug in xchange_deri. But now I'm going home...

urbach commented 11 years ago

using 4x2x2x2 or 2x4x2x2 parallelisation gives exactly the same results

kostrzewa commented 11 years ago

Yes, I can confirm that. (see issue description :D )

urbach commented 11 years ago

whereas using 2x2x4x2 or 2x2x2x4 for the parallelisation there are more differences, e.g. for 2x2x4x2 (same pattern for 2x2x2x4):

<  8 0 0 1 0, 0 0 1 0: 2 3.091473e+00 -1.956364e+00 de
<  8 0 0 1 0, 0 0 1 0: 3 7.349473e-01 -6.966944e-01 de
---
>  8 0 0 1 0, 0 0 1 0: 2 2.704814e+00 -2.029857e+00 de
>  8 0 0 1 0, 0 0 1 0: 3 5.289028e-01 -1.229760e+00 de

and

<  11 0 0 1 3, 0 0 1 3: 2 -7.302970e-01 1.385387e+00 de
<  11 0 0 1 3, 0 0 1 3: 3 4.259897e-01 4.138427e-01 de
---
>  11 0 0 1 3, 0 0 1 3: 2 -4.479089e-01 1.493345e+00 de
>  11 0 0 1 3, 0 0 1 3: 3 5.302609e-01 3.793770e-01 de

So, which boundaries contribute to the derivatives at y=LY-1 with z=0 or z=LZ-1 for directions y and z??
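
As a toy way to see which halo regions a boundary site can touch (purely illustrative; the assumption that the clover force reaches one spacing in +-y and +-z, including the diagonal (y,z) combinations, is mine and not read off the code): a site at y=LY-1 with z=0 or z=LZ-1 would then need not only the y-face and z-face halos but also the (y,z) edge, which only comes into play once both y and z are parallelised.

#include <stdio.h>

int main(void) {
  /* local extents of 8^4 on a 2x2x4x2 grid: LY = 2, LZ = 4 */
  const int LY = 2, LZ = 4;
  for (int y = 0; y < LY; ++y)
    for (int z = 0; z < LZ; ++z) {
      const int y_face = (y == 0 || y == LY - 1);  /* stencil crosses a y boundary */
      const int z_face = (z == 0 || z == LZ - 1);  /* stencil crosses a z boundary */
      printf("local (y=%d,z=%d): y-face %d  z-face %d  yz-edge %d\n",
             y, z, y_face, z_face, y_face && z_face);
    }
  return 0;
}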

urbach commented 11 years ago

using a 3dim parallelisation with 4x4x2x1 or 4x2x4x1 or 2x4x4x1 the problem is absent...

kostrzewa commented 11 years ago

Hmm... also, increasing the volume to 16x8^3 makes 4x2x2x2 and 2x4x2x2 work, while 2x2x4x2 and 2x2x2x4 fail.

kostrzewa commented 11 years ago

Actually, the acceptance simply breaks down later... weird history though:

4x2x2x2

00000004 0.447159311749 -4.656544314203 1.052717e+02 201 3013 1 1.315186e+01
00000005 0.470472611966 -3.395547026535 2.983097e+01 218 3499 1 1.443504e+01
00000006 0.492588631035 -0.913751213928 2.493659e+00 278 4105 1 1.649472e+01
00000007 0.492588631035 213.709623433155 1.538473e-93 320 5072 0 1.964939e+01
00000008 0.492588631035 289.166673507512 2.609212e-126 322 4999 0 2.001714e+01
00000009 0.492588631035 186.485291875084 1.024393e-81 312 5049 0 1.987846e+01
[...]
kostrzewa commented 11 years ago

2x4x2x2

00000004 0.447159311755 -4.656544336904 1.052717e+02 201 3013 1 1.391596e+01
00000005 0.470472611964 -3.395546895452 2.983096e+01 218 3499 1 1.505691e+01
00000006 0.492588631025 -0.913750856300 2.493658e+00 278 4105 1 1.705235e+01
00000007 0.492588631025 213.709629319608 1.538464e-93 320 5071 0 2.471472e+01
urbach commented 11 years ago

acceptance at the beginning is not a good marker. Often dH is negative with large magnitude at the beginning. One should always compare different runs against each other.
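
A small helper along those lines (a sketch; the column layout of output.data is inferred from the excerpts above as trajectory, plaquette, dH, ..., so the indices may need adjusting): walk two output.data files in parallel and report the first trajectory where plaquette or dH drift apart beyond a tolerance.

#include <math.h>
#include <stdio.h>

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s runA/output.data runB/output.data\n", argv[0]);
    return 1;
  }
  FILE *fa = fopen(argv[1], "r"), *fb = fopen(argv[2], "r");
  if (fa == NULL || fb == NULL) { perror("fopen"); return 1; }
  char la[1024], lb[1024];
  const double tol = 1e-8;   /* rough tolerance, to be tuned */
  while (fgets(la, sizeof la, fa) && fgets(lb, sizeof lb, fb)) {
    long ta, tb; double pa, pb, dha, dhb;
    if (sscanf(la, "%ld %lf %lf", &ta, &pa, &dha) != 3) continue;
    if (sscanf(lb, "%ld %lf %lf", &tb, &pb, &dhb) != 3) continue;
    if (fabs(pa - pb) > tol || fabs(dha - dhb) > tol) {
      printf("first deviation at trajectory %ld: plaquette %.12e vs %.12e, dH %.6e vs %.6e\n",
             ta, pa, pb, dha, dhb);
      return 0;
    }
  }
  printf("runs agree within %.1e\n", tol);
  return 0;
}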

kostrzewa commented 11 years ago

Yes, I understand that; it's just a bit weird that in certain situations the very first trajectory fails to be accepted, while in others several trajectories work and then it breaks down.

urbach commented 11 years ago

did you adjust the step-size? And what does the serial run look like?

kostrzewa commented 11 years ago

no, that was next on my to-do list though

kostrzewa commented 11 years ago

Hmm no, doubling the step number just delays the issue.