Closed — martin-ueding closed this issue 7 years ago.
I have tried the reconstruction process and checked whether it preserves the global phase. It does not. You can read more in the Mathematica notebook: Gauge_Compression_Twisted.pdf
There is a need for a global phase factor in the interface somewhere. The generated kernels have to apply the complex phase multiplication in the reconstruction process which happens on-the-fly. Most likely there is no fused-multiply-add that I could hijack, so this will require more floating point operations. In order to retain performance for simple boundary conditions, one must generate another set of kernels that do the extra three multiplications in the reconstruction process.
Alternatively we could say that our arithmetic intensity is low anyway, so three additional multiplications with 1.0 in the simple boundary case is not a problem. In that case, I would just change the existing kernels.
@bjoo, @kostrzewa, how do you feel about wasting three flops vs. generating more kernels?
On the non-kernel side, I think that the global phase should not be in the `Geometry` object. It is something that only concerns the gauge field, and one might want to have multiple gauge fields with different boundary conditions for tests. The functions don't use a RAII container for the gauge field; using the relatively new `GaugeHandle` class and a `double phase[2]` in there would seem a more appropriate choice. Adding this phase as additional parameters to all the functions seems like a mess, as it would lose the cohesion between the gauge field and the phase.
This would only be used by tmLQCD in the beginning, so one could just redefine all the interfaces with an enhanced `GaugeHandle` instead of C-arrays of `Geometry::SU3MatrixBlock`. If the twisted boundary conditions were to be supported in Chroma, that would be a good opportunity to remove the manual allocation/free of the QPhiX gauge field and use a RAII container instead.
This is the kernel generator code that loads the gauge field:

```cpp
void loadGaugeDir(InstVector &ivector, int dir, bool compress12)
{
    string mask;
    PrefetchL1FullGaugeDirIn(ivector, gBase, gOffs, dir, compress12);
    LoadFullGaugeDir(ivector, u_gauge, gBase, gOffs, dir, compress12);

    if (compress12) {
        // Reconstruct the third row from the first two: conj(row0 x row1).
        for (int c = 0; c < 3; c++) {
            Conj_CrossProd(ivector,
                           u_gauge[2][c],
                           u_gauge[0][(c + 1) % 3],
                           u_gauge[1][(c + 2) % 3],
                           u_gauge[0][(c + 2) % 3],
                           u_gauge[1][(c + 1) % 3],
                           mask);
        }
    }
}
```
This cross product is implemented here:

```cpp
// r = conj(s1*s2 - s3*s4)
// r[RE] = (s1[RE]*s2[RE]) - (s1[IM]*s2[IM]) - (s3[RE]*s4[RE]) + (s3[IM]*s4[IM])
// r[IM] = (s3[RE]*s4[IM]) + (s3[IM]*s4[RE]) - (s1[RE]*s2[IM]) - (s1[IM]*s2[RE])
void Conj_CrossProd(InstVector &ivector,
                    FVec *r,
                    FVec *s1,
                    FVec *s2,
                    FVec *s3,
                    FVec *s4,
                    string &mask)
{
    mulFVec(ivector, r[RE], s1[RE], s2[RE], mask);
    fnmaddFVec(ivector, r[RE], s1[IM], s2[IM], r[RE], mask);
    fnmaddFVec(ivector, r[RE], s3[RE], s4[RE], r[RE], mask);
    fmaddFVec(ivector, r[RE], s3[IM], s4[IM], r[RE], mask);

    mulFVec(ivector, r[IM], s3[RE], s4[IM], mask);
    fmaddFVec(ivector, r[IM], s3[IM], s4[RE], r[IM], mask);
    fnmaddFVec(ivector, r[IM], s1[RE], s2[IM], r[IM], mask);
    fnmaddFVec(ivector, r[IM], s1[IM], s2[RE], r[IM], mask);
}
```
The introduction of this phase factor looks pretty straightforward in `loadGaugeDir`. I think `Conj_CrossProd` should not do this multiplication itself, because its name suggests a single responsibility.
There should be two `loadBroadcastScalar` calls that load the real and imaginary parts of the phase into vector registers. Then one could use `mulFVec` and `fmaddFVec` for the complex multiplication. We have to do a complex multiplication:

e + if = (a + ib)(c + id)
e = ac - bd
f = ad + bc
Using the generator's FMA-style functions, that will probably be just this (`t` is a temporary):

```
t = a * c      // mul
e = t - b * d  // fnma
t = a * d      // mul
f = b * c + t  // fma
```
All in all this looks like something I could do. We just have to decide on the details.
Since we are dealing with (up to) four complex multiplications, one should probably generate multiple kernels.
The four complex phases should probably not be tracked as part of the gauge field, but rather as part of the Dirac operator, much like the anisotropy and boundary condition parameters, since the same gauge field should be usable by multiple Dirac operators without having to instantiate multiple gauge fields.
> Since we are dealing with four complex multiplications, one should probably generate multiple kernels.
Okay, that sounds like a good idea.
However, we are at 433 MB of kernel code at the moment. It will not exactly double, because only the Dslash and achimbdspi kernels have to be duplicated. The face packers only work with the spinors, so there should be no changes there. I would guess that we end up with around 600 MB of kernel code?
Do we want to consider generating the kernels during the QPhiX compilation before doing this? Or do we want to just continue checking in the kernels?
> but rather as part of the Dirac operator, much like the anisotropy and boundary condition parameters
I like this. This means that the operators just get some more member variables that can be passed to the kernel in the innermost routine.
I have the feeling that it would make sense to complete issue #40 first. That would reduce the points where I have to pass the complex phase to the kernel from 16 (4 kernel families × 2 operations × 2 plus/minus) to 8.
Hi all, indeed the current boundary condition params exist exactly because of the inability to do SU(3) reconstruction in the presence of a phase. Right now I can't remember exactly whether they are boolean or int, but I believe they get turned into doubles at some point. There are currently two factors, spatial and temporal, if memory serves correctly. We could generalize this to four factors (X, Y, Z, T) at the risk of changing the generated code and slightly extending the interfaces. Or, do you need these to be per-site?
Best, B
The twisted boundary conditions multiply the gauge field at every spacetime point with a complex phase like \exp(\mathrm{i}\pi / T), such that it accumulates a sign change once one goes through the whole time extent T.
I currently think that the best place in the kernel to add the twisted boundary conditions would be the gauge-spinor multiplication routine:

```cpp
void matMultVec(InstVector &ivector, bool adjMul, int s)
{
    string mask;
    for (int c1 = 0; c1 < 3; c1++) {
        if (!adjMul) {
            mulCVec(ivector, ub_spinor[s][c1], u_gauge[0][c1],
                    b_spinor[s][0], mask);
            fmaddCVec(ivector, ub_spinor[s][c1], u_gauge[1][c1],
                      b_spinor[s][1], ub_spinor[s][c1], mask);
            fmaddCVec(ivector, ub_spinor[s][c1], u_gauge[2][c1],
                      b_spinor[s][2], ub_spinor[s][c1], mask);
        } else {
            mulConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][0],
                        b_spinor[s][0], mask);
            fmaddConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][1],
                          b_spinor[s][1], ub_spinor[s][c1], mask);
            fmaddConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][2],
                          b_spinor[s][2], ub_spinor[s][c1], mask);
        }
    }
}
```
After the matrix multiplication, one could just do a complex multiplication with the phase and that would be it.
This issue is continued in #60.
For tmLQCD we want to have twisted boundary conditions. This will be a global phase that is multiplied to the gauge field at every lattice site.
The 12-parameter gauge compression is likely to remove the global phase from the gauge field. Therefore changes to the kernels seem to be needed as well.
The phase would have to be stored somewhere, perhaps in the `Geometry` class or in the `GaugeHandle`.