Closed — martin-ueding closed this issue 7 years ago.
I have tried the reconstruction process and checked whether it preserves the global phase. It does not. You can read more in the Mathematica notebook: Gauge_Compression_Twisted.pdf
There is a need for a global phase factor in the interface somewhere. The generated kernels have to apply the complex phase multiplication in the reconstruction process which happens on-the-fly. Most likely there is no fused-multiply-add that I could hijack, so this will require more floating point operations. In order to retain performance for simple boundary conditions, one must generate another set of kernels that do the extra three multiplications in the reconstruction process.
Alternatively we could say that our arithmetic intensity is low anyway, so three additional multiplications with 1.0 in the simple boundary case is not a problem. In that case, I would just change the existing kernels.
@bjoo, @kostrzewa, how do you feel about wasting three flops vs. generating more kernels?
On the non-kernel side, I think that the global phase should not be in the `Geometry` object. It is something that only concerns the gauge field, and one might want to have multiple gauge fields with different boundary conditions for tests. The functions don't use a RAII container for the gauge field; using the relatively new `GaugeHandle` class and a `double phase[2]` in there would seem a more appropriate choice. Adding this phase as additional parameters to all the functions seems like a mess, as it would lose the cohesion between the gauge field and the phase.
This would only be used by tmLQCD in the beginning, so one could just redefine all the interfaces with an enhanced `GaugeHandle` instead of C-arrays of `Geometry::SU3MatrixBlock`. If the twisted boundary conditions were to be supported in Chroma, that would be a good opportunity to remove the manual allocation/free of the QPhiX gauge field and use a RAII container instead.
This is the kernel generator code that loads the gauge field:

```cpp
void loadGaugeDir(InstVector &ivector, int dir, bool compress12)
{
    string mask;
    PrefetchL1FullGaugeDirIn(ivector, gBase, gOffs, dir, compress12);
    LoadFullGaugeDir(ivector, u_gauge, gBase, gOffs, dir, compress12);

    if (compress12) {
        // Reconstruct the third row from the first two: conj(row0 x row1).
        for (int c = 0; c < 3; c++) {
            Conj_CrossProd(ivector,
                           u_gauge[2][c],
                           u_gauge[0][(c + 1) % 3],
                           u_gauge[1][(c + 2) % 3],
                           u_gauge[0][(c + 2) % 3],
                           u_gauge[1][(c + 1) % 3],
                           mask);
        }
    }
}
```
This cross product is implemented here:

```cpp
// r = conj(s1*s2 - s3*s4)
// r[RE] = (s1[RE]*s2[RE]) - (s1[IM]*s2[IM]) - (s3[RE]*s4[RE]) + (s3[IM]*s4[IM])
// r[IM] = (s3[RE]*s4[IM]) + (s3[IM]*s4[RE]) - (s1[RE]*s2[IM]) - (s1[IM]*s2[RE])
void Conj_CrossProd(InstVector &ivector,
                    FVec *r,
                    FVec *s1,
                    FVec *s2,
                    FVec *s3,
                    FVec *s4,
                    string &mask)
{
    mulFVec(ivector, r[RE], s1[RE], s2[RE], mask);
    fnmaddFVec(ivector, r[RE], s1[IM], s2[IM], r[RE], mask);
    fnmaddFVec(ivector, r[RE], s3[RE], s4[RE], r[RE], mask);
    fmaddFVec(ivector, r[RE], s3[IM], s4[IM], r[RE], mask);

    mulFVec(ivector, r[IM], s3[RE], s4[IM], mask);
    fmaddFVec(ivector, r[IM], s3[IM], s4[RE], r[IM], mask);
    fnmaddFVec(ivector, r[IM], s1[RE], s2[IM], r[IM], mask);
    fnmaddFVec(ivector, r[IM], s1[IM], s2[RE], r[IM], mask);
}
```
The introduction of this phase factor looks pretty straightforward in `loadGaugeDir`. I think `Conj_CrossProd` should not do this multiplication itself, because its name suggests a single responsibility.
There should be two `loadBroadcastScalar` calls that load the real and imaginary parts of the phase into vector registers. Then one could use `mulFVec` and `fmaddFVec` for the complex multiplication. We have to do a complex multiplication:

e + if = (a + ib)(c + id)
e = ac - bd
f = ad + bc
Using the generator's FMA-style functions, that will probably be just this (`t` is a temporary):

```
t = a * c      // mul
e = t - b * d  // fnma
t = a * d      // mul
f = b * c + t  // fma
```
All in all this looks like something I could do. We just have to decide on the details.
Since we are dealing with (up to) four complex multiplications, one should probably generate multiple kernels.
The four complex phases should probably not be tracked as part of the gauge field, but rather as part of the Dirac operator, much like the anisotropy and boundary condition parameters, since the same gauge field should be usable by multiple Dirac operators without having to instantiate multiple gauge fields.
> Since we are dealing with four complex multiplications, one should probably generate multiple kernels.
Okay, that sounds like a good idea.
However, we are at 433 MB of kernel code at the moment. It will not exactly double, because only the Dslash and achimbdspi kernels have to be duplicated. The face packers only work with the spinors, so there should be no changes there. I would guess that we end up with around 600 MB of kernel code?
Do we want to consider generating the kernels during the QPhiX compilation before doing this? Or do we want to just continue checking in the kernels?
> but rather as part of the Dirac operator, much like the anisotropy and boundary condition parameters
I like this. This means that the operators just get some more member variables that can be passed to the kernel in the innermost routine.
I have the feeling that it would make sense to complete issue #40 first. That would reduce the points where I have to pass the complex phase to the kernel from 16 (4 kernel families × 2 operations × 2 plus/minus) to 8.
Hi all, indeed the current boundary condition params exist exactly because of the inability to do SU(3) reconstruction in the presence of a phase. Right now I can't remember exactly whether they are boolean or int, but I believe they get turned into doubles at some point. There are currently two factors, spatial and temporal, if memory serves correctly. We could generalize this to four factors (X, Y, Z, T) at the risk of changing the generated code and slightly extending the interfaces. Or, do you need these to be per-site?
Best, B
The twisted boundary conditions multiply the gauge field at every spacetime point with a complex phase like \exp(\mathrm{i}\pi / T), such that it accumulates a sign change once one goes through the whole time extent T.
I currently think that the best place in the kernel to add the twisted boundary conditions would be the gauge-spinor multiplication routine:

```cpp
void matMultVec(InstVector &ivector, bool adjMul, int s)
{
    string mask;
    for (int c1 = 0; c1 < 3; c1++) {
        if (!adjMul) {
            mulCVec(ivector, ub_spinor[s][c1], u_gauge[0][c1],
                    b_spinor[s][0], mask);
            fmaddCVec(ivector, ub_spinor[s][c1], u_gauge[1][c1],
                      b_spinor[s][1], ub_spinor[s][c1], mask);
            fmaddCVec(ivector, ub_spinor[s][c1], u_gauge[2][c1],
                      b_spinor[s][2], ub_spinor[s][c1], mask);
        } else {
            mulConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][0],
                        b_spinor[s][0], mask);
            fmaddConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][1],
                          b_spinor[s][1], ub_spinor[s][c1], mask);
            fmaddConjCVec(ivector, ub_spinor[s][c1], u_gauge[c1][2],
                          b_spinor[s][2], ub_spinor[s][c1], mask);
        }
    }
}
```
After the matrix multiplication, one could just do a complex multiplication with the phase and that would be it.
This issue is continued in #60.
For tmLQCD we want to have twisted boundary conditions. This will be a global phase that is multiplied to the gauge field at every lattice site.
The 12-parameter gauge compression is likely to remove the global phase from the gauge field. Therefore changes to the kernels seem to be needed as well.
The phase would have to be stored somewhere, perhaps in the `Geometry` class or in the `GaugeHandle`.