JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/
Other
13 stars 11 forks source link

SoA length as an absolute compile time value seems strange #52

Closed martin-ueding closed 7 years ago

martin-ueding commented 7 years ago

In the code generator, there seem to be only two cases implemented: SOALEN == VECLEN und SOALEN == VECLEN / 2. In other words, we can only have nGY == 1 and nGY == 2.

When compiling QPhiX, the user has to pass the SoA length to configure. This is something that I do not particularly understand. Why is the SoA length fixed and not nGY? The consequence is that if you want to use vector folding, i.e. nGY == 2, you would use SOALEN == VECLEN / 2. The strange thing is that the vector length depends on the float type, but SoA length does not. This means that if you select nGY == 1 in double, you will get nGY == 2 in float.

In the test and timing codes, there are mostly lines like the following:

runTest<float, VECLEN_SP, QPHIX_SOALEN, true>(lattSize, qmp_geom);

The VECLEN_SP will be some particular value on the current ISA. the QPHIX_SOALEN was chosen by the user during compile time of the tests. If I have chosen a SoA length such that I get nGY == 2 for double, I will get nGY == 4 in float. But that is not implemented!

That is, in my twisted-bc branch it is not implemented. The previous kernel code was such that the general template was not just a declaration but a function definition with some masterPrintf saying that this combination of vector length and SoA lenght is not implemented. I found this bad because it will give you a runtime error when you try to use this eventually but lets all your code compile, even though it uses not-implemented kernels. I have changed those to function declarations only, giving linker errors when one tries to use those things.

On an AVX2 machine I have now the following linker error:

undefined reference to `void QPhiX::dslash_plus_vec<float, 8, 2, true, true, true, true, true>

The last four bool template arguments are from the twisted boundary conditions. The usual four are these now: float, 8, 2, true. This means nGY == 4 which is not implemented. Therefore I get the linker error.

This is only caused because in the tests one uses the same QPHIX_SOALEN for half, single, and double precision. The quick fix would be just do add a preprocessor statement around this and only call that if it will call an implemented kernel.

#if VECLEN_SP == QPHIX_SOALEN || VECLEN_SP == 2 * QPHIX_SOALEN
runTest<float, VECLEN_SP, QPHIX_SOALEN, true>(lattSize, qmp_geom);
#endif

But why is SoA length a compile time constant of the test code anyway? The other template parameter, namely veclen and compress12 are not needed at compile time. For the gauge compression, it just instantiates both during compile time and then picks one at run time:

    if (compress12) {
      runTest<float, VECLEN_SP, QPHIX_SOALEN, true>(lattSize, qmp_geom);
    } else {
      runTest<float, VECLEN_SP, QPHIX_SOALEN, false>(lattSize, qmp_geom);
    }

So why is SoA length a compile time parameter for the tests? I think it would be easier if that was a command line parameter, perhaps even parameterized as nGY like so:

    if (ngy == 1) {
      if (compress12) {
        runTest<float, VECLEN_SP, QPHIX_SOALEN, true>(lattSize, qmp_geom);
      } else {
        runTest<float, VECLEN_SP, QPHIX_SOALEN, false>(lattSize, qmp_geom);
      }
    } else {
      if (compress12) {
        runTest<float, VECLEN_SP, VECLEN_SP / 2, true>(lattSize, qmp_geom);
      } else {
        runTest<float, VECLEN_SP, VECLEN_SP / 2, false>(lattSize, qmp_geom);
      }
    }

This would have the added benefit that one could quickly compare various SoA lengths against each other without having to configure with different parameters again.

With my changes to the build process, all kernels get compiled anyway. So there is no saving in compilation time by just compiling for one particular SoA length. Would you agree that making SoA length a run time option would be a good idea?

kostrzewa commented 7 years ago

In the code generator, there seem to be only two cases implemented: SOALEN == VECLEN und SOALEN == VECLEN / 2. In other words, we can only have nGY == 1 and nGY == 2.

Where do you see this? You can set SOALEN==2 and should get nGY==4, no?

martin-ueding commented 7 years ago

One instance would be the old Makefile for the code generator:

avx2:
    mkdir -p ./avx2
    @make clean && make mode=avx PRECISION=2 SOALEN=2 AVX2=1 && ./codegen
    @make clean && make mode=avx PRECISION=2 SOALEN=4 AVX2=1 && ./codegen
    @make clean && make mode=avx PRECISION=1 SOALEN=8 AVX2=1 && ./codegen
    @make clean && make mode=avx PRECISION=1 SOALEN=4 AVX2=1 && ./codegen
    @make clean && make mode=avx PRECISION=1 SOALEN=8 AVX2=1 ENABLE_LOW_PRECISION=1 && ./codegen
@make clean && make mode=avx PRECISION=1 SOALEN=4 AVX2=1 ENABLE_LOW_PRECISION=1 && ./codegen

That is always nGY <= 2.

Or look at snippets from the AVX2 (256 bit) code in the generator for float (veclen is 8):

        if (soalen == 8) {
          buf << v.getName()
              << " =  _mm256_blend_ps(_mm256_loadu_ps("
              << a1->serialize() << "), _mm256_broadcast_ss("
              << a2->serialize() << "), " << (1 << (soalen - 1))
              << ");" << endl;
        } else {
          buf << v.getName() << " =  _mm256_insertf128_ps("
              << v.getName() << ", _mm_blend_ps(_mm_loadu_ps("
              << a1->serialize() << "), _mm_broadcast_ss("
              << a2->serialize() << "), " << (1 << (soalen - 1))
              << "), " << soanum << ");" << endl;
        }

Either you have SoA length 8 (nGY is 1) or you use SoA length 4 (nGY is 2). But there are no further alternatives.

So I conclude that more than nGY = 2 is not supported at all on AVX2.

For AVX512, there is nGY = 4, indeed:

void transpose(InstVector &ivector,
               const FVec r[],
               const FVec f[],
               int soalen)
{
  switch (soalen) {
  case 4:
    transpose4x4(ivector, r, f);
    break;

  case 8:
    transpose2x2(ivector, r, f);
    break;

  case 16:
    transpose1x1(ivector, r, f);
    break;

  default:
    printf("SOALEN = %d Not Supported (only SOALEN = 4, 8 & 16 are "
           "supported)\n",
           soalen);
    exit(1);
  }
}

So it is even more complex: Depending on the architecture, not every SoA length is defined. At least nGY = 2 seems to be defined most of the time (except in the scalar code).

kostrzewa commented 7 years ago

As far as the Makefile is concerned, I agree, but technically one could implement any soalen performance-wise. soalen = veclen/2 is, as far as I understand, the only sensible minimum because it can be implemented with two instrinsics and the resulting restriction on the lattice is not too strict to be impractical. This might change when vector lengths (possibly) grow to 32 or 64 floats. Then it might still make sense to load no more than four or eight floats at a time into these long vectors with multiple chained instructions, but then carry out all multiplications with full vectors.

For vector length 8, if you look at loading instructions like LoadSplitSOAFVec, you see the the case soalen == veclen is treated as special case where the load is carried out in a single instruction (see the loop which calls this instruction). If on the other hand soalen is any other divisor of veclen, you get multiple instructions which load 128-bit vectors.

Anway..

So why is SoA length a compile time parameter for the tests? I think it would be easier if that was a command line parameter

I agree, this would be very handy for testing.

perhaps even parameterized as nGY like so:

that could have consequences, not sure

martin-ueding commented 7 years ago

In the twisted-bc branch, it is now so that there is a -soalen command line option. If one specifies a SoA length that is not implemented, that is an error. Instead of those endless nested if constructs, there are a few template functions that convert CLI arguments into template arguments and then call the appropriate test case.

Parameterization by nGY would now be possible, one could just convert the CLI argument into the SoA length.

I will remove the configure option for SoA length eventuelly. It was not even honored by all the tests, some test cases just did all SoA lengths anyway. It seems rather pointless to build some tests with a single SoA length only.

martin-ueding commented 7 years ago

This has been solved by completely removing the --enable-soalen configure time parameter. However, there is now issue #59 which touches on this again.