Closed martin-ueding closed 7 years ago
In the code generator, there seem to be only two cases implemented: SOALEN == VECLEN und SOALEN == VECLEN / 2. In other words, we can only have nGY == 1 and nGY == 2.
Where do you see this? You can set SOALEN==2 and should get nGY==4, no?
One instance would be the old Makefile
for the code generator:
avx2:
mkdir -p ./avx2
@make clean && make mode=avx PRECISION=2 SOALEN=2 AVX2=1 && ./codegen
@make clean && make mode=avx PRECISION=2 SOALEN=4 AVX2=1 && ./codegen
@make clean && make mode=avx PRECISION=1 SOALEN=8 AVX2=1 && ./codegen
@make clean && make mode=avx PRECISION=1 SOALEN=4 AVX2=1 && ./codegen
@make clean && make mode=avx PRECISION=1 SOALEN=8 AVX2=1 ENABLE_LOW_PRECISION=1 && ./codegen
@make clean && make mode=avx PRECISION=1 SOALEN=4 AVX2=1 ENABLE_LOW_PRECISION=1 && ./codegen
That is always nGY <= 2
.
Or look at snippets from the AVX2 (256 bit) code in the generator for float (veclen is 8):
if (soalen == 8) {
buf << v.getName()
<< " = _mm256_blend_ps(_mm256_loadu_ps("
<< a1->serialize() << "), _mm256_broadcast_ss("
<< a2->serialize() << "), " << (1 << (soalen - 1))
<< ");" << endl;
} else {
buf << v.getName() << " = _mm256_insertf128_ps("
<< v.getName() << ", _mm_blend_ps(_mm_loadu_ps("
<< a1->serialize() << "), _mm_broadcast_ss("
<< a2->serialize() << "), " << (1 << (soalen - 1))
<< "), " << soanum << ");" << endl;
}
Either you have SoA length 8 (nGY is 1) or you use SoA length 4 (nGY is 2). But there are no further alternatives.
So I conclude that more than nGY = 2 is not supported at all on AVX2.
For AVX512, there is nGY = 4, indeed:
void transpose(InstVector &ivector,
const FVec r[],
const FVec f[],
int soalen)
{
switch (soalen) {
case 4:
transpose4x4(ivector, r, f);
break;
case 8:
transpose2x2(ivector, r, f);
break;
case 16:
transpose1x1(ivector, r, f);
break;
default:
printf("SOALEN = %d Not Supported (only SOALEN = 4, 8 & 16 are "
"supported)\n",
soalen);
exit(1);
}
}
So it is even more complex: Depending on the architecture, not every SoA length is defined. At least nGY = 2 seems to be defined most of the time (except in the scalar code).
As far as the Makefile is concerned, I agree, but technically one could implement any soalen performance-wise. soalen = veclen/2 is, as far as I understand, the only sensible minimum because it can be implemented with two instrinsics and the resulting restriction on the lattice is not too strict to be impractical. This might change when vector lengths (possibly) grow to 32 or 64 floats. Then it might still make sense to load no more than four or eight floats at a time into these long vectors with multiple chained instructions, but then carry out all multiplications with full vectors.
For vector length 8, if you look at loading instructions like LoadSplitSOAFVec
, you see the the case soalen == veclen is treated as special case where the load is carried out in a single instruction (see the loop which calls this instruction). If on the other hand soalen is any other divisor of veclen, you get multiple instructions which load 128-bit vectors.
Anway..
So why is SoA length a compile time parameter for the tests? I think it would be easier if that was a command line parameter
I agree, this would be very handy for testing.
perhaps even parameterized as nGY like so:
that could have consequences, not sure
In the twisted-bc
branch, it is now so that there is a -soalen
command line option. If one specifies a SoA length that is not implemented, that is an error. Instead of those endless nested if
constructs, there are a few template functions that convert CLI arguments into template arguments and then call the appropriate test case.
Parameterization by nGY
would now be possible, one could just convert the CLI argument into the SoA length.
I will remove the configure option for SoA length eventuelly. It was not even honored by all the tests, some test cases just did all SoA lengths anyway. It seems rather pointless to build some tests with a single SoA length only.
This has been solved by completely removing the --enable-soalen
configure time parameter. However, there is now issue #59 which touches on this again.
In the code generator, there seem to be only two cases implemented:
SOALEN == VECLEN
undSOALEN == VECLEN / 2
. In other words, we can only havenGY == 1
andnGY == 2
.When compiling QPhiX, the user has to pass the SoA length to
configure
. This is something that I do not particularly understand. Why is the SoA length fixed and notnGY
? The consequence is that if you want to use vector folding, i.e.nGY == 2
, you would useSOALEN == VECLEN / 2
. The strange thing is that the vector length depends on the float type, but SoA length does not. This means that if you selectnGY == 1
indouble
, you will getnGY == 2
infloat
.In the test and timing codes, there are mostly lines like the following:
The
VECLEN_SP
will be some particular value on the current ISA. theQPHIX_SOALEN
was chosen by the user during compile time of the tests. If I have chosen a SoA length such that I getnGY == 2
fordouble
, I will getnGY == 4
infloat
. But that is not implemented!That is, in my
twisted-bc
branch it is not implemented. The previous kernel code was such that the general template was not just a declaration but a function definition with somemasterPrintf
saying that this combination of vector length and SoA lenght is not implemented. I found this bad because it will give you a runtime error when you try to use this eventually but lets all your code compile, even though it uses not-implemented kernels. I have changed those to function declarations only, giving linker errors when one tries to use those things.On an AVX2 machine I have now the following linker error:
The last four
bool
template arguments are from the twisted boundary conditions. The usual four are these now:float, 8, 2, true
. This meansnGY == 4
which is not implemented. Therefore I get the linker error.This is only caused because in the tests one uses the same
QPHIX_SOALEN
for half, single, and double precision. The quick fix would be just do add a preprocessor statement around this and only call that if it will call an implemented kernel.But why is SoA length a compile time constant of the test code anyway? The other template parameter, namely
veclen
andcompress12
are not needed at compile time. For the gauge compression, it just instantiates both during compile time and then picks one at run time:So why is SoA length a compile time parameter for the tests? I think it would be easier if that was a command line parameter, perhaps even parameterized as
nGY
like so:This would have the added benefit that one could quickly compare various SoA lengths against each other without having to configure with different parameters again.
With my changes to the build process, all kernels get compiled anyway. So there is no saving in compilation time by just compiling for one particular SoA length. Would you agree that making SoA length a run time option would be a good idea?