Closed: gfoidl closed this 6 years ago
Maybe it is better to move `va += Vector<double>.Count;` after the `Vector.Dot` call? This should be tried.
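Expanded into a full method, that variant might look like the following sketch (the method shell, the `<=` loop bound, and the scalar tail are my additions, not the exact code from this issue):

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class DotProduct
{
    // Unrolled by two vectors; pointer increments placed after the Dot call.
    public static unsafe double Dot(double[] _vecA, double[] _vecB)
    {
        double dot = 0;
        int n = _vecA.Length;

        fixed (double* a = _vecA)
        fixed (double* b = _vecB)
        {
            double* va = a;
            double* vb = b;
            int i = 0;

            if (Vector.IsHardwareAccelerated && n >= Vector<double>.Count * 2)
            {
                for (; i <= n - 2 * Vector<double>.Count; i += 2 * Vector<double>.Count)
                {
                    Vector<double> vecA = Unsafe.Read<Vector<double>>(va);
                    Vector<double> vecB = Unsafe.Read<Vector<double>>(vb);
                    dot += Vector.Dot(vecA, vecB);
                    va += Vector<double>.Count;   // increment moved after the dot
                    vb += Vector<double>.Count;

                    vecA = Unsafe.Read<Vector<double>>(va);
                    vecB = Unsafe.Read<Vector<double>>(vb);
                    dot += Vector.Dot(vecA, vecB);
                    va += Vector<double>.Count;
                    vb += Vector<double>.Count;
                }
            }

            // scalar remainder for the tail elements
            for (; i < n; i++)
                dot += a[i] * b[i];
        }

        return dot;
    }
}
```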
This yields
000007fe`7d85e988 410f1008 movups xmm1,xmmword ptr [r8]
000007fe`7d85e98c 410f1011 movups xmm2,xmmword ptr [r9]
000007fe`7d85e990 660f3a41ca31 dppd xmm1,xmm2,31h
000007fe`7d85e996 f20f58c1 addsd xmm0,xmm1
000007fe`7d85e99a 4983c010 add r8,10h
000007fe`7d85e99e 4983c110 add r9,10h
000007fe`7d85e9a2 410f1008 movups xmm1,xmmword ptr [r8]
000007fe`7d85e9a6 410f1011 movups xmm2,xmmword ptr [r9]
000007fe`7d85e9aa 660f3a41ca31 dppd xmm1,xmm2,31h
000007fe`7d85e9b0 f20f58c1 addsd xmm0,xmm1
000007fe`7d85e9b4 4983c010 add r8,10h
000007fe`7d85e9b8 4983c110 add r9,10h
000007fe`7d85e9bc 4183c204 add r10d,4
000007fe`7d85e9c0 453bd3 cmp r10d,r11d
000007fe`7d85e9c3 7cc3 jl 000007fe`7d85e988
and in benchmarks it is marginally faster in 3 out of 4 runs.
Another rearrangement with "better" register locality is given by
fixed (double* a = _vecA)
fixed (double* b = _vecB)
{
    double* va = a - Vector<double>.Count;
    double* vb = b - Vector<double>.Count;
    int i = 0;

    if (Vector.IsHardwareAccelerated && n >= Vector<double>.Count * 2)
    {
        for (; i < n - 2 * Vector<double>.Count; i += 2 * Vector<double>.Count)
        {
            va += Vector<double>.Count;
            vb += Vector<double>.Count;
            Vector<double> vecA = Unsafe.Read<Vector<double>>(va);
            Vector<double> vecB = Unsafe.Read<Vector<double>>(vb);
            dot += Vector.Dot(vecA, vecB);

            va += Vector<double>.Count;
            vb += Vector<double>.Count;
            vecA = Unsafe.Read<Vector<double>>(va);
            vecB = Unsafe.Read<Vector<double>>(vb);
            dot += Vector.Dot(vecA, vecB);
        }
    }
    // remainder loop for the tail elements omitted
}
which results in
000007fe`7869eb8a 4983c010 add r8,10h
000007fe`7869eb8e 4983c110 add r9,10h
000007fe`7869eb92 410f1008 movups xmm1,xmmword ptr [r8]
000007fe`7869eb96 410f1011 movups xmm2,xmmword ptr [r9]
000007fe`7869eb9a 660f3a41ca31 dppd xmm1,xmm2,31h
000007fe`7869eba0 f20f58c1 addsd xmm0,xmm1
000007fe`7869eba4 4983c010 add r8,10h
000007fe`7869eba8 4983c110 add r9,10h
000007fe`7869ebac 410f1008 movups xmm1,xmmword ptr [r8]
000007fe`7869ebb0 410f1011 movups xmm2,xmmword ptr [r9]
000007fe`7869ebb4 660f3a41ca31 dppd xmm1,xmm2,31h
000007fe`7869ebba f20f58c1 addsd xmm0,xmm1
000007fe`7869ebbe 4183c204 add r10d,4
000007fe`7869ebc2 453bd3 cmp r10d,r11d
000007fe`7869ebc5 7cc3 jl 000007fe`7869eb8a
Here the `add` and `movups` instructions are grouped more compactly -- although this is not really measurable, and for the first iteration the `add` is an extra instruction, which is saved in the variant at the beginning of this comment. So I'll propose the variant at the beginning of the comment; plus, this one has a cleaner assignment for `va` and `vb`.
And instead of repeated patterns like
vecA = Unsafe.Read<Vector<double>>(va);
vecB = Unsafe.Read<Vector<double>>(vb);
va += Vector<double>.Count;
vb += Vector<double>.Count;
there should be a helper method like
private static unsafe Vector<double> GetVector(ref double* ptr)
{
Vector<double> vec = Unsafe.Read<Vector<double>>(ptr);
ptr += Vector<double>.Count;
return vec;
}
which gets inlined anyway. So it is:
vecA = GetVector(ref va);
vecB = GetVector(ref vb);
Much clearer 😄
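Folded into the loop, the whole kernel might then read like this sketch (the method shell and the scalar tail are my additions):

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class DotProductHelper
{
    // Reads one Vector<double> at ptr and advances ptr past it.
    private static unsafe Vector<double> GetVector(ref double* ptr)
    {
        Vector<double> vec = Unsafe.Read<Vector<double>>(ptr);
        ptr += Vector<double>.Count;
        return vec;
    }

    public static unsafe double Dot(double[] _vecA, double[] _vecB)
    {
        double dot = 0;
        int n = _vecA.Length;

        fixed (double* a = _vecA)
        fixed (double* b = _vecB)
        {
            double* va = a;
            double* vb = b;
            int i = 0;

            if (Vector.IsHardwareAccelerated && n >= Vector<double>.Count * 2)
            {
                for (; i <= n - 2 * Vector<double>.Count; i += 2 * Vector<double>.Count)
                {
                    // unrolled by two vectors; C# evaluates arguments left to right
                    dot += Vector.Dot(GetVector(ref va), GetVector(ref vb));
                    dot += Vector.Dot(GetVector(ref va), GetVector(ref vb));
                }
            }

            // scalar remainder for the tail elements
            for (; i < n; i++)
                dot += a[i] * b[i];
        }

        return dot;
    }
}
```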
Current situation
When vectorizing with SIMD, a lot of range checks occur -- for instance when computing the dot product.
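For illustration, such a safe (range-checked) variant might look like the following sketch -- the names are my assumptions, not the exact code from this issue:

```csharp
using System;
using System.Numerics;

public static class SafeDot
{
    // Safe variant: the Vector<double>(array, index) constructor performs
    // a range check on every load, and the array indexing in the scalar
    // tail is bounds-checked as well.
    public static double Dot(double[] a, double[] b)
    {
        double dot = 0;
        int i = 0;

        if (Vector.IsHardwareAccelerated)
        {
            for (; i <= a.Length - Vector<double>.Count; i += Vector<double>.Count)
            {
                var va = new Vector<double>(a, i);
                var vb = new Vector<double>(b, i);
                dot += Vector.Dot(va, vb);
            }
        }

        for (; i < a.Length; i++)
            dot += a[i] * b[i];

        return dot;
    }
}
```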
Generates (only the first loop shown):
Even without SIMD and manual unrolling, the JIT can't elide the bounds checks.
Possibility
With unsafe the JIT generates pretty straight code :smile:
Generates (only the first loop shown):
So this generates a pretty tight and clean loop. Take advantage of this.