Closed: mrange closed this issue 4 years ago.
/cc @tannergooding
cc @CarolEidt, @AndyAyersMS
Is this already covered under the "First Class Struct" work? This basically looks like a case where the JIT isn't realizing that the V3 struct is just a convenience wrapper and that the fields could be treated as locals for the lifetime of the function.
Hmm, I would have expected all the V3 locals to be promoted, and that should have eliminated all the copying. Let me take a closer look.
We do fully promote, but certain promoted temps are blocked from enregistration, and this causes the odd looking store / reload blocks:
fgMorphCopyBlock:
The assignment [000199] using V74 removes: Constant Assertion: V74 == 0
block assignment to morph:
[000195] -----+------ /--* LCL_VAR simd32 V15 tmp3
[000199] -A---------- * ASG simd32 (copy)
[000198] n----+-N---- \--* BLK(32) simd32
[000197] -----+------ \--* ADDR byref
[000196] D----+-N---- \--* LCL_VAR simd32 V74 tmp62
this requires a CopyBlock.
Local V74 should not be enregistered because: written in a block op
...
; V74 tmp62 [V74,T44] ( 2, 8 ) simd32 -> [rsp+0x280] do-not-enreg[SB] V14.X(offs=0x00) P-INDEP "field V14.X (fldOffset=0x0)"
; V75 tmp63 [V75,T45] ( 2, 8 ) simd32 -> [rsp+0x260] do-not-enreg[SB] V14.Y(offs=0x20) P-INDEP "field V14.Y (fldOffset=0x20)"
; V76 tmp64 [V76,T46] ( 2, 8 ) simd32 -> [rsp+0x240] do-not-enreg[SB] V14.Z(offs=0x40) P-INDEP "field V14.Z (fldOffset=0x40)"
...
C57D11A42480020000 vmovupd ymmword ptr[rsp+280H], ymm12
C57D11AC2460020000 vmovupd ymmword ptr[rsp+260H], ymm13
C57D11B42440020000 vmovupd ymmword ptr[rsp+240H], ymm14
C57D10A42480020000 vmovupd ymm12, ymmword ptr[rsp+280H]
C57D10AC2460020000 vmovupd ymm13, ymmword ptr[rsp+260H]
C57D10B42440020000 vmovupd ymm14, ymmword ptr[rsp+240H]
Not sure yet why these temps are blocked and not others.
Looks like all the blocked cases are from the various new calls that get inlined. Perhaps the local address visitor should more aggressively simplify...
LocalAddressVisitor visiting statement:
[000200] ------------ * STMT void (IL 0x00D... ???)
[000195] ------------ | /--* LCL_VAR simd32 V15 tmp3
[000199] -A---------- \--* ASG simd32 (copy)
[000198] ------------ \--* BLK(32) simd32
[000197] ------------ \--* ADDR byref
[000196] ------------ \--* FIELD struct X
[000193] ------------ \--* ADDR byref
[000194] ------------ \--* LCL_VAR struct(P) V14 tmp2
\--* simd32 V14.X (offs=0x00) -> V74 tmp62
\--* simd32 V14.Y (offs=0x20) -> V75 tmp63
\--* simd32 V14.Z (offs=0x40) -> V76 tmp64
Replacing the field in promoted struct with local var V74
LocalAddressVisitor modified statement:
[000200] ------------ * STMT void (IL 0x00D... ???)
[000195] ------------ | /--* LCL_VAR simd32 V15 tmp3
[000199] -A---------- \--* ASG simd32 (copy)
[000198] ------------ \--* BLK(32) simd32
[000197] ------------ \--* ADDR byref
[000196] ------------ \--* LCL_VAR simd32 V74 tmp62
I'd like to see that last tree just collapse to an assignment of locals.
cc @mikedn who I think has looked at something like this.
This looks like a fgMorphCopyBlock issue; it managed to produce a non-block copy, but not before marking the variable as DNER. I'll take a look later today.
Yes, it can also be done in LocalAddressVisitor, but so far I've been hesitant about adding this kind of stuff to it, pending further investigation and a decision regarding moving struct promotion out of it.
And then there's probably the main root cause of this issue: the presence of a block op from the beginning. That may be caused by the fact that it's a struct-typed struct field, and the JIT's LCL_FLD cannot be properly used in such cases because it doesn't maintain struct type information. I'm working on adding such support.
Seems roughly like we just need to call fgMorphBlockOperand earlier in fgMorphCopyBlock; then we'll fall into the reg struct exemption from DNER.
Not sure yet; it looks more like an fgMorphOneAsgBlockOp issue to me. It doesn't quite recognize SIMD assignments.
> Yes, it can also be done in LocalAddressVisitor but so far I've been hesitant about adding this kind of stuff to it, pending further investigation and decision regarding moving struct promotion out of it.
And it turns out that it's not enough to do this in LocalAddressVisitor (which is trivial). Even if it generates:
[000195] ------------ | /--* LCL_VAR simd32 V15 tmp3
[000199] -A---------- \--* ASG simd32 (copy)
[000198] ------------ \--* LCL_VAR simd32 V74 tmp62
fgMorphCopyBlock still makes V74 DNER. So we need to fix it anyway. Oh well.
I am not sure from the conversation whether this kind of optimization is something you wish the jitter to handle, but is there some way to change my code to get more optimized output without needing to manually inline every SIMD call?
@mrange I'm not aware of any other way that doesn't require manual inlining. The JIT should definitely learn to handle structs better, it has a rather long history of failing in this area and it doesn't have to be this way.
@AndyAyersMS I have a tentative 3 line fix in https://github.com/mikedn/coreclr/commit/2f79dfdd7af2858d52cbdd9d73d54771af5b5a05. But I need to read that code more carefully to be sure. What would be the deadline for getting such a fix into .NET Core 3.0?
There is a fair amount of overhead in passing and returning structs. You can minimize some of this in source by using in and ref, but you may not like how your code looks that way either.
If you don't want to inline the basic operators then you might find it works better to use arrays (or perhaps spans) as your top level aggregates, though you will lose the convenience of being able to refer to things by name (you'd have 0, 1, 2 instead of X, Y, Z).
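The in suggestion could look roughly like this for the V3 type seen in the dumps. This is a sketch under assumptions: the field layout is inferred from the simd32 X/Y/Z fields at offsets 0x00/0x20/0x40 in the JIT dump above, and the Add helper is illustrative, not the OP's actual code:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct V3
{
    // Three 256-bit lanes, matching the simd32 fields at offsets 0x00/0x20/0x40.
    public Vector256<float> X, Y, Z;

    // 'in' passes a readonly reference instead of copying 96 bytes per argument.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 Add(in V3 l, in V3 r)
    {
        V3 result;
        result.X = Avx.Add(l.X, r.X);
        result.Y = Avx.Add(l.Y, r.Y);
        result.Z = Avx.Add(l.Z, r.Z);
        return result;
    }
}
```

One caveat: because V3 is not a readonly struct, invoking methods through an in parameter can introduce defensive copies; direct field reads like the ones above do not.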
in did seem to help. I was considering trying ref returns, but then I think I have to switch from + to += semantics?
Part of the reason to use struct is to eliminate the need for heap objects, as + etc. will create new instances. If I use an array, wouldn't that imply more heap objects? Of course, if I need to go for mutating operators like +=, maybe the heap overhead can be mitigated.
But if I use your suggestion of spans, perhaps the spans can reference into a big array. Hmm. Worth some experimenting.
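A rough sketch of that span idea, assuming one flat backing array for all the lanes (the names and sizes here are purely illustrative):

```csharp
using System;
using System.Runtime.Intrinsics;

// One flat array holds every vector; each logical V3 is three consecutive
// 256-bit lanes (X, Y, Z), so no per-vector heap object is needed.
Vector256<float>[] lanes = new Vector256<float>[3 * 1024];

// A span is just a view into the array: slot i's X/Y/Z live at
// indices 3*i, 3*i+1, 3*i+2.
Span<Vector256<float>> v = lanes.AsSpan(3 * 42, 3);
v[0] = Vector256<float>.Zero; // X
v[1] = Vector256<float>.Zero; // Y
v[2] = Vector256<float>.Zero; // Z
```

The trade-off is exactly the one mentioned above: components become indices 0, 1, 2 rather than named fields.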
> I think I have to switch from + to += in semantics?
Be careful with += operators. The C# compiler tends to generate code that the JIT can't handle well, leaving variables address-exposed. Sometimes that's a lot worse than a few extra copies.
Tried a variant with methods like this:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ref V3 Add(ref V3 l, in V3 r)
{
l.X = Avx.Add(l.X, r.X);
l.Y = Avx.Add(l.Y, r.Y);
l.Z = Avx.Add(l.Z, r.Z);
return ref l;
}
The outcome was worse than the OP version. The disassembly indicated that operations were never register-to-register but rather register-to-memory. So that doesn't seem like a fruitful alternative.
(Perhaps obvious, but it wasn't to me)
@mikedn we're ok with fixes going into 3.0 for the next few weeks, so put up a PR when you have something you feel good about.
@AndyAyersMS should we close the discussion here and let new PRs be created in CoreCLR, or is it worth moving this issue to CoreCLR repo?
@CarolEidt has a pending PR in the same area, not sure if that fixes this issue or not.
IMO this should be moved to coreclr, it's a typical JIT issue.
Went ahead and moved it over here. I would have expected some kind of explicit forwarding pointer over in dotnet/core, but the old issue just vanishes there. Hopefully nobody's too confused.
Marking as future but if we end up with a simple fix soon we can probably get it into 3.0.
cc @dotnet/jit-contrib
I verified that @CarolEidt's work in dotnet/coreclr#22255 does indeed fix this issue:
G_M27407_IG03:
C54C59E6 vmulps ymm12, ymm6, ymm6
C54459EF vmulps ymm13, ymm7, ymm7
C4413C59F0 vmulps ymm14, ymm8, ymm8
C4413459F9 vmulps ymm15, ymm9, ymm9
C4C12C59EA vmulps ymm5, ymm10, ymm10
C4C12459E3 vmulps ymm4, ymm11, ymm11
C4414C59C9 vmulps ymm9, ymm6, ymm9
C4414459D2 vmulps ymm10, ymm7, ymm10
C4413C59DB vmulps ymm11, ymm8, ymm11
C4C11C5CF7 vsubps ymm6, ymm12, ymm15
C5945CED vsubps ymm5, ymm13, ymm5
C58C5CE4 vsubps ymm4, ymm14, ymm4
C5CC58F0 vaddps ymm6, ymm6, ymm0
C5D458F9 vaddps ymm7, ymm5, ymm1
C55C58C2 vaddps ymm8, ymm4, ymm2
C4C13458E1 vaddps ymm4, ymm9, ymm9
C4C12C58EA vaddps ymm5, ymm10, ymm10
C4412458CB vaddps ymm9, ymm11, ymm11
C5DC58E3 vaddps ymm4, ymm4, ymm3
C57D10642440 vmovupd ymm12, ymmword ptr[rsp+40H]
C4415458D4 vaddps ymm10, ymm5, ymm12
C5FD106C2420 vmovupd ymm5, ymmword ptr[rsp+20H]
C53458DD vaddps ymm11, ymm9, ymm5
C57C28CC vmovaps ymm9, ymm4
C57D111C24 vmovupd ymmword ptr[rsp], ymm11
FFC2 inc edx
81FA80969800 cmp edx, 0x989680
C57D11642440 vmovupd ymmword ptr[rsp+40H], ymm12
C5FD116C2420 vmovupd ymmword ptr[rsp+20H], ymm5
0F8C31010000 jl G_M27407_IG06
That's quite exciting news. When/how can I test it?
@CarolEidt's work also makes the codegen of PacketTracer much better.
Diff on GetPoints (collected on OSX):
--- masterGetPoints.asm 2019-03-18 21:22:47.000000000 -0700
+++ CarolGetPoints.asm 2019-03-18 21:23:26.000000000 -0700
@@ -1,14 +1,13 @@
push rbp
-sub rsp, 496
vzeroupper
-lea rbp, [rsp+1F0H]
+mov rbp, rsp
vxorps xmm0, xmm0
vcvtsi2ss xmm0, dword ptr [rdi+16]
vbroadcastss ymm0, ymm0
vxorps xmm1, xmm1
vcvtsi2ss xmm1, dword ptr [rdi+20]
vbroadcastss ymm1, ymm1
-mov rax, 0x191370B58
+mov rax, 0x193608B58
mov rax, gword ptr [rax]
add rax, 8
vdivps ymm2, ymm0, ymmword ptr[rax]
@@ -34,21 +33,9 @@
vmulps ymm5, ymm0, ymm5
vmulps ymm6, ymm0, ymm6
vmulps ymm0, ymm0, ymm7
-vmovupd ymmword ptr[V77 rbp-30H], ymm5
-vmovupd ymmword ptr[V78 rbp-50H], ymm6
-vmovupd ymmword ptr[V79 rbp-70H], ymm0
-vmovupd ymm0, ymmword ptr[V77 rbp-30H]
-vmovupd ymm5, ymmword ptr[V78 rbp-50H]
-vmovupd ymm6, ymmword ptr[V79 rbp-70H]
-vaddps ymm0, ymm2, ymm0
-vaddps ymm2, ymm3, ymm5
-vaddps ymm3, ymm4, ymm6
-vmovupd ymmword ptr[V86 rbp-90H], ymm0
-vmovupd ymmword ptr[V87 rbp-B0H], ymm2
-vmovupd ymmword ptr[V88 rbp-D0H], ymm3
-vmovupd ymm0, ymmword ptr[V86 rbp-90H]
-vmovupd ymm2, ymmword ptr[V87 rbp-B0H]
-vmovupd ymm3, ymmword ptr[V88 rbp-D0H]
+vaddps ymm2, ymm2, ymm5
+vaddps ymm3, ymm3, ymm6
+vaddps ymm0, ymm4, ymm0
add rdx, 200
vmovupd ymm4, ymmword ptr[rdx]
vmovupd ymm5, ymmword ptr[rdx+32]
@@ -56,44 +43,22 @@
vmulps ymm4, ymm1, ymm4
vmulps ymm5, ymm1, ymm5
vmulps ymm1, ymm1, ymm6
-vmovupd ymmword ptr[V92 rbp-F0H], ymm4
-vmovupd ymmword ptr[V93 rbp-110H], ymm5
-vmovupd ymmword ptr[V94 rbp-130H], ymm1
-vmovupd ymm1, ymmword ptr[V92 rbp-F0H]
-vmovupd ymm4, ymmword ptr[V93 rbp-110H]
-vmovupd ymm5, ymmword ptr[V94 rbp-130H]
-vaddps ymm0, ymm0, ymm1
-vaddps ymm1, ymm2, ymm4
-vaddps ymm2, ymm3, ymm5
-vmovupd ymmword ptr[V101 rbp-150H], ymm0
-vmovupd ymmword ptr[V102 rbp-170H], ymm1
-vmovupd ymmword ptr[V103 rbp-190H], ymm2
-vmovupd ymm0, ymmword ptr[V101 rbp-150H]
-vmovupd ymm1, ymmword ptr[V102 rbp-170H]
-vmovupd ymm2, ymmword ptr[V103 rbp-190H]
-vmovaps ymm3, ymm0
-vmovaps ymm4, ymm1
-vmovaps ymm5, ymm2
-vmulps ymm4, ymm4, ymm1
-vmulps ymm5, ymm5, ymm2
-vmulps ymm3, ymm3, ymm0
-vaddps ymm3, ymm3, ymm4
+vaddps ymm2, ymm2, ymm4
vaddps ymm3, ymm3, ymm5
-vsqrtps ymm3, ymm3
-vdivps ymm0, ymm0, ymm3
-vdivps ymm1, ymm1, ymm3
-vdivps ymm2, ymm2, ymm3
-vmovupd ymmword ptr[V104 rbp-1B0H], ymm0
-vmovupd ymmword ptr[V105 rbp-1D0H], ymm1
-vmovupd ymmword ptr[V106 rbp-1F0H], ymm2
-vmovupd ymm0, ymmword ptr[V104 rbp-1B0H]
-vmovupd ymmword ptr[rsi], ymm0
-vmovupd ymm0, ymmword ptr[V105 rbp-1D0H]
-vmovupd ymmword ptr[rsi+32], ymm0
-vmovupd ymm0, ymmword ptr[V106 rbp-1F0H]
+vaddps ymm0, ymm0, ymm1
+vmulps ymm1, ymm3, ymm3
+vmulps ymm4, ymm0, ymm0
+vmulps ymm5, ymm2, ymm2
+vaddps ymm1, ymm5, ymm1
+vaddps ymm1, ymm1, ymm4
+vsqrtps ymm1, ymm1
+vdivps ymm2, ymm2, ymm1
+vdivps ymm3, ymm3, ymm1
+vdivps ymm0, ymm0, ymm1
+vmovupd ymmword ptr[rsi], ymm2
+vmovupd ymmword ptr[rsi+32], ymm3
vmovupd ymmword ptr[rsi+64], ymm0
mov rax, rsi
vzeroupper
-lea rsp, [rbp]
pop rbp
ret
I believe larger functions like GetNaturalColor would benefit more.
I've re-synced dotnet/coreclr#22255 and am working through some new issues. I hope to get this in for 3.0, but will have to see what complexities arise.
The removed lea rsp, [rbp] is, I suppose, because there's no data in the stack frame and no need to restore the stack pointer?
> because there's no data in the stackframe and no need to restore the stackpointer?
Yes, because the program no longer needs to allocate stack space via sub rsp, 496 for spilled local variables, so rsp is unchanged in this function body.
Fixed with dotnet/coreclr#22255
Struct type overhead higher than expected for small numerical vectors
I am running the .NET Core 3.0 preview on Windows 10.
My CPU is:
Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
(so it doesn't have FMA support but does support AVX).
I have been messing around with the new intrinsics support in .NET Core 3.0 quite a lot and have had some success with it. I want to write a raymarcher using SIMD AVX.
To do so I declare a SIMD V3 struct type:
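(The declaration itself is in the attached file. Judging by the simd32 X/Y/Z fields at offsets 0x00/0x20/0x40 in the JIT dumps earlier in the thread, it is presumably something like the following sketch:)

```csharp
using System.Runtime.Intrinsics;

// Presumed shape of the V3 packet type: each field packs 8 floats,
// so one V3 represents 8 three-component vectors in SoA form.
public struct V3
{
    public Vector256<float> X;
    public Vector256<float> Y;
    public Vector256<float> Z;
}
```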
I know there are SIMD-enabled types in System.Numerics.Vector, but I want to do my own custom SIMD struct types. When I inspect the disassembly I find code that seems to do nothing except slow down performance. I can work around these issues by inlining all code, but naturally I don't want to do that, as it will significantly complicate my raymarchers.
I have the full code attached below.
I declared a few operators like so:
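(The operator bodies are in the attachment; a plausible reconstruction, based on the Avx.Add style used elsewhere in the thread rather than the original source, might be:)

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct V3
{
    public Vector256<float> X, Y, Z;

    // Each operator maps component-wise onto one AVX instruction per lane.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator +(V3 l, V3 r) => new V3
    {
        X = Avx.Add(l.X, r.X),
        Y = Avx.Add(l.Y, r.Y),
        Z = Avx.Add(l.Z, r.Z),
    };

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator *(V3 l, V3 r) => new V3
    {
        X = Avx.Multiply(l.X, r.X),
        Y = Avx.Multiply(l.Y, r.Y),
        Z = Avx.Multiply(l.Z, r.Z),
    };
}
```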
My inner loop looks like this
When looking into the disassembly I find code that is odd to me:
So what seems odd to me is saving state to the stack and then immediately reloading it, never looking at the saved state again. Perhaps one could argue that qre and qim need visibility because I named the variables (lvalue expressions, to borrow a term from C++), but it also seems intermediate results are stored on the stack (rvalue expressions). I was helped somewhat by adding in; that did eliminate some code, but not the unnecessary writes to the stack (unnecessary as it seems to me). If I inline all operations so that my inner loop looks like this:
Then the disassembly looks more appealing and performs 3x faster.
For F# the disassembly looks even worse.
Perhaps there are some obvious flags I don't know of that I should have set on my struct types to enable the jitter to eliminate the intermediate results. I am happy with such a solution.
Full C# example:
project file: