Closed unpacklo closed 2 years ago
Nice! Can you show the ASM difference between the old and new asuint(int)? You'd expect a cast to be free, so I'm curious what it did before and what it does now.
I'm curious as to whether you checked the codegen for Burst as well? It's not clear to me what N you are using for your perf tests and whether some of the perf degradations for Burst are noise or real regressions (for which there would be a codegen explanation).
My first thought was "we should have a define for whether this is compiled with Burst!" but then I realized that this is not how Burst works at all, so we'll have to settle for a compromise between Mono and Burst (and that compromise had better not hit Mono as badly as our previous compromise did).
Nice! Can you show the ASM difference between the old and new asuint(int)?
Old:
public static uint asuint(int x)
Address: 000000017CE19200
Code Size in Bytes: 16
Debug mode: enabled
000000017ce19200 48 83 ec 08 sub rsp, 0x8
000000017ce19204 48 89 3c 24 mov [rsp], rdi
000000017ce19208 48 8b c7 mov rax, rdi
000000017ce1920b 48 83 c4 08 add rsp, 0x8 ; 8
000000017ce1920f c3 ret
New:
public static uint asuint(int x)
Address: 000000017CA2CA80
Code Size in Bytes: 16
Debug mode: enabled
000000017ca2ca80 48 83 ec 08 sub rsp, 0x8
000000017ca2ca84 48 89 3c 24 mov [rsp], rdi
000000017ca2ca88 8b 04 24 mov eax, [rsp] ; read singlestep trampoline
000000017ca2ca8b 48 83 c4 08 add rsp, 0x8 ; 8
000000017ca2ca8f c3 ret
I don't quite understand why casting vs type punning makes any difference here, but I also just realized I don't have perf tests for these so I'll add those and see how those perform for this case.
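For readers following along, here is a minimal sketch of the two approaches being compared, a plain cast versus type punning. The class name, the method names, and the IntUIntUnion struct are hypothetical and made up for illustration; the punning mechanism the PR actually uses is not shown in this thread, so overlapping struct fields are only one assumed way to do it.

```csharp
using System;
using System.Runtime.InteropServices;

public static class AsUintSketch
{
    // Hypothetical union-style struct: both fields share the same 4 bytes.
    [StructLayout(LayoutKind.Explicit)]
    private struct IntUIntUnion
    {
        [FieldOffset(0)] public int intValue;
        [FieldOffset(0)] public uint uintValue;
    }

    // Type punning: write the int field, read the overlapping uint field.
    public static uint AsUintViaPunning(int x)
    {
        IntUIntUnion u;
        u.uintValue = 0;   // the field we read must be definitely assigned
        u.intValue = x;    // overwrites the same 4 bytes
        return u.uintValue;
    }

    // Plain cast: unchecked so negative inputs wrap instead of throwing
    // when the code is compiled in a checked context.
    public static uint AsUintViaCast(int x)
    {
        return unchecked((uint)x);
    }
}
```

Both methods return the same bit pattern for every input, so any difference between them is purely a matter of what the JIT generates.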
I'm curious as to whether you checked the codegen for Burst as well?
Not for all, but some. For example, here's asfloat(uint3).
Old:
.text
.intel_syntax noprefix
.file "main"
.globl "Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF" # -- Begin function Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF
.p2align 4, 0x90
.type "Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF",@function
"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF": # @"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF"
.Lfunc_begin0:
.file 1 "/Users/dale/code/Unity.Mathematics/src/Unity.Mathematics.PerformanceTests/TestMath.gen.cs"
.loc 1 1192 0 # TestMath.gen.cs:1192:0
.cfi_sections .debug_frame
.cfi_startproc
# %bb.0: # %entry
mov rax, -4800000
.p2align 4, 0x90
.LBB0_1: # %BL.0004.i
# =>This Inner Loop Header: Depth=1
.Ltmp0:
.loc 1 1178 49 prologue_end # TestMath.gen.cs:1178:49
mov rcx, qword ptr [rdi + 8]
mov rdx, qword ptr [rdi + 16]
movss xmm0, dword ptr [rdx + rax + 4800008] # xmm0 = mem[0],zero,zero,zero
movsd xmm1, qword ptr [rdx + rax + 4800000] # xmm1 = mem[0],zero
movss dword ptr [rcx + rax + 4800000], xmm1
extractps dword ptr [rcx + rax + 4800004], xmm1, 1
movss dword ptr [rcx + rax + 4800008], xmm0
.loc 1 1182 13 # TestMath.gen.cs:1182:13
add rax, 12
jne .LBB0_1
.Ltmp1:
# %bb.2: # %"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.CommonTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_A3E8969154AF482D.exit"
.loc 1 1193 13 # TestMath.gen.cs:1193:13
ret
New:
.text
.intel_syntax noprefix
.file "main"
.globl "Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF" # -- Begin function Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF
.p2align 4, 0x90
.type "Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF",@function
"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF": # @"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.BurstTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_5B581B7AAD6CC3EF"
.Lfunc_begin0:
.file 1 "/Users/dale/code/Unity.Mathematics/src/Unity.Mathematics.PerformanceTests/TestMath.gen.cs"
.loc 1 1192 0 # TestMath.gen.cs:1192:0
.cfi_sections .debug_frame
.cfi_startproc
# %bb.0: # %entry
mov rax, -4800000
.p2align 4, 0x90
.LBB0_1: # %BL.0004.i
# =>This Inner Loop Header: Depth=1
.Ltmp0:
.loc 1 1178 49 prologue_end # TestMath.gen.cs:1178:49
mov rcx, qword ptr [rdi + 8]
mov rdx, qword ptr [rdi + 16]
movss xmm0, dword ptr [rdx + rax + 4800008] # xmm0 = mem[0],zero,zero,zero
movsd xmm1, qword ptr [rdx + rax + 4800000] # xmm1 = mem[0],zero
movss dword ptr [rcx + rax + 4800000], xmm1
extractps dword ptr [rcx + rax + 4800004], xmm1, 1
movss dword ptr [rcx + rax + 4800008], xmm0
.loc 1 1182 13 # TestMath.gen.cs:1182:13
add rax, 12
jne .LBB0_1
.Ltmp1:
# %bb.2: # %"Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.CommonTestFunction(ref Unity.Mathematics.PerformanceTests.TestMath.asfloat_uint3.Arguments args)_A3E8969154AF482D.exit"
.loc 1 1193 13 # TestMath.gen.cs:1193:13
ret
The code is identical in this case, which is good, but I don't know if this is true for all the code that got changed. I'm going to grab all the bursted code and then see if a diff surfaces anything of interest.
@sschoener I've compared all the burst code gen and there are no diffs with this code change. Regarding the mono generated code for type punning vs casting in the case of asuint(int), I haven't been able to determine whether one version is in fact faster than the other on my MacBook Pro, but it seems pretty safe to assume that accessing the stack can at best be as fast as the register move.
DOTS-5396
While working on #201 I noticed that chgsign was pretty slow without burst so I investigated more carefully. With the help of https://github.com/sschoener/unity-asm-explorer-package from @sschoener, I was finally able to see what was going on.
chgsign makes use of asfloat and asuint to manipulate the sign bit directly but mono (without burst) struggles with these methods because of the use of IntFloatUnion which forces it to initialize all fields to zero prior to doing any work. It also forces excessive stack traffic and conversion from single -> double -> single.
Here is an example of asfloat(uint4) before this change:
With this change, mono generates much better code:
Some select perf numbers from my MacBook Pro, which has an Intel(R) Core(TM) i9-9880H CPU (times are in microseconds, and the burst times tend to be noisier because they are so short):
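As background for the description above, here is a rough sketch of the sign-bit trick in scalar form. It assumes chgsign(x, y) flips the sign of x when y is negative. All names here (ChgSignSketch, IntFloatUnionSketch, AsUint, AsFloat, ChgSign) are hypothetical stand-ins and not the library's code; the union-style helper only mirrors the pattern the description says mono handles poorly.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ChgSignSketch
{
    // Hypothetical stand-in for a union of uint and float sharing 4 bytes.
    [StructLayout(LayoutKind.Explicit)]
    private struct IntFloatUnionSketch
    {
        [FieldOffset(0)] public uint uintValue;
        [FieldOffset(0)] public float floatValue;
    }

    // Reinterpret the bits of a float as a uint.
    public static uint AsUint(float x)
    {
        IntFloatUnionSketch u;
        u.uintValue = 0;    // zeroed first so the field we read is definitely assigned
        u.floatValue = x;
        return u.uintValue;
    }

    // Reinterpret the bits of a uint as a float.
    public static float AsFloat(uint x)
    {
        IntFloatUnionSketch u;
        u.floatValue = 0f;  // same zeroing, in the other direction
        u.uintValue = x;
        return u.floatValue;
    }

    // Assumed semantics: XOR the sign bit of y into x, so x changes sign
    // exactly when y is negative.
    public static float ChgSign(float x, float y)
    {
        return AsFloat(AsUint(x) ^ (AsUint(y) & 0x80000000u));
    }
}
```

In this sketch the zero store in each helper exists only to satisfy C#'s definite-assignment rules, which is the kind of extra initialization the description attributes to the union approach.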