Closed by skmp 3 weeks ago
Did a first pass on the dispatch optimizations; see skmp/reduce-dispatch-overhead and skmp/optihacks-4.
Overall, ByteMark perf dropped, though I'm not 100% certain of the numbers yet. The implementation is mostly complete, but not optimal, so it generates slightly larger code per OP_EXITFUNCTION.
Further optimizations on that branch:
UT2004 and FTL show a ~20% perf win, especially in more complex scenes.
Update: looks like there is a general performance regression, even on native ByteMark. OS issue?
Implemented multiple entry points in skmp/multiple-entry-points (on top of optihacks-4).
Perf results are mixed: ByteMark is slightly slower, FTL gains a few FPS at complex points, and Metro is noticeably slower.
This is likely because the (far) larger codegen makes the L1i issues worse. We'll likely need to hide this behind an option.
The emfloat/fpemulation benchmark hits a pathological case of cmovcc/setcc; fixing it tripled its perf.
Added block sorting in the frontend to avoid out-of-order jumps in the backend.
FEX
--------------------:------------------:-------------:------------
NUMERIC SORT : 581.03 : 14.90 : 4.89
STRING SORT : 140.48 : 62.77 : 9.72
BITFIELD : 5.1261e+08 : 87.93 : 18.37
FP EMULATION : 173.49 : 83.25 : 19.21
FOURIER : 15360 : 17.47 : 9.81
ASSIGNMENT : 26.391 : 100.42 : 26.05
IDEA : 2837.5 : 43.40 : 12.89
HUFFMAN : 1348.7 : 37.40 : 11.94
NEURAL NET : 17.319 : 27.82 : 11.70
LU DECOMPOSITION : 554.5 : 28.73 : 20.74
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 52.613
FLOATING-POINT INDEX: 24.078
qemu
--------------------:------------------:-------------:------------
NUMERIC SORT : 498.19 : 12.78 : 4.20
STRING SORT : 128.88 : 57.59 : 8.91
BITFIELD : 2.9591e+08 : 50.76 : 10.60
FP EMULATION : 193.78 : 92.99 : 21.46
FOURIER : 3620 : 4.12 : 2.31
ASSIGNMENT : 16.914 : 64.36 : 16.69
IDEA : 2253.9 : 34.47 : 10.24
HUFFMAN : 1074.7 : 29.80 : 9.52
NEURAL NET : 3.8008 : 6.11 : 2.57
LU DECOMPOSITION : 128.52 : 6.66 : 4.81
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 41.976
FLOATING-POINT INDEX: 5.511
native
--------------------:------------------:-------------:------------
NUMERIC SORT : 1555.8 : 39.90 : 13.10
STRING SORT : 460.91 : 205.95 : 31.88
BITFIELD : 5.7356e+08 : 98.39 : 20.55
FP EMULATION : 660.29 : 316.84 : 73.11
FOURIER : 89283 : 101.54 : 57.03
ASSIGNMENT : 59.728 : 227.28 : 58.95
IDEA : 10575 : 161.74 : 48.02
HUFFMAN : 3933.6 : 109.08 : 34.83
NEURAL NET : 79.738 : 128.09 : 53.88
LU DECOMPOSITION : 2179.2 : 112.89 : 81.52
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 139.480
FLOATING-POINT INDEX: 113.656
Add --enable-unsafe-pass=<pass name> to allow unsafe optimization passes to be enabled from the command line. Allow multiple of these (look at how -E is implemented) and pass them to FEXCore for the PassManager to pick up and conditionally enable. This will be important for per-game optimization passes.
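A minimal sketch of how a repeatable option like this could be collected, assuming plain argv scanning; `ParseUnsafePasses` and the pass names used below are illustrative, not FEX's actual option-handling API.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Collect every --enable-unsafe-pass=<pass name> occurrence into a list,
// the way a repeatable option (like -E) accumulates values; the resulting
// list would then be handed to FEXCore for the PassManager to enable the
// named passes conditionally.
std::vector<std::string> ParseUnsafePasses(int argc, const char* argv[]) {
  constexpr char Prefix[] = "--enable-unsafe-pass=";
  constexpr std::size_t PrefixLen = sizeof(Prefix) - 1;

  std::vector<std::string> Passes;
  for (int i = 1; i < argc; ++i) {
    if (std::strncmp(argv[i], Prefix, PrefixLen) == 0) {
      Passes.emplace_back(argv[i] + PrefixLen);  // keep everything after '='
    }
  }
  return Passes;
}
```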
skmp/optihacks-4 with the first SRA impl (skmp/optihacks-5)
--------------------:------------------:-------------:------------
NUMERIC SORT : 793.81 : 20.36 : 6.69
STRING SORT : 168.68 : 75.37 : 11.67
BITFIELD : 4.6885e+08 : 80.42 : 16.80
FP EMULATION : 217.03 : 104.14 : 24.03
FOURIER : 16576 : 18.85 : 10.59
ASSIGNMENT : 31.49 : 119.83 : 31.08
IDEA : 3568.4 : 54.58 : 16.20
HUFFMAN : 1681.4 : 46.62 : 14.89
NEURAL NET : 19.84 : 31.87 : 13.41
LU DECOMPOSITION : 555.38 : 28.77 : 20.78
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 62.953
FLOATING-POINT INDEX: 25.856
SRA + some mov elim + properly cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1107.1 : 28.39 : 9.32
STRING SORT : 191.97 : 85.78 : 13.28
BITFIELD : 5.8174e+08 : 99.79 : 20.84
FP EMULATION : 281.38 : 135.02 : 31.16
FOURIER : 18921 : 21.52 : 12.09
ASSIGNMENT : 43.512 : 165.57 : 42.95
IDEA : 3967.7 : 60.68 : 18.02
HUFFMAN : 1810.1 : 50.20 : 16.03
NEURAL NET : 25.056 : 40.25 : 16.93
LU DECOMPOSITION : 673.78 : 34.91 : 25.20
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 77.339
FLOATING-POINT INDEX: 31.151
SRA + full width mov elim + cooled laptop
--------------------:------------------:-------------:------------
NUMERIC SORT : 1205.9 : 30.93 : 10.16
STRING SORT : 208.36 : 93.10 : 14.41
BITFIELD : 5.8431e+08 : 100.23 : 20.94
FP EMULATION : 300.39 : 144.14 : 33.26
FOURIER : 19606 : 22.30 : 12.52
ASSIGNMENT : 45.535 : 173.27 : 44.94
IDEA : 4304.7 : 65.84 : 19.55
HUFFMAN : 1947.8 : 54.01 : 17.25
NEURAL NET : 25.579 : 41.09 : 17.28
LU DECOMPOSITION : 671.91 : 34.81 : 25.14
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 82.326
FLOATING-POINT INDEX: 31.711
libpng decode main-loop translation example -- codegen is starting to look quite optimal in some cases:
```
str x4, [x24, #48]
mov x4, #0x3f51 // #16209
ldr x5, [x28, #88]
str w4, [x5, #8]
b 0xffefde33a2e8
ldur w4, [x26, #-60]
sub x21, x27, x4
b 0xffefde33a9a8
mov x27, x20
mov x21, x25
ldrb w4, [x21]
ldr w5, [x28, #96]
sub w5, w5, #0x3
str x5, [x28, #96]
add x25, x21, #0x3
strb w4, [x27]
ldrb w4, [x21, #1]
strb w4, [x27, #1]
ldrb w4, [x21, #2]
str x4, [x28, #104]
add x20, x27, #0x3
sturb w4, [x20, #-1]
cmp w5, #0x2
b.hi 0xffefde33a9a0 // b.pmore
ldr w4, [x28, #96]
cbz w4, 0xffefde33aa70
ldrb w20, [x21, #3]
ldr w4, [x28, #96]
strb w20, [x27, #3]
cmp w4, #0x2
b.ne 0xffefde33ae64 // b.any
ldrb w21, [x21, #4]
```
With experimental SRA16 (uses 6 caller-saved regs)
--------------------:------------------:-------------:------------
NUMERIC SORT : 1342.4 : 34.43 : 11.31
STRING SORT : 205.16 : 91.67 : 14.19
BITFIELD : 5.8536e+08 : 100.41 : 20.97
FP EMULATION : 289.51 : 138.92 : 32.06
FOURIER : 18247 : 20.75 : 11.66
ASSIGNMENT : 55.266 : 210.30 : 54.55
IDEA : 4801 : 73.43 : 21.80
HUFFMAN : 3047.4 : 84.51 : 26.99
NEURAL NET : 25.602 : 41.13 : 17.30
LU DECOMPOSITION : 674.25 : 34.93 : 25.22
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 92.386
FLOATING-POINT INDEX: 31.006
The slight drops in some cases are probably due to register pressure, with only 9 temps available. I'll investigate further after getting it to pass FTL + UT2004.
With SRA16+16 and some frontend improvements
--------------------:------------------:-------------:------------
NUMERIC SORT : 1387.8 : 35.59 : 11.69
STRING SORT : 264.09 : 118.00 : 18.26
BITFIELD : 5.8385e+08 : 100.15 : 20.92
FP EMULATION : 292.71 : 140.45 : 32.41
FOURIER : 32888 : 37.40 : 21.01
ASSIGNMENT : 59.544 : 226.58 : 58.77
IDEA : 4829 : 73.86 : 21.93
HUFFMAN : 3054.2 : 84.69 : 27.05
NEURAL NET : 30.42 : 48.87 : 20.56
LU DECOMPOSITION : 1044.3 : 54.10 : 39.07
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 97.495
FLOATING-POINT INDEX: 46.241
With skmp/optihacks-6
--------------------:------------------:-------------:------------
NUMERIC SORT : 1359.2 : 34.86 : 11.45
STRING SORT : 305.83 : 136.65 : 21.15
BITFIELD : 5.8103e+08 : 99.67 : 20.82
FP EMULATION : 304.13 : 145.93 : 33.67
FOURIER : 37772 : 42.96 : 24.13
ASSIGNMENT : 57.844 : 220.11 : 57.09
IDEA : 5539.5 : 84.72 : 25.16
HUFFMAN : 3324.6 : 92.19 : 29.44
NEURAL NET : 53.144 : 85.37 : 35.91
LU DECOMPOSITION : 1978.6 : 102.50 : 74.01
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 102.529
FLOATING-POINT INDEX: 72.167
With skmp/optihacks-6
--------------------:------------------:-------------:------------
NUMERIC SORT : 1407.5 : 36.10 : 11.85
STRING SORT : 358.54 : 160.20 : 24.80
BITFIELD : 5.8023e+08 : 99.53 : 20.79
FP EMULATION : 304.38 : 146.05 : 33.70
FOURIER : 42118 : 47.90 : 26.90
ASSIGNMENT : 58.167 : 221.34 : 57.41
IDEA : 5654.9 : 86.49 : 25.68
HUFFMAN : 3339.4 : 92.60 : 29.57
NEURAL NET : 58.697 : 94.29 : 39.66
LU DECOMPOSITION : 2153 : 111.54 : 80.54
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 105.863
FLOATING-POINT INDEX: 79.565
skmp/optihacks-6, 2nd merge wave
Trying out GitHub Projects to keep track of this: https://github.com/FEX-Emu/FEX/projects/2
<<ByteMark to 80% native>> Roadmap
Bytemark performance tracking: Graph
Based on the systemic profiling/perf work of the past 2 weeks, plus skmp/optihacks-1, skmp/optihacks-2 and skmp/optihacks-3, plus some analysis of ByteMark today, I think the following is a good game plan for this goal:

Cleanup changes in skmp/optihacks-3

Lightweight guest branching & dispatch
Targets: Switch tables, indirect functions
Reduce Lookup overhead
The current paged lookup + alias check + validation check is clearly not optimal.
I suggest a 2-layer approach: the first layer is a cache, the second a tree / paged tree. For the first layer, I'd use a 24-bit lookup + alias check + lazily allocated LUT.
Basic Structure
Indirect Code Lookup
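The 2-layer lookup above can be sketched as follows, under assumptions: the first layer is a direct-mapped cache indexed by the low bits of the guest RIP with the full RIP stored as a tag (the "alias check"), and a plain hash map stands in for the tree / paged tree. `BlockCache` and its members are illustrative names, not FEX's actual structures.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BlockCache {
  static constexpr uint64_t L1Bits = 16;  // 24 in the proposal; small here
  static constexpr uint64_t L1Size = 1ull << L1Bits;

  struct Entry {
    uint64_t GuestRIP = ~0ull;  // tag; ~0 marks an empty slot
    void* HostCode = nullptr;
  };

  std::vector<Entry> L1 = std::vector<Entry>(L1Size);  // lazily allocated in practice
  std::unordered_map<uint64_t, void*> L2;              // stand-in for the paged tree

  void* Lookup(uint64_t RIP) {
    Entry& E = L1[RIP & (L1Size - 1)];
    if (E.GuestRIP == RIP) {
      return E.HostCode;  // fast path: tag matched, no tree walk
    }
    auto It = L2.find(RIP);  // slow path: full lookup
    if (It == L2.end()) {
      return nullptr;  // block not compiled yet
    }
    E = {RIP, It->second};  // refill the L1 slot for next time
    return It->second;
  }

  void Insert(uint64_t RIP, void* HostCode) {
    L2[RIP] = HostCode;
    L1[RIP & (L1Size - 1)] = {RIP, HostCode};
  }
};
```

Two RIPs that differ only above the index bits alias to the same L1 slot; the stored full RIP is what catches that and forces the slow path.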
Block linking
Blocks that are statically mapped should link to each other. I propose using indirect branches to implement this, with the branch vectors allocated near the block.
Basic Structure
Block ending for blocks that exit with CALL_DIRECT, JUMP_DIRECT
Block ending for blocks that exit with RET, CALL_INDIRECT, JUMP_INDIRECT
PC-recovery
To reduce overhead, no validation is done on the DIRECT forms, so the default case needs to handle that. We can recover the block from the return address (that's why BLR is needed), and then link the block.
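A conceptual sketch of the linking scheme: a DIRECT exit branches indirectly through a "link slot" allocated near the block, which starts out pointing at the dispatcher; linking is then a single pointer store. `DirectExit`, `Dispatcher`, and `TargetBlock` are illustrative names standing in for real codegen, not FEX's implementation.

```cpp
#include <cstdint>

using BlockFn = int (*)();

int Dispatcher() { return 0; }    // slow path: look up / compile the target
int TargetBlock() { return 42; }  // stand-in for the target's host code

struct DirectExit {
  uint64_t TargetRIP;             // static guest target of CALL_DIRECT / JUMP_DIRECT
  BlockFn LinkSlot = Dispatcher;  // what the exit actually branches through
};

// Linking just patches the slot; the exit's own code never changes.
void Link(DirectExit& Exit, BlockFn Target) {
  Exit.LinkSlot = Target;
}
```

Until the target is compiled, every execution of the exit falls into the dispatcher; afterwards it goes straight to the target block through the same indirect branch.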
Block link metadata
We need to keep lists of which blocks link to which, for block invalidation.
Static Register Allocation
Statically allocate 8 or 16 GPRs, and do RA for SSA values on the remaining registers. Make sure to support "lifetime sharing": an SSA value shares the host register with a guest reg for as long as it is valid, with movs generated as needed.
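A sketch of the static/dynamic split under invented assumptions: each of the 16 guest GPRs is pinned to a fixed host register (x8..x23 here), SSA temporaries get RA'd on what remains, and "lifetime sharing" means an SSA value loaded from a guest reg keeps living in that reg's pinned register until the guest reg is written. The concrete mapping and all names are illustrative, not FEX's actual layout.

```cpp
#include <optional>

constexpr int FirstStaticHostReg = 8;  // pretend x8..x23 hold RAX..R15

int StaticHostReg(int GuestGPR) {
  return FirstStaticHostReg + GuestGPR;
}

struct SSAValue {
  std::optional<int> SharedGuestGPR;  // set while sharing the pinned register
  int TempReg = -1;                   // temp it moves to once evicted
};

// On a write to a guest GPR, any SSA value sharing its pinned register must
// be evicted first; this is where the "movs as needed" get generated.
void OnGuestWrite(SSAValue& V, int GuestGPR, int FreeTemp) {
  if (V.SharedGuestGPR == GuestGPR) {
    V.TempReg = FreeTemp;  // emit: mov x<FreeTemp>, x<StaticHostReg(GuestGPR)>
    V.SharedGuestGPR.reset();
  }
}
```

The win is that in the common case (the SSA value dies before the guest reg is written) no mov is ever emitted.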
Multiblock
Multiple Entry Points
Right now, only the main entry point is exported to the cache. Big blocks that call other blocks should export secondary entry points at the expected return points, to avoid multiple partial compilations of the same function.
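The idea above can be sketched by letting the cache map a guest RIP to a (block, host-code offset) pair, so a mid-block RIP that was exported as an expected return point hits the existing code instead of triggering a fresh partial compilation. `CompiledBlock` and `EntryPointCache` are illustrative names.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

struct CompiledBlock {
  uint64_t EntryRIP;  // main entry; host code pointer/size elided
};

struct EntryPointCache {
  // guest RIP -> (owning block, offset of that entry inside its host code)
  std::unordered_map<uint64_t, std::pair<CompiledBlock*, uint64_t>> Entries;

  void Export(CompiledBlock* Block, uint64_t GuestRIP, uint64_t HostOffset) {
    Entries.emplace(GuestRIP, std::make_pair(Block, HostOffset));
  }

  // Returns the block covering this RIP, or nullptr if nothing is exported.
  CompiledBlock* Find(uint64_t GuestRIP) {
    auto It = Entries.find(GuestRIP);
    return It == Entries.end() ? nullptr : It->second.first;
  }
};
```

When compiling a big block, the compiler would call `Export` once for the main entry and once per expected return point.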
PHI nodes (possibly not needed to meet goals)
We need the RA to support PHI nodes
MB-DCLSE (possibly not needed to meet goals)
We need Dead Context Load Store Elimination to generate PHI nodes.
Address important pathological code gen
Shuffles are one example, and there may be a few more important cases for ByteMark.
@Sonicadvance1 @phire thoughts?