This is a micro-optimization, at the x86_64 assembly level, of the instruction fetch+decode hot path. In summary, this PR saves about 22 x86_64 instructions from every interpreter hot loop iteration. The optimization is not specific to x86_64; all architectures should benefit from it.
Baseline
First I generated a hot trace of subsequent FENCE.I instruction calls. I chose this instruction because it is the simplest one: it basically does nothing, which makes it ideal for measuring instruction fetch overhead. This was the trace for one iteration:
This trace keeps looping in x86_64. We can see that in optimal conditions it takes exactly 40 x86_64 instructions to execute one FENCE.I in this trace, where:
mcycle check: 3 instructions
fetch: 11 instructions
decoding: 24 instructions
execution: 2 instructions
I usually say that the Cartesi machine is about 30~40 times slower than native, and the 40:1 ratio in this trace is very close to that. If we can get this trace to execute with fewer x86_64 instructions, we can make the Cartesi machine interpreter faster for all instructions (not only this one).
If we look closely at the fetch, there are these two branches:
│ 0x7ffff7b8834a <interpret_loop+458> jne 0x7ffff7b88728 │ -> miss fetch cache
│ 0x7ffff7b88361 <interpret_loop+481> je 0x7ffff7b88760 │ -> cross page boundary
My idea was to come up with a single branch that could test both conditions in the fetch loop, saving some instructions.
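As an illustration of the single-branch idea (a minimal sketch with hypothetical names like `fetch_base`, not the emulator's actual code): when the cached fetch region is page-aligned, an unsigned subtraction lets one compare cover both the cache-miss case and the page-boundary case at once.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical sketch: fetch_base is the page-aligned start of the last
 * validated fetch page. The fast path is safe when pc is inside that page
 * AND at least 4 bytes remain before the page end, so a 32-bit instruction
 * cannot cross the boundary. */

/* Baseline shape: two compares, two conditional branches. */
static int fetch_fast_two_branches(uint64_t pc, uint64_t fetch_base) {
    if ((pc & ~(uint64_t)(PAGE_SIZE - 1)) != fetch_base)
        return 0; /* fetch cache miss */
    if ((pc & (PAGE_SIZE - 1)) > PAGE_SIZE - 4)
        return 0; /* instruction may cross the page boundary */
    return 1;
}

/* Fused shape: one unsigned subtraction and a single compare+branch.
 * pc - fetch_base wraps around to a huge value when pc is below the cached
 * page, so one unsigned <= covers both conditions. */
static int fetch_fast_one_branch(uint64_t pc, uint64_t fetch_base) {
    return pc - fetch_base <= PAGE_SIZE - 4;
}
```

The two functions are equivalent whenever `fetch_base` is page-aligned, but the fused form compiles to a single conditional branch.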
This is the baseline benchmark:
$ lua bench-insns.lua
RISC-V Privileged Memory-management 4.205 MIPS 3250.0 ucycles
RISC-V Privileged Interrupt-management 495.199 MIPS 257.0 ucycles
RV64I - Base integer instruction set 574.503 MIPS 263.8 ucycles
RV64M - Integer multiplication and division 575.371 MIPS 312.2 ucycles
RV64A - Atomic instructions 431.893 MIPS 304.2 ucycles
RV64F - Single-precision floating-point 218.556 MIPS 489.3 ucycles
RV64D - Double-precision floating-point 214.964 MIPS 1773.7 ucycles
RV64Zicsr - Control and status registers 289.061 MIPS 328.3 ucycles
RV64Zicntr - Base counters and timers 343.063 MIPS 284.3 ucycles
RV64Zifence - Instruction fetch fence 792.319 MIPS 246.0 ucycles
$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
Time (mean ± σ): 1.333 s ± 0.007 s [User: 1.326 s, System: 0.006 s]
Range (min … max): 1.321 s … 1.356 s 32 runs
Round 1 - Optimize fetch
After some thinking, I came up with the changes presented in this PR to optimize instruction fetch, which generate the following new trace:
We can see that in optimal conditions it takes exactly 34 x86_64 instructions to execute one FENCE.I in this trace, where:
mcycle check: 3 instructions (same as before)
fetch: 5 instructions (-6 instructions)
decoding: 24 instructions (same as before)
execution: 2 instructions (same as before)
So, in summary, 6 instructions were optimized out of the very hot path. These are the new benchmark numbers:
$ lua bench-insns.lua
RISC-V Privileged Memory-management 4.227 MIPS 3244.0 ucycles
RISC-V Privileged Interrupt-management 509.389 MIPS 257.0 ucycles
RV64I - Base integer instruction set 611.176 MIPS 264.3 ucycles
RV64M - Integer multiplication and division 606.949 MIPS 312.9 ucycles
RV64A - Atomic instructions 449.197 MIPS 304.6 ucycles
RV64F - Single-precision floating-point 227.302 MIPS 489.7 ucycles
RV64D - Double-precision floating-point 225.613 MIPS 1774.1 ucycles
RV64Zicsr - Control and status registers 300.756 MIPS 329.0 ucycles
RV64Zicntr - Base counters and timers 354.946 MIPS 283.3 ucycles
RV64Zifence - Instruction fetch fence 847.884 MIPS 246.0 ucycles
$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
Time (mean ± σ): 1.269 s ± 0.011 s [User: 1.261 s, System: 0.007 s]
Range (min … max): 1.257 s … 1.317 s 32 runs
We can see improvements in all benchmarks:
Dhrystone now takes about 1.269 / 1.333 ≈ 95% of the baseline time, a ~5% speedup.
611.176 / 574.503 ≈ 1.06, a ~6% improvement in instruction execution speed for the RV64I instruction set.
Round 2 - Optimize decoding for uncompressed instruction
Decoding uses 24 of the 34 instructions in the trace, which is about 70%! It dominates the hot loop trace; imagine if we could cut it in half, maybe with jump tables.
EDIT: I decided to give it a try, restructuring the decoding code so that the GCC compiler can optimize it into jump tables. After some thinking and research I added a new commit to this PR, and this is the new trace:
We can see that it takes exactly 27 x86_64 instructions to execute one FENCE.I in this trace! This does add one memory indirection to look up the jump table, but that is fine: the lookup will most likely hit the L1 CPU cache.
These are the new benchmark numbers:
Whoa, that is:
The FENCE.I instruction is +86% faster.
RV64I instructions are +35% faster.
Also, some instructions now run at over 1000 MIPS!
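To illustrate the jump-table idea (a hedged sketch, not the emulator's real decoder): when a switch covers a small, dense integer range with most cases populated, GCC can compile it into a single table-indexed indirect jump instead of a compare-and-branch tree. For uncompressed RISC-V instructions the low two bits are always 11, so bits [6:2] form a dense 5-bit index:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified decoder: switching on the dense 5-bit field
 * insn[6:2] (32 possible values) encourages GCC to emit one jump table. */
enum insn_kind { KIND_LOAD, KIND_FENCE, KIND_OP_IMM, KIND_AUIPC,
                 KIND_STORE, KIND_OP, KIND_BRANCH, KIND_ILLEGAL };

static enum insn_kind decode_major(uint32_t insn) {
    switch ((insn >> 2) & 0x1f) { /* RISC-V major opcode bits [6:2] */
        case 0x00: return KIND_LOAD;
        case 0x03: return KIND_FENCE;  /* MISC-MEM: FENCE / FENCE.I */
        case 0x04: return KIND_OP_IMM;
        case 0x05: return KIND_AUIPC;
        case 0x08: return KIND_STORE;
        case 0x0c: return KIND_OP;
        case 0x18: return KIND_BRANCH;
        default:   return KIND_ILLEGAL;
    }
}
```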
Round 3 - Single jump with computed gotos
Wasting 4 instructions every iteration just to check whether an instruction is compressed is not ideal; we could try to compile the compressed instruction switch and the uncompressed instruction switch into a single switch.
After trying a very large switch (2048 entries), GCC refused to emit a large jump table, so I built my own: a 2048-entry array generated by a Lua script, dispatched through GCC's computed goto. This is the new trace:
We can see that it takes exactly 18 x86_64 instructions to execute one FENCE.I in this trace!!! Where:
mcycle check: 3 instructions (same as before)
fetch: 5 instructions (-6 instructions from baseline)
decoding: 8 instructions (-16 instructions from baseline)
execution: 2 instructions (same as before)
So we went from 40 instructions at baseline to 18 instructions. This should improve performance for all instructions, because every instruction goes through fetch and decoding.
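A minimal sketch of the computed-goto pattern (a toy 3-opcode interpreter of my own, not the 2048-entry table from this PR): a static array of label addresses (`&&label`, a GNU C extension), indexed by the decoded opcode and dispatched with one indirect `goto`:

```c
#include <assert.h>
#include <stdint.h>

/* Toy interpreter: opcode 0 halts, 1 increments, 2 doubles. */
static int64_t run(const uint8_t *code, int64_t acc) {
    /* GNU C labels-as-values: a static table of handler addresses. */
    static const void *dispatch[] = {
        [0] = &&op_halt, [1] = &&op_inc, [2] = &&op_dbl,
    };
    const uint8_t *pc = code;
    goto *dispatch[*pc++];  /* first dispatch */
op_inc:
    acc += 1;
    goto *dispatch[*pc++];  /* the dispatch is replicated at the end of
                               every handler instead of a shared loop,
                               giving the branch predictor one indirect
                               jump site per handler */
op_dbl:
    acc *= 2;
    goto *dispatch[*pc++];
op_halt:
    return acc;
}
```

This compiles with GCC and Clang; a plain C `switch` in a loop adds a bounds check and a back-edge branch that this pattern avoids.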
Let's see the benchmarks:
$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management 4.267 MIPS 3203.0 ucycles
RISC-V Privileged Interrupt-management 968.451 MIPS 234.0 ucycles
RV64I - Base integer instruction set 895.744 MIPS 237.5 ucycles
RV64M - Integer multiplication and division 747.252 MIPS 286.2 ucycles
RV64A - Atomic instructions 567.218 MIPS 278.4 ucycles
RV64F - Single-precision floating-point 331.036 MIPS 445.3 ucycles
RV64D - Double-precision floating-point 342.071 MIPS 1273.0 ucycles
RV64Zicsr - Control and status registers 352.496 MIPS 303.0 ucycles
RV64Zicntr - Base counters and timers 503.979 MIPS 263.3 ucycles
RV64Zifence - Instruction fetch fence 1949.502 MIPS 218.0 ucycles
$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
Time (mean ± σ): 986.4 ms ± 26.5 ms [User: 979.7 ms, System: 7.1 ms]
Range (min … max): 968.3 ms … 1083.9 ms 32 runs
Whoa, that is:
The FENCE.I instruction is +146% faster.
RV64I instructions are +55% faster.
Also, many instructions now run at over 1000 MIPS!
arm64 trace
I also made a trace for this PR on arm64, this is it:
In short:
It looks like arm64 is more instruction-efficient than x86_64.