cartesi / machine-emulator

The off-chain implementation of the Cartesi Machine
GNU Lesser General Public License v3.0

Optimize instruction fetch and decoding #226

Open edubart opened 3 months ago


This is a micro-optimization, at the x86_64 assembly level, of the instruction fetch+decode hot path. In summary, this PR saves about 22 x86_64 instructions in every interpreter hot loop iteration. The optimization is not specific to x86_64; all architectures should benefit from it.

Baseline

First I generated a hot trace of subsequent FENCE.I instruction calls. I chose this instruction because it is the simplest one: it basically does nothing, so it is ideal for measuring instruction fetch overhead. This was the trace for one iteration:

// mcycle check
| 0x7ffff7b88330 <interpret_loop+432>    add    $0x1,%r14                │ ++mcycle
│ 0x7ffff7b88334 <interpret_loop+436>    cmp    %r11,%r14                │ mcycle < mcycle_tick_end
│ 0x7ffff7b88337 <interpret_loop+439>    jae    0x7ffff7b88420           │ -> break interpret hot loop
// fetch
│ 0x7ffff7b8833d <interpret_loop+445>    mov    %r15,%rbx                │ pc
│ 0x7ffff7b88340 <interpret_loop+448>    and    $0xfffffffffffff000,%rbx │ vaddr_page = pc & ~PAGE_OFFSET_MASK
│ 0x7ffff7b88347 <interpret_loop+455>    cmp    %r12,%rbx                │ vaddr_page == fetch_vaddr_page
│ 0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
│ 0x7ffff7b88350 <interpret_loop+464>    lea    0x0(%r13,%r15,1),%rax    │ hptr = pc + fetch_vh_offset
│ 0x7ffff7b88355 <interpret_loop+469>    mov    %r15,%rdx                │ pc
│ 0x7ffff7b88358 <interpret_loop+472>    not    %rdx                     │ ~pc
│ 0x7ffff7b8835b <interpret_loop+475>    test   $0xffe,%edx              │ ((~pc & PAGE_OFFSET_MASK) >> 1) == 0
│ 0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary
│ 0x7ffff7b88367 <interpret_loop+487>    mov    (%rax),%r9d              │ insn = *(uint32_t*)(hptr)
│ 0x7ffff7b8836a <interpret_loop+490>    mov    %rbx,%r12                │ fetch_vaddr_page = vaddr_page
// decoding: check if is a compressed instruction
│ 0x7ffff7b8836d <interpret_loop+493>    mov    %r9d,%eax                │ insn
│ 0x7ffff7b88370 <interpret_loop+496>    not    %eax                     │ ~insn
│ 0x7ffff7b88372 <interpret_loop+498>    test   $0x3,%al                 │ (~insn & 3) > 0
│ 0x7ffff7b88374 <interpret_loop+500>    jne    0x7ffff7b882b0           │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
│ 0x7ffff7b8837a <interpret_loop+506>    mov    %r9d,%eax                │
│ 0x7ffff7b8837d <interpret_loop+509>    and    $0x707f,%eax             │
│ 0x7ffff7b88382 <interpret_loop+514>    cmp    $0x3023,%eax             │
│ 0x7ffff7b88387 <interpret_loop+519>    je     0x7ffff7b8a868           │
│ 0x7ffff7b8838d <interpret_loop+525>    ja     0x7ffff7b88500           │
│ 0x7ffff7b88393 <interpret_loop+531>    cmp    $0x101b,%eax             |
│ 0x7ffff7b88398 <interpret_loop+536>    je     0x7ffff7b8a820           │
│ 0x7ffff7b8839e <interpret_loop+542>    ja     0x7ffff7b887d0           │
│ 0x7ffff7b883a4 <interpret_loop+548>    cmp    $0x3b,%eax               │
│ 0x7ffff7b883a7 <interpret_loop+551>    je     0x7ffff7b8a638           │
│ 0x7ffff7b883ad <interpret_loop+557>    ja     0x7ffff7b88b68           │
│ 0x7ffff7b88b68 <interpret_loop+2536>   cmp    $0x1003,%eax             │
│ 0x7ffff7b88b6d <interpret_loop+2541>   je     0x7ffff7b8a494           │
│ 0x7ffff7b88b73 <interpret_loop+2547>   ja     0x7ffff7b89450           │
| 0x7ffff7b89450 <interpret_loop+4816>   cmp    $0x1013,%eax             │
│ 0x7ffff7b89455 <interpret_loop+4821>   je     0x7ffff7b8a6d0           │
│ 0x7ffff7b8945b <interpret_loop+4827>   cmp    $0x1017,%eax             │
│ 0x7ffff7b89460 <interpret_loop+4832>   je     0x7ffff7b8a71c           │
│ 0x7ffff7b89466 <interpret_loop+4838>   cmp    $0x100f,%eax             │
│ 0x7ffff7b8946b <interpret_loop+4843>   jne    0x7ffff7b8ab45           │
// execute
│ 0x7ffff7b89471 <interpret_loop+4849>   add    $0x4,%r15                │ pc += 4
│ 0x7ffff7b89475 <interpret_loop+4853>   jmp    0x7ffff7b88330           │ -> jump to begin

This trace keeps looping in x86_64. We can see that under optimal conditions it takes exactly 40 x86_64 instructions to execute one FENCE.I in this trace, where:

I usually say that the Cartesi Machine is about 30~40 times slower than native, and the 40:1 ratio in this trace matches that estimate closely. If we can get this trace to execute in fewer x86_64 instructions, we can make the interpreter faster for all instructions (not only this one).

If we look closely at the fetch, there are these two branches:

│ 0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
│ 0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary

My idea was to come up with a single branch that tests both conditions, simplifying the fetch to just one branch and saving some instructions.

This is the benchmark for the baseline:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.205 MIPS   3250.0 ucycles
RISC-V Privileged Interrupt-management        495.199 MIPS    257.0 ucycles
RV64I - Base integer instruction set          574.503 MIPS    263.8 ucycles
RV64M - Integer multiplication and division   575.371 MIPS    312.2 ucycles
RV64A - Atomic instructions                   431.893 MIPS    304.2 ucycles
RV64F - Single-precision floating-point       218.556 MIPS    489.3 ucycles
RV64D - Double-precision floating-point       214.964 MIPS   1773.7 ucycles
RV64Zicsr - Control and status registers      289.061 MIPS    328.3 ucycles
RV64Zicntr - Base counters and timers         343.063 MIPS    284.3 ucycles
RV64Zifence - Instruction fetch fence         792.319 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.333 s ±  0.007 s    [User: 1.326 s, System: 0.006 s]
  Range (min … max):    1.321 s …  1.356 s    32 runs

Round 1 - Optimize fetch

After some thinking I came up with the changes presented in this PR to optimize instruction fetch, which generate the following new trace:

// mcycle check
|   0x7ffff7b873e0 <interpret_loop+432>    add    $0x1,%r15               │ ++mcycle
│   0x7ffff7b873e4 <interpret_loop+436>    cmp    %r10,%r15               │ mcycle < mcycle_tick_end
│   0x7ffff7b873e7 <interpret_loop+439>    jae    0x7ffff7b874b0          │ -> break interpret hot loop
// fetch
│   0x7ffff7b873ed <interpret_loop+445>    mov    %rbp,%rax               │ pc
│   0x7ffff7b873f0 <interpret_loop+448>    xor    %r14,%rax               │ pc ^ fetch_vaddr_page
│   0x7ffff7b873f3 <interpret_loop+451>    cmp    $0xffd,%rax             │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
│   0x7ffff7b873f9 <interpret_loop+457>    ja     0x7ffff7b877d8          │ -> miss fetch cache
│   0x7ffff7b873ff <interpret_loop+463>    mov    0x0(%rbp,%r13,1),%ebx   │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if is a compressed instruction
│   0x7ffff7b87404 <interpret_loop+468>    mov    %ebx,%eax               │ insn
│   0x7ffff7b87406 <interpret_loop+470>    not    %eax                    │ ~insn
│   0x7ffff7b87408 <interpret_loop+472>    test   $0x3,%al                │ (~insn & 3) > 0
│   0x7ffff7b8740a <interpret_loop+474>    jne    0x7ffff7b87360          │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
│   0x7ffff7b87410 <interpret_loop+480>    mov    %ebx,%eax               │
│   0x7ffff7b87412 <interpret_loop+482>    and    $0x707f,%eax            │
│   0x7ffff7b87417 <interpret_loop+487>    cmp    $0x3023,%eax            │
│   0x7ffff7b8741c <interpret_loop+492>    je     0x7ffff7b891e8          │
│   0x7ffff7b87422 <interpret_loop+498>    ja     0x7ffff7b87590          │
│   0x7ffff7b87428 <interpret_loop+504>    cmp    $0x101b,%eax            │
│   0x7ffff7b8742d <interpret_loop+509>    je     0x7ffff7b899f4          │
│   0x7ffff7b87433 <interpret_loop+515>    ja     0x7ffff7b87838          │
│   0x7ffff7b87439 <interpret_loop+521>    cmp    $0x3b,%eax              │
│   0x7ffff7b8743c <interpret_loop+524>    je     0x7ffff7b892b9          │
│   0x7ffff7b87442 <interpret_loop+530>    ja     0x7ffff7b87bc0          │
│   0x7ffff7b87bc0 <interpret_loop+2448>   cmp    $0x1003,%eax            │
│   0x7ffff7b87bc5 <interpret_loop+2453>   je     0x7ffff7b89030          │
│   0x7ffff7b87bcb <interpret_loop+2459>   ja     0x7ffff7b88680          │
│   0x7ffff7b88680 <interpret_loop+5200>   cmp    $0x1013,%eax            │
│   0x7ffff7b88685 <interpret_loop+5205>   je     0x7ffff7b897b0          │
│   0x7ffff7b8868b <interpret_loop+5211>   cmp    $0x1017,%eax            │
│   0x7ffff7b88690 <interpret_loop+5216>   je     0x7ffff7b89317          │
│   0x7ffff7b88696 <interpret_loop+5222>   cmp    $0x100f,%eax            │
│   0x7ffff7b8869b <interpret_loop+5227>   jne    0x7ffff7b89b7a          │
// execute
│   0x7ffff7b886a1 <interpret_loop+5233>   add    $0x4,%rbp               | pc += 4
│   0x7ffff7b886a5 <interpret_loop+5237>   jmp    0x7ffff7b873e0          │ -> jump to begin

We can see that in optimal conditions it takes exactly 34 x86_64 instructions to execute one FENCE.I in this trace, where:

So in summary, 6 instructions were optimized out of the very hot path. These are the new benchmark numbers:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.227 MIPS   3244.0 ucycles
RISC-V Privileged Interrupt-management        509.389 MIPS    257.0 ucycles
RV64I - Base integer instruction set          611.176 MIPS    264.3 ucycles
RV64M - Integer multiplication and division   606.949 MIPS    312.9 ucycles
RV64A - Atomic instructions                   449.197 MIPS    304.6 ucycles
RV64F - Single-precision floating-point       227.302 MIPS    489.7 ucycles
RV64D - Double-precision floating-point       225.613 MIPS   1774.1 ucycles
RV64Zicsr - Control and status registers      300.756 MIPS    329.0 ucycles
RV64Zicntr - Base counters and timers         354.946 MIPS    283.3 ucycles
RV64Zifence - Instruction fetch fence         847.884 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.269 s ±  0.011 s    [User: 1.261 s, System: 0.007 s]
  Range (min … max):    1.257 s …  1.317 s    32 runs

We can see improvements in all benchmarks, where:

Round 2 - Optimize decoding for uncompressed instruction

The decoding uses 24 of the 34 instructions in the trace, about 70%! It is dominating the hot loop trace; imagine if we could cut it in half. Maybe we can, with jump tables.

EDIT: I decided to try optimizing the decoding code so that the GCC compiler can turn it into jump tables. After some thinking and research I added a new commit to this PR, and this is the new trace:

// mcycle check
|   0x7ffff7b82990 <interpret_loop+560>  add    $0x1,%r14          │ ++mcycle
│   0x7ffff7b82994 <interpret_loop+564>  cmp    %r11,%r14          │ mcycle < mcycle_tick_end
│   0x7ffff7b82997 <interpret_loop+567>  jb     0x7ffff7b828a0     │ -> continue loop (jump to fetch)
// fetch
|   0x7ffff7b828a0 <interpret_loop+320>  mov    %r15,%rax          │ pc
│   0x7ffff7b828a3 <interpret_loop+323>  xor    %r13,%rax          │ pc ^ fetch_vaddr_page
│   0x7ffff7b828a6 <interpret_loop+326>  cmp    $0xffd,%rax        │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
│   0x7ffff7b828ac <interpret_loop+332>  ja     0x7ffff7b844f0     │ -> miss fetch cache
│   0x7ffff7b828b2 <interpret_loop+338>  mov    (%r15,%r12,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if is a compressed instruction
│   0x7ffff7b828b6 <interpret_loop+342>  mov    %ebx,%ecx          │ insn
│   0x7ffff7b828b8 <interpret_loop+344>  and    $0x3,%ecx          │ insn & 3
│   0x7ffff7b828bb <interpret_loop+347>  cmp    $0x3,%ecx          │ (insn & 3) == 3
│   0x7ffff7b828be <interpret_loop+350>  je     0x7ffff7b83100     │ -> decode uncompressed instruction
// decoding: decode fence.i uncompressed instruction
│   0x7ffff7b83100 <interpret_loop+2464> mov    %ebx,%eax          │ insn
│   0x7ffff7b83102 <interpret_loop+2466> mov    %ebx,%edx          │ insn
│   0x7ffff7b83104 <interpret_loop+2468> shr    $0x5,%eax          │ insn >> 5
│   0x7ffff7b83107 <interpret_loop+2471> and    $0x7f,%edx         │ insn & 0b1111111
│   0x7ffff7b8310a <interpret_loop+2474> and    $0x380,%eax        │ (insn >> 5) & 0b1110000000
│   0x7ffff7b8310f <interpret_loop+2479> or     %edx,%eax          │ ((insn >> 5) & 0b1110000000) | (insn & 0b1111111)
│   0x7ffff7b83111 <interpret_loop+2481> lea    -0x3(%rax),%edx    │ compute index into jump table
│   0x7ffff7b83114 <interpret_loop+2484> cmp    $0x3f0,%edx        │ check if index is valid
│   0x7ffff7b8311a <interpret_loop+2490> ja     0x7ffff7b83130     │ -> illegal instruction
│   0x7ffff7b8311c <interpret_loop+2492> lea    0x3b711(%rip),%rdi │ load jump base offset
│   0x7ffff7b83123 <interpret_loop+2499> movslq (%rdi,%rdx,4),%rdx │ load jump offset for given index
│   0x7ffff7b83127 <interpret_loop+2503> add    %rdi,%rdx          │ compute instruction jump address
│   0x7ffff7b8312a <interpret_loop+2506> jmp    *%rdx              │ -> jump to instruction
// execute
│   0x7ffff7b83590 <interpret_loop+3632> add    $0x4,%r15          │ pc += 4
│  >0x7ffff7b83594 <interpret_loop+3636> jmp    0x7ffff7b82990     │ -> jump to begin

We can see that it takes exactly 27 x86_64 instructions to execute one FENCE.I in this trace! Where:

However, this adds one memory indirection to look up the jump table. That is fine: the table is most likely resident in the L1 CPU cache.

These are the new benchmark numbers:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.215 MIPS   3234.0 ucycles
RISC-V Privileged Interrupt-management        835.057 MIPS    257.0 ucycles
RV64I - Base integer instruction set          780.803 MIPS    259.1 ucycles
RV64M - Integer multiplication and division   673.788 MIPS    306.8 ucycles
RV64A - Atomic instructions                   523.084 MIPS    298.6 ucycles
RV64F - Single-precision floating-point       347.813 MIPS    482.4 ucycles
RV64D - Double-precision floating-point       374.271 MIPS   1310.9 ucycles
RV64Zicsr - Control and status registers      321.237 MIPS    321.7 ucycles
RV64Zicntr - Base counters and timers         488.825 MIPS    279.3 ucycles
RV64Zifence - Instruction fetch fence        1475.150 MIPS    241.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.055 s ±  0.023 s    [User: 1.048 s, System: 0.006 s]
  Range (min … max):    1.041 s …  1.167 s    32 runs

Whoa that is:

Also, some instructions now run at over 1000 MIPS!

lui                                          1019.273 MIPS      248 ucycles
auipc                                        1014.344 MIPS      249 ucycles
beq                                          1009.463 MIPS      250 ucycles
bne                                          1021.258 MIPS      247 ucycles
blt                                          1021.258 MIPS      247 ucycles
bge                                          1018.283 MIPS      243 ucycles
bltu                                         1021.258 MIPS      244 ucycles
bgeu                                         1013.364 MIPS      248 ucycles

Round 3 - Single jump with computed gotos

Wasting 4 instructions every iteration just to check whether an instruction is compressed is not ideal; we could try to compile the compressed instruction switch and the uncompressed instruction switch into a single switch.

After I tried a very large switch (2048 entries), GCC refused to emit a large jump table, so I built my own manual one: a large 2048-entry array generated by a Lua script, dispatched with GCC's computed gotos. This is the new trace:

// mcycle check
│ 0x7ffff7b7e9d8 <interpret_loop+424>  add    $0x1,%r12             │ ++mcycle
│ 0x7ffff7b7e9dc <interpret_loop+428>  cmp    %r10,%r12             │ mcycle < mcycle_tick_end
│ 0x7ffff7b7e9df <interpret_loop+431>  jb     0x7ffff7b7e950        │ -> continue loop (jump to fetch)
// fetch
│ 0x7ffff7b7e950 <interpret_loop+288>  mov    %rbp,%rax             │ pc
│ 0x7ffff7b7e953 <interpret_loop+291>  xor    %r15,%rax             │ pc ^ fetch_vaddr_page
│ 0x7ffff7b7e956 <interpret_loop+294>  cmp    $0xffd,%rax           │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
│ 0x7ffff7b7e95c <interpret_loop+300>  ja     0x7ffff7b80d30        │ -> miss fetch cache
│ 0x7ffff7b7e962 <interpret_loop+306>  mov    0x0(%rbp,%r14,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
│ 0x7ffff7b7e967 <interpret_loop+311>  mov    %ebx,%eax             │ insn
│ 0x7ffff7b7e969 <interpret_loop+313>  mov    %ebx,%edx             │ insn
│ 0x7ffff7b7e96b <interpret_loop+315>  lea    0x622ae(%rip),%rdi    │ compute jump table pointer
│ 0x7ffff7b7e972 <interpret_loop+322>  shr    $0x5,%eax             │ insn >> 5
│ 0x7ffff7b7e975 <interpret_loop+325>  and    $0x7f,%edx            │ insn & 0b1111111
│ 0x7ffff7b7e978 <interpret_loop+328>  and    $0x780,%eax           │ (insn >> 5) & 0b11110000000
│ 0x7ffff7b7e97d <interpret_loop+333>  or     %edx,%eax             │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
│ 0x7ffff7b7e97f <interpret_loop+335>  jmp    *(%rdi,%rax,8)        │ -> jump to instruction
// execute
│ 0x7ffff7b7fe55 <interpret_loop+5669> add    $0x4,%rbp             │ pc += 4
│ 0x7ffff7b7fe59 <interpret_loop+5673> jmp    0x7ffff7b7e9d8        │ -> jump to begin

We can see that it takes exactly 18 x86_64 instructions to execute one FENCE.I in this trace!!! Where:

So we went from 40 instructions in the baseline to 18, and this should improve performance for all instructions, because every instruction goes through fetch and decoding.

Let's see the benchmarks:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.267 MIPS   3203.0 ucycles
RISC-V Privileged Interrupt-management        968.451 MIPS    234.0 ucycles
RV64I - Base integer instruction set          895.744 MIPS    237.5 ucycles
RV64M - Integer multiplication and division   747.252 MIPS    286.2 ucycles
RV64A - Atomic instructions                   567.218 MIPS    278.4 ucycles
RV64F - Single-precision floating-point       331.036 MIPS    445.3 ucycles
RV64D - Double-precision floating-point       342.071 MIPS   1273.0 ucycles
RV64Zicsr - Control and status registers      352.496 MIPS    303.0 ucycles
RV64Zicntr - Base counters and timers         503.979 MIPS    263.3 ucycles
RV64Zifence - Instruction fetch fence        1949.502 MIPS    218.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):     986.4 ms ±  26.5 ms    [User: 979.7 ms, System: 7.1 ms]
  Range (min … max):   968.3 ms … 1083.9 ms    32 runs

Whoa that is:

Also, many instructions now run at over 1000 MIPS:

lui                                          1314.326 MIPS      225 ucycles
auipc                                        1309.403 MIPS      226 ucycles
beq                                          1239.754 MIPS      228 ucycles
bne                                          1220.992 MIPS      228 ucycles
blt                                          1223.841 MIPS      228 ucycles
bge                                          1296.455 MIPS      228 ucycles
bltu                                         1306.142 MIPS      228 ucycles
bgeu                                         1311.040 MIPS      228 ucycles
addi                                         1155.101 MIPS      229 ucycles
addiw                                        1157.651 MIPS      229 ucycles
xori                                         1156.375 MIPS      229 ucycles
ori                                          1162.785 MIPS      229 ucycles
andi                                         1186.462 MIPS      229 ucycles
slli                                         1011.410 MIPS      231 ucycles
fence                                        1949.502 MIPS      218 ucycles
fence.i                                      1949.502 MIPS      218 ucycles

arm64 trace

I also made a trace for this PR on arm64, this is it:

// mcycle check
| 0xfffff7c1f860 <interpret_loop+352>  add  x24, x24, #0x1        │ ++mcycle
│ 0xfffff7c1f864 <interpret_loop+356>  cmp  x24, x26              │ mcycle < mcycle_tick_end
│ 0xfffff7c1f868 <interpret_loop+360>  b.cc 0xfffff7c1f818        │ -> continue loop (jump to fetch)
// fetch
│ 0xfffff7c1f818 <interpret_loop+280>  eor  x1, x20, x27          │ pc ^ fetch_vaddr_page
│ 0xfffff7c1f81c <interpret_loop+284>  cmp  x1, #0xffd            │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
│ 0xfffff7c1f820 <interpret_loop+288>  b.hi 0xfffff7c21474        │ -> miss fetch cache
│ 0xfffff7c1f824 <interpret_loop+292>  ldr  w19, [x20, x28]       │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
│ 0xfffff7c1f828 <interpret_loop+296>  and  w1, w19, #0x7f        │ insn & 0b1111111
│ 0xfffff7c1f82c <interpret_loop+300>  lsr  w3, w19, #5           │ insn >> 5
│ 0xfffff7c1f830 <interpret_loop+304>  and  w3, w3, #0x780        │ (insn >> 5) & 0b11110000000
│ 0xfffff7c1f834 <interpret_loop+308>  orr  w3, w3, w1            │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
│ 0xfffff7c1f838 <interpret_loop+312>  ldr  x0, [x23, x3, lsl #3] │ load jump target from jump table
│ 0xfffff7c1f83c <interpret_loop+316>  br   x0                    │ -> jump to instruction
// execute
│ 0xfffff7c20dfc <interpret_loop+5884> add  x20, x20, #0x4        │ pc += 4
│ 0xfffff7c20e00 <interpret_loop+5888> b    0xfffff7c1f860        │ -> jump to begin

In short:

Looks like arm64 is more instruction-efficient than x86_64 here.