Xudong-Huang / generator-rs

rust stackful generator library
Apache License 2.0

Context switching performance #39

Closed (2dav closed 1 year ago)

2dav commented 1 year ago

Hey, while playing with coroutines and libfringe, I found that a vital part of context-switching performance lies in using pop+jmp instead of ret. This comment on HN sheds some light:

jmp rax (or any register really) uses the indirect jump prediction, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call, so it will be mispredicted many times.
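
To make the quoted explanation concrete, here is a toy model in Rust. It is only a sketch under loud assumptions: real predictors are far more elaborate, the return stack here is unbounded, the indirect predictor is keyed on just the previous target, and the names `ReturnStack`, `IndirectPredictor`, `mispredictions` and both resume addresses are hypothetical.

```rust
use std::collections::HashMap;

/// Toy return stack buffer: `call` pushes a return address,
/// `ret` pops one and uses it as the predicted target.
#[derive(Default)]
struct ReturnStack {
    stack: Vec<u64>,
}

impl ReturnStack {
    fn call(&mut self, return_addr: u64) {
        self.stack.push(return_addr);
    }

    /// Returns true when the popped prediction matches the real target.
    fn ret(&mut self, actual: u64) -> bool {
        self.stack.pop() == Some(actual)
    }
}

/// Toy indirect-branch predictor for a single jump site,
/// keyed by the previous target taken at that site (assumption).
#[derive(Default)]
struct IndirectPredictor {
    last_target: Option<u64>,
    table: HashMap<Option<u64>, u64>,
}

impl IndirectPredictor {
    /// Returns true when the remembered target matches the real one, then learns it.
    fn jmp(&mut self, actual: u64) -> bool {
        let hit = self.table.get(&self.last_target) == Some(&actual);
        self.table.insert(self.last_target, actual);
        self.last_target = Some(actual);
        hit
    }
}

/// Counts mispredictions over `switches` alternating context switches.
fn mispredictions(switches: u32, use_ret: bool) -> u32 {
    const RESUME_MAIN: u64 = 0x1000; // hypothetical resume address in the caller
    const RESUME_CORO: u64 = 0x2000; // hypothetical resume address in the coroutine
    let mut rsb = ReturnStack::default();
    let mut indirect = IndirectPredictor::default();
    let mut misses = 0;
    for i in 0..switches {
        // Even steps switch into the coroutine, odd steps yield back.
        let (from, to) = if i % 2 == 0 {
            (RESUME_MAIN, RESUME_CORO)
        } else {
            (RESUME_CORO, RESUME_MAIN)
        };
        // The `call` into the switch routine records where the caller would resume.
        rsb.call(from);
        // The switch routine then leaves through `ret` or through `pop` + `jmp`,
        // landing in the other context.
        let hit = if use_ret { rsb.ret(to) } else { indirect.jmp(to) };
        if !hit {
            misses += 1;
        }
    }
    misses
}

fn main() {
    println!("ret:     {} mispredictions", mispredictions(1000, true));
    println!("pop+jmp: {} mispredictions", mispredictions(1000, false));
}
```

In this model every ret is mispredicted, because the predicted address is always the caller's own resume point, while the indirect jump misses only three times during warm-up and then follows the alternating pattern.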

This still holds on modern CPUs, Zen 3 at least. Changing two lines on ~master: https://github.com/Xudong-Huang/generator-rs/blob/5888dac74f498a3c176f7b23c2260822a3928f48/src/detail/asm/asm_x86_64_sysv_elf_gas.S#L43

pop %rax
jmp *%rax

x86_64 zen3

name                     before ns/iter  after ns/iter  diff ns/iter  diff %   speedup
scoped_yield_bench       22              9              -13           -59.09%  x 2.44
single_yield_bench       25              10             -15           -60.00%  x 2.50
single_yield_with_bench  22              9              -13           -59.09%  x 2.44
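
For anyone re-running the benches, the diff and speedup columns in these results follow directly from the two measured ns/iter values. A minimal Rust sketch of that arithmetic (`derived` is a hypothetical helper name, not part of generator-rs):

```rust
/// Derives (diff ns/iter, diff %, speedup) from ns/iter before and after the change.
fn derived(before: f64, after: f64) -> (f64, f64, f64) {
    let diff = after - before;            // negative means faster
    let diff_pct = diff / before * 100.0; // relative change in percent
    let speedup = before / after;         // how many times faster
    (diff, diff_pct, speedup)
}

fn main() {
    // The x86_64 Zen 3 measurements reported in this issue.
    let rows = [
        ("scoped_yield_bench", 22.0, 9.0),
        ("single_yield_bench", 25.0, 10.0),
        ("single_yield_with_bench", 22.0, 9.0),
    ];
    for (name, before, after) in rows {
        let (diff, pct, speedup) = derived(before, after);
        println!("{name:<24} {diff:>4} {pct:>7.2}% x {speedup:.2}");
    }
}
```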

perf output for one of the benches

metric                   before   after   unit
page-faults              779.883  6457    /sec
stalled-cycles-frontend  12.93%   0.08%   frontend cycles idle
stalled-cycles-backend   1.30%    48.56%  backend cycles idle
instructions             1.32     3.01    insn per cycle
branches                 1.247    2.754   G/sec
branch-misses            6.61%    0.03%   of all branches

I don't have other hardware at hand right now, but I can test this on a MacBook M1 this week.

Xudong-Huang commented 1 year ago

Great! Thanks for the findings!

Xudong-Huang commented 1 year ago

I don't have an aarch64 platform, so I may need help from other people. Something like the below could replace the ret instruction:

ldp x2, x1, [sp], #16
br  x1

2dav commented 1 year ago

Apple M1 2020

name                     before ns/iter  after ns/iter  diff ns/iter  diff %   speedup
scoped_yield_bench       22              11             -11           -50.00%  x 2.00
single_yield_bench       23              12             -11           -47.83%  x 1.92
single_yield_with_bench  23              11             -12           -52.17%  x 2.09

I'm not familiar with ARM assembly, but from some quick googling it seems that ret on ARM doesn't pop the return address off the stack. Instead it reads it from the LR (x30) register, which is already populated at the return point, so the required change is:
https://github.com/Xudong-Huang/generator-rs/blob/4fcd32462842e748a10640e2437f5d0d0a275e70/src/detail/asm/asm_aarch64_aapcs_macho_gas.S#L51

br x30

This passes all of the tests and the bench workload.