Closed 2dav closed 1 year ago
Great! Thanks for the findings!
I don't have an aarch64 platform, so I may need help from other people.
It could be something like the following to replace the `ret` instruction:
ldp x2, x1, [sp], #16
br x1
Apple M1 2020

| name | before ns/iter | after ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| scoped_yield_bench | 22 | 11 | -11 | -50.00% | x 2.00 |
| single_yield_bench | 23 | 12 | -11 | -47.83% | x 1.92 |
| single_yield_with_bench | 23 | 11 | -12 | -52.17% | x 2.09 |
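
For anyone checking the numbers, here is a minimal sketch of how the derived columns follow from the two measured ones. The helper names `compare` and `format_row` are made up for illustration and are not part of generator-rs or of any benchmark tool; the sketch assumes `diff %` is relative to the "before" value and `speedup` is before divided by after.

```rust
/// Returns (diff in ns/iter, diff in percent of `before`, speedup factor).
/// Hypothetical helper, for illustration only.
pub fn compare(before_ns: u64, after_ns: u64) -> (i64, f64, f64) {
    let diff_ns = after_ns as i64 - before_ns as i64;
    let diff_pct = diff_ns as f64 / before_ns as f64 * 100.0;
    let speedup = before_ns as f64 / after_ns as f64;
    (diff_ns, diff_pct, speedup)
}

/// Formats one table row in the same layout as the table above.
pub fn format_row(name: &str, before_ns: u64, after_ns: u64) -> String {
    let (diff_ns, diff_pct, speedup) = compare(before_ns, after_ns);
    format!(
        "| {} | {} | {} | {} | {:.2}% | x {:.2} |",
        name, before_ns, after_ns, diff_ns, diff_pct, speedup
    )
}
```

Calling `format_row("single_yield_bench", 23, 12)` reproduces the second row of the table, so the reported percentages and speedups are consistent with the raw ns/iter values.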
I'm not familiar with ARM assembly, but from some quick googling it seems that `ret` on ARM doesn't pop the return address off the stack. Instead it reads it from the LR (x30) register, which is already populated at the return point, so the required change is:
https://github.com/Xudong-Huang/generator-rs/blob/4fcd32462842e748a10640e2437f5d0d0a275e70/src/detail/asm/asm_aarch64_aapcs_macho_gas.S#L51
br x30
This passes all of the tests and the bench workload.
Hey, while playing with coroutines and libfringe, I found that a vital part of context switching performance lies in `pop`+`jmp` vs `ret`. This comment on HN sheds some light. This is still a thing with modern CPUs, Zen 3 at least. Changing two lines on ~master: https://github.com/Xudong-Huang/generator-rs/blob/5888dac74f498a3c176f7b23c2260822a3928f48/src/detail/asm/asm_x86_64_sysv_elf_gas.S#L43
x86_64 zen3
`perf` output for one of the benches
I don't have other hardware at hand right now, but I can test this on a MacBook M1 this week.
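
To make the two replacements in this thread concrete, here is a rough sketch using Rust inline assembly. It is not code from generator-rs: the function name `return_without_ret` and the tiny routine it calls are invented for illustration. It assumes the x86_64 change amounts to swapping `ret` for a `pop` + `jmp` pair, as the post suggests. It only shows that control flow comes back to the same place without `ret`, and it says nothing about performance.

```rust
use std::arch::asm;

/// x86_64: return from a local routine with `pop` + `jmp` instead of `ret`.
/// Hypothetical demo function, not part of generator-rs.
#[cfg(target_arch = "x86_64")]
pub fn return_without_ret() -> u64 {
    let value: u64;
    unsafe {
        asm!(
            "call 2f",          // pushes the address of the next instruction
            "jmp 3f",           // resume point: skip over the routine body
            "2:",
            "mov {value}, 42",  // routine body
            "pop {target}",     // take the return address off the stack by hand
            "jmp {target}",     // and jump to it instead of executing `ret`
            "3:",
            value = out(reg) value,
            target = out(reg) _,
        );
    }
    value
}

/// aarch64: the return address lives in LR (x30), so there is nothing to pop;
/// a plain `br x30` takes the place of `ret`.
#[cfg(target_arch = "aarch64")]
pub fn return_without_ret() -> u64 {
    let value: u64;
    unsafe {
        asm!(
            "bl 2f",            // stores the return address in x30
            "b 3f",             // resume point: skip over the routine body
            "2:",
            "mov {value}, #42", // routine body
            "br x30",           // indirect branch to LR instead of `ret`
            "3:",
            value = out(reg) value,
            out("x30") _,       // x30 is overwritten by `bl`
        );
    }
    value
}

/// Other architectures: no assembly sketch, just return the same value.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn return_without_ret() -> u64 {
    42
}
```

On both architectures `return_without_ret()` yields 42. In the real context switch the jump target belongs to a different stack than the one the matching call was made on, which is the situation the thread is about.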