Context switching performance

2dav commented 1 year ago

Hey, while playing with coroutines and libfringe I found that a vital part of context-switching performance comes down to using pop+jmp instead of ret. This comment on HN sheds some light on why:

jmp rax (or any register really) uses the indirect jump predictor, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call, so it will be mispredicted many times.
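To make the quoted explanation concrete, here is a toy model in Rust, not a measurement and not how any real CPU is implemented. It assumes a 16-entry return stack and a very crude indirect predictor keyed on the previous target; the names `ReturnStack`, `IndirectPredictor`, `simulate` and the two addresses are hypothetical, invented for this sketch. It only models the prediction of the final branch in each switch and ignores that, with pop+jmp, the unmatched calls keep filling the return stack.

```rust
use std::collections::HashMap;

/// Toy return stack buffer (RSB): `call` pushes, `ret` pops a prediction.
struct ReturnStack {
    entries: Vec<u64>,
    capacity: usize, // assumed depth; real sizes differ per microarchitecture
}

impl ReturnStack {
    fn new(capacity: usize) -> Self {
        ReturnStack { entries: Vec::new(), capacity }
    }

    fn call(&mut self, return_addr: u64) {
        if self.entries.len() == self.capacity {
            self.entries.remove(0); // drop the oldest entry when full
        }
        self.entries.push(return_addr);
    }

    /// Pops the predicted return address; true if it matches the real target.
    fn ret(&mut self, actual: u64) -> bool {
        self.entries.pop() == Some(actual)
    }
}

/// Toy indirect-jump predictor: maps the previous target to the one that followed it.
struct IndirectPredictor {
    table: HashMap<u64, u64>,
    previous: u64,
}

impl IndirectPredictor {
    fn new() -> Self {
        IndirectPredictor { table: HashMap::new(), previous: 0 }
    }

    /// Predicts from history, then trains on the real target; true on a hit.
    fn jump(&mut self, actual: u64) -> bool {
        let hit = self.table.get(&self.previous) == Some(&actual);
        self.table.insert(self.previous, actual);
        self.previous = actual;
        hit
    }
}

/// Returns (mispredicted `ret`s, mispredicted `jmp`s) for `switches` context switches.
fn simulate(switches: usize) -> (usize, usize) {
    const RESUME_RETURN: u64 = 0x1000; // hypothetical address after the resume call
    const YIELD_RETURN: u64 = 0x2000; // hypothetical address after the yield call
    let mut rsb = ReturnStack::new(16);
    let mut indirect = IndirectPredictor::new();
    let (mut ret_misses, mut jmp_misses) = (0, 0);

    for i in 0..switches {
        // The caller's `call` records its own return address, but the switch
        // resumes the *other* context, so the real target is the other site.
        let (pushed, target) = if i % 2 == 0 {
            (RESUME_RETURN, YIELD_RETURN)
        } else {
            (YIELD_RETURN, RESUME_RETURN)
        };
        rsb.call(pushed);
        if !rsb.ret(target) {
            ret_misses += 1;
        }
        if !indirect.jump(target) {
            jmp_misses += 1;
        }
    }
    (ret_misses, jmp_misses)
}

fn main() {
    let (ret_misses, jmp_misses) = simulate(1000);
    println!("ret: {} mispredictions, jmp: {} mispredictions", ret_misses, jmp_misses);
}
```

In this model every `ret` misses, because the return stack always holds the caller's own return address, while the history-based predictor only misses during the first three switches. Real predictors are far more elaborate, so treat this purely as an illustration of the mechanism.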

This still holds on modern CPUs, Zen 3 at least. I changed two lines on ~master at https://github.com/Xudong-Huang/generator-rs/blob/5888dac74f498a3c176f7b23c2260822a3928f48/src/detail/asm/asm_x86_64_sysv_elf_gas.S#L43 to:

pop %rax
jmp *%rax

`x86_64 zen3`
name	before ns/iter	after ns/iter	diff ns/iter	diff %	speedup
scoped_yield_bench	22	9	-13	-59.09%	x 2.44
single_yield_bench	25	10	-15	-60.00%	x 2.50
single_yield_with_bench	22	9	-13	-59.09%	x 2.44
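For reference, the derived columns follow directly from the before/after timings. A minimal Rust sketch, assuming diff % is measured relative to the "before" value and the last column is before divided by after (the helper name `compare` is made up for this example):

```rust
/// Returns (diff in ns/iter, diff in percent of `before`, speedup factor).
fn compare(before_ns: f64, after_ns: f64) -> (f64, f64, f64) {
    let diff = after_ns - before_ns;
    let diff_pct = diff / before_ns * 100.0;
    let speedup = before_ns / after_ns;
    (diff, diff_pct, speedup)
}

fn main() {
    // Timings copied from the x86_64 zen3 table.
    let rows = [
        ("scoped_yield_bench", 22.0, 9.0),
        ("single_yield_bench", 25.0, 10.0),
        ("single_yield_with_bench", 22.0, 9.0),
    ];
    for (name, before, after) in rows.iter() {
        let (diff, pct, speedup) = compare(*before, *after);
        println!("{}\t{:.0}\t{:.2}%\tx {:.2}", name, diff, pct, speedup);
    }
}
```

For the first row this gives a diff of -13 ns/iter, about -59.09%, and a speedup of roughly 2.44, matching the table.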

`perf` output for one of the benches
event	before	after	unit
page-faults	779.883	6457	/sec
stalled-cycles-frontend	12.93%	0.08%	frontend cycles idle
stalled-cycles-backend	1.30%	48.56%	backend cycles idle
instructions	1.32	3.01	insn per cycle
branches	1.247	2.754	G/sec
branch-misses	6.61%	0.03%	of all branches

I don't have other hardware at hand right now, but I can test this on a MacBook M1 this week.

Xudong-Huang commented 1 year ago

Great! Thanks for the findings!

Xudong-Huang commented 1 year ago

I don't have an aarch64 platform, so I may need help from other people. Something like the following could replace the ret instruction:

ldp x2, x1, [sp], #16 
br  x1

2dav commented 1 year ago

`Apple M1 2020`
name	before ns/iter	after ns/iter	diff ns/iter	diff %	speedup
scoped_yield_bench	22	11	-11	-50.00%	x 2.00
single_yield_bench	23	12	-11	-47.83%	x 1.92
single_yield_with_bench	23	11	-12	-52.17%	x 2.09

I'm not familiar with ARM assembly, but from some quick googling it seems that ret on ARM doesn't pop the return address off the stack; it reads it from the LR (x30) register, which is already populated at the return point. So the required change is:
https://github.com/Xudong-Huang/generator-rs/blob/4fcd32462842e748a10640e2437f5d0d0a275e70/src/detail/asm/asm_aarch64_aapcs_macho_gas.S#L51

br x30

This passes all of the tests and the bench workload.

Xudong-Huang / generator-rs

Context switching performance #39