dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Do we still need to use REP RET for returns targeted/following after jumps? #84456

Open VSadov opened 1 year ago

VSadov commented 1 year ago

REP RET was a workaround for some peculiarities of the branch predictors in AMD Family 10h/12h. As of 15h (Bulldozer), the recommendation disappeared from the official optimization guides, and the guides for the Zen families contain a recommendation not to do that:

https://www.amd.com/en/support/tech-docs/software-optimization-guide-for-the-amd-zen4-microarchitecture

2.8.1.3.2 REP RET
For prior processor families, such as Family 10h and 12h, a three-byte return-immediate RET 
instruction had been recommended as an optimization to improve performance over a single-byte 
near-return. For the AMD Zen4 microarchitecture, this is no longer recommended and a single-byte 
near-return (opcode C3h) can be used with no negative performance impact. This will result in 
smaller code size over the three-byte method. For the rationale for the former recommendation, see 
section 6.2 in the Software Optimization Guide for AMD Family 10h and 12h Processors.

15h (Bulldozer) is a 12-year-old architecture. Perhaps it is time to stop using the pattern?

We have a number of assembly helpers that use REPRET. I am not sure if the JIT emits it too.
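
To make the code-size argument concrete, here is a minimal sketch that tabulates the standard x86-64 encodings of the three return forms mentioned above and computes how many bytes a replacement would save. The helper name `bytes_saved` and the table name are hypothetical, chosen only for this illustration; nothing here comes from the runtime sources.

```python
# Standard x86-64 encodings of the near-return forms discussed in this thread.
# RETURN_ENCODINGS and bytes_saved are illustrative names, not runtime APIs.
RETURN_ENCODINGS = {
    "ret": bytes([0xC3]),                # single-byte near return
    "rep ret": bytes([0xF3, 0xC3]),      # REP prefix followed by near return
    "ret 0": bytes([0xC2, 0x00, 0x00]),  # return-immediate with imm16 = 0
}


def bytes_saved(old_form: str, count: int, new_form: str = "ret") -> int:
    """Bytes saved by replacing `count` returns of old_form with new_form."""
    old_size = len(RETURN_ENCODINGS[old_form])
    new_size = len(RETURN_ENCODINGS[new_form])
    return (old_size - new_size) * count


if __name__ == "__main__":
    for name, encoding in RETURN_ENCODINGS.items():
        print(f"{name:8s} {encoding.hex(' ')}  ({len(encoding)} bytes)")
```

The saving per return is small (one byte for `rep ret`, two for `ret 0`), so the case for the change rests mostly on dropping an obsolete pattern rather than on size.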

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak See info in area-owners.md if you want to be subscribed.

Author: VSadov
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -

VSadov commented 1 year ago

CC @tannergooding

JulieLeeMSFT commented 1 year ago

cc @dotnet/jit-contrib. We will not have time to address this in .NET 8. Pushing to Future.

tannergooding commented 1 year ago

@VSadov, I think it's fine to no longer do an optimization that was only relevant to Phenom-based processors.

I'd imagine fixing this should be trivial overall.

AndyAyersMS commented 10 months ago

I only see two appearances in code and one in SPMI's pattern matcher. AFAIK the jit does not emit these.

@VSadov are there more that you know of?

src/coreclr/tools/superpmi/superpmi/neardiffer.cpp:        if (u16_strcmp(instrMnemonic_1, L"rep ret") == 0)
src/coreclr/vm/amd64/jithelpers_singleappdomain.S:        rep ret
src/coreclr/vm/amd64/jithelpers_singleappdomain.S:        rep ret

VSadov commented 10 months ago

I am not sure if the JIT emits this. It may be mostly relevant to various asm helpers like GC barriers. There is even a macro for this, like here:

https://github.com/dotnet/runtime/blob/08903c00860939af0d291eb3f3c037a18d9820f6/src/coreclr/vm/amd64/jithelpers_fast.S#L163-L165
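
Since a search for the literal `rep ret` misses uses that go through the macro, a sketch of a scan that matches both spellings may help when auditing the tree. This assumes the macro is spelled `REPRET`, as mentioned earlier in the thread. The function names and the list of file suffixes are hypothetical choices for this example, and a text match like this will also report comments and string literals.

```python
import re
import sys
from pathlib import Path

# Matches the literal instruction ("rep ret", any spacing or case) or the
# macro name REPRET (assumed spelling, taken from the discussion above).
PATTERN = re.compile(r"\brep\s+ret\b|\bREPRET\b", re.IGNORECASE)


def scan_text(text):
    """Return (line_number, stripped_line) for each line mentioning REP RET."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


def find_rep_ret(root, suffixes=(".S", ".asm", ".inc", ".cpp", ".h")):
    """Yield (path, line_number, line) for matching lines in files under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in scan_text(text):
            yield path, number, line


if __name__ == "__main__":
    # Usage: python find_rep_ret.py <directory> [<directory> ...]
    for root in sys.argv[1:]:
        for path, number, line in find_rep_ret(root):
            print(f"{path}:{number}: {line}")
```

Each hit still needs a manual look before it is replaced with a plain `ret`.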

AndyAyersMS commented 10 months ago

Ok, I am going to switch this over to runtime.