Open a74nh opened 1 week ago
You are right about the code produced when tiering is disabled. However, if we enable tiering and let the JIT collect profile data, we make a better decision and push the cold block out of the loop.
As you can see below, the profile data tells us that `BB07` is less likely to execute, so we push it toward the bottom, which removes the extra unconditional branch.
Diff comparison of `FullOpts` vs. `Tier 1`.
As for the compactness/reuse portion of the loop, that is something we can investigate further.
> when tiering is disabled
Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up.
Thanks
@dotnet/jit-contrib
> Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up.
If you have any examples handy, I'd like to take a look. We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon. Our block layout plans are quite reliant on profile data, though the JIT frontend has some tricks to determine which blocks are important in the absence of profile data (for example, see `Compiler::optSetBlockWeights`, though this phase needs quite a bit of work, too).
> If you have any examples handy, I'd like to take a look.
Don't have anything on hand right now but will keep an eye out next time I open Ghidra 🙂. Some of them look like this one: https://github.com/dotnet/runtime/issues/93536 except torn loops or "why does this even have jump threading" cases appear under different conditions now. In any case, thanks for looking into this.
> You are right about the code produced when tiering is disabled. However, if we enable tiering and let the JIT collect profile data, we make a better decision and push the cold block out of the loop.
This is good to see - I didn't realise profile data was being used this way.
However, this shouldn't need profile data to optimise.
> We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon.
Ok, great. If I see anything else, I'll make sure to raise issues or comment here.
Consider:
All entries less than `input[0]` go into `left`, everything else goes into `right`. This is the main routine inside a quicksort.
On Arm64 this runs at half the performance of the equivalent C++ version compiled by GCC. I suspect X86 has similar issues but I haven't tried yet.
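A minimal C++ sketch of the partition step described above, for orientation only. The name `partition_by_first`, the `Partitions` struct and the vector-based signature are assumptions for illustration, not the code that was actually measured. It also assumes the pivot `input[0]` itself is skipped, and it uses `push_back` where the benchmarked code may have written into preallocated arrays, so the generated assembly will differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the two output partitions.
struct Partitions {
    std::vector<int> left;   // entries strictly less than the pivot
    std::vector<int> right;  // everything else
};

// Split `input` around its first element (pivot = input[0]).
// Assumption: the pivot itself is not copied, so the loop starts at index 1.
Partitions partition_by_first(const std::vector<int>& input) {
    Partitions out;
    if (input.empty()) {
        return out;
    }
    const int pivot = input[0];
    for (std::size_t i = 1; i < input.size(); ++i) {
        // This if/else inside the loop is the branch whose block layout
        // the discussion is about.
        if (input[i] < pivot) {
            out.left.push_back(input[i]);
        } else {
            out.right.push_back(input[i]);
        }
    }
    return out;
}
```

The loop body is a single two-way branch followed by a store, so how the compiler lays out the two arms relative to the loop back-edge decides how many jumps each iteration takes.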
The C# disassembly looks reasonable:
However, the C++ version is a little more cunning:
Due to the block ordering there are fewer branches.
In coreclr, there are always two jumps per loop iteration.
In GCC, there are zero to two jumps per loop iteration.
This difference in block order has a huge impact on performance: the coreclr version runs at half the speed of the GCC version.