Open filipnavara opened 2 months ago
Tagging subscribers to 'os-ios': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger See info in area-owners.md if you want to be subscribed.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Regarding:
Aside from performance implications, this alignment makes the size of NativeAOT code on iOS/macOS bigger due to unnecessary alignment.
Do you by any chance have a measure of perf/size implications caused by this?
A very rough ballpark estimate is 50 KB for an empty MAUI app. I'll actually need to implement the change to measure something more specific. I noticed it when measuring an unrelated change where 8-byte changes in prolog/epilog size resulted in 32-byte changes in the object file.
Not sure how to easily benchmark it without running the whole dotnet/performance test suite and looking for improvements/regressions. Do we have some automated runs on Apple Silicon or does it have to be done manually?
Tested on dotnet new maui targeting ios-arm64. Removing the method alignment results in -24.6 KB. Removing the actual loop alignment inside the JIT saves an additional 2.4 KB, but it triggers at least three different asserts in the JIT.
Note that the default ILC method alignment is 16 bytes, so the impact would be bigger for <OptimizationPreference>Size</OptimizationPreference> where the default method alignment is 4 bytes.
With --Os, the size difference is -45.2 KB when disabling the loop alignment and the related code alignment.
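To make the size arithmetic above a bit more concrete, here is a minimal sketch of how start alignment inflates a code section. The method sizes are made up for illustration, and `layout_size` is a hypothetical helper, not anything taken from ILC or the JIT. It simply places methods back to back and rounds every start address up to the requested alignment.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def layout_size(method_sizes, alignment):
    """Bytes used when every method start is aligned to `alignment`."""
    offset = 0
    for size in method_sizes:
        offset = align_up(offset, alignment) + size
    return offset


# Hypothetical method sizes in bytes (ARM64 instructions are 4 bytes each).
sizes = [40, 100, 12, 68]

for alignment in (4, 16, 32):
    print(alignment, layout_size(sizes, alignment))
# prints: 4 220, then 16 244, then 32 292

# An 8-byte growth in one method can push the next start by a full 32 bytes.
print(layout_size([68, 4], 32) - layout_size([60, 4], 32))  # prints 32
```

The last line mirrors the observation earlier in the thread: under 32-byte alignment, a small change in prolog/epilog size can show up as a 32-byte change in the output.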
Thanks a lot for checking it out! We should also run some perf benchmarks to see how much it impacts the speed. Regarding:
Not sure how to easily benchmark it without running the whole dotnet/performance test suite and looking for improvements/regressions. Do we have some automated runs on Apple Silicon or does it have to be done manually?
/cc: @matouskozak
I'm not aware of any automated runs on macOS apple silicon, only Windows and Linux x64/arm64 machines. If you have any specific scenarios that you expect to be affected, you can use the benchmarking script https://github.com/dotnet/performance/blob/main/scripts/BENCHMARKS_LOCAL_README.md which allows you to run specific microbenchmarks from the dotnet/performance repo locally.
@EgorBo do you know if CoreCLR perf is tested on Apple Silicon?
Sadly, we don't test on macOS, so before making changes like this we should include macOS in perf testing first.
It's also worth noting that this is going to end up being a microarchitecture-specific optimization. Most of the Arm optimization manuals (such as for Neoverse V1 or Cortex-A75) explicitly call out that branch instruction and branch target instruction alignment/density impacts performance. They typically recommend not placing more than four branch instructions within an aligned 32-byte window, etc.
So doing this may increase the testing complexity matrix and may change in the future. Notably, there's also complexity in that some architectures differentiate loop target alignment from branch target alignment and have different recommendations between the two of them.
If we did something here, it might be beneficial to ensure we get the right flexibility added so that things can be customized as appropriate.
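As an illustration of the branch-density guidance quoted above (no more than four branch instructions per aligned 32-byte window), here is a rough sketch of how such a check could look. The function name and the input format (a list of byte offsets of branch instructions within a method) are assumptions made for this example; the JIT does not expose anything like this.

```python
from collections import Counter

WINDOW_BYTES = 32  # aligned window size from the guidance quoted above
MAX_BRANCHES = 4   # "not more than four branch instructions" per window


def crowded_windows(branch_offsets, window=WINDOW_BYTES, limit=MAX_BRANCHES):
    """Return start offsets of aligned windows holding more than `limit` branches."""
    per_window = Counter(offset // window for offset in branch_offsets)
    return sorted(index * window
                  for index, count in per_window.items()
                  if count > limit)


# Hypothetical branch offsets: five branches land in the first 32-byte window.
print(crowded_windows([0, 4, 12, 20, 28, 36, 64]))  # prints [0]
```

Both the window size and the limit are parameters here because, as noted above, the recommended values differ between microarchitectures.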
Some microarchitectures benefit the most from no alignment, some from 16, some from 32, and even some of the latest from 64-byte alignment instead. For example, both the latest AMD and Intel optimization guides recommend using a 64-byte code window consideration, while somewhat older ones recommend 32 bytes instead (which is what the JIT currently does, but only considering loop starts), and "legacy CPUs" recommend using 16 bytes.
The root consideration here is not necessarily to align the loop, but rather to fit as much of the loop body into a single "instruction fetch and/or decode window" as possible. Hence the official guidance details aligning either the loop start of a hot loop to the beginning of an aligned boundary or the loop end to the end of an aligned boundary, with the latter often being preferred if you have to pick one over the other (something we don't even try to consider today). If a loop already fits entirely in the optimal window, then there can be much less benefit to doing any alignment and code size can be saved.
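A small sketch of the window reasoning above, under the simplifying assumption that a loop is just a byte range `[start, start + size)` and that the fetch/decode window is a fixed power-of-two size. All helper names are hypothetical. The sketch only computes how many aligned windows a loop touches and how much padding start or end alignment would need; it does not model any real JIT heuristic.

```python
def windows_touched(start, size, window):
    """Number of aligned `window`-byte blocks overlapped by [start, start + size)."""
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


def min_windows(size, window):
    """Fewest windows a loop of `size` bytes can possibly occupy."""
    return -(-size // window)  # ceiling division


def padding_to_align_start(start, window):
    """NOP bytes needed so the loop begins on a window boundary."""
    return -start % window


def padding_to_align_end(start, size, window):
    """NOP bytes needed so the loop ends exactly on a window boundary."""
    return -(start + size) % window


# Hypothetical 24-byte loop at offset 52 with a 32-byte window.
start, size, window = 52, 24, 32
print(windows_touched(start, size, window))       # prints 2 (straddles a boundary)
print(min_windows(size, window))                  # prints 1
print(padding_to_align_start(start, window))      # prints 12
print(padding_to_align_end(start, size, window))  # prints 20

# The same loop at offset 36 already fits in one window, so padding buys nothing.
print(windows_touched(36, size, window))          # prints 1
```

Comparing `windows_touched` against `min_windows` is one way to express the "already fits in the optimal window" case, where alignment can be skipped and the padding bytes saved.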
Moved to .NET 10 because we might not have time to work on it in .NET 9. Please move to .NET 9 if it needs to be resolved in .NET 9.
Should this be under native AOT <OptimizationPreference>Speed</OptimizationPreference> vs. <OptimizationPreference>Size</OptimizationPreference> settings?
I think that the difference between Arm optimization manuals and Apple optimization manuals is more likely in the point of view, not necessarily in the micro-architecture differences. The Arm optimization manuals are more oriented towards achieving the best scores on industry CPU-intensive benchmarks. The Apple optimization manuals are more oriented towards achieving the best user experience on Apple devices and they might have determined that smaller code provides better experience on average even when it is not necessarily the fastest code in microbenchmarks.
…not necessarily in the micro-architecture differences.
I am not necessarily convinced that’s the case. Unfortunately I cannot quote the parts of the optimisation manual without breaking the Apple license. :-/ They seem to suggest the microarchitecture is less sensitive to the alignment and you may just benefit from feeding the instruction decoder with non-NOPs and from more actual code in the cache.
I will try to run some benchmarks when I get a spare moment.
Disabling loop alignment with OptimizationPreference=Size, however, seems like a generally good idea to me. Also, the JitAlignLoops=0 option in the JIT should likely disable the relevant method alignment flag as well.
From https://github.com/dotnet/runtime/issues/107284#issuecomment-2326271421 and https://github.com/dotnet/runtime/issues/107284#issuecomment-2326314236, my understanding is that the size increase is due more to method alignment than to loop alignment. Loop alignment is very conservative in deciding when to add alignment and will mostly give up based on various factors like loop size, padding needed, etc. As such, the best thing to do would be, if optimizing for Size:
https://github.com/dotnet/runtime/pull/107340 makes the JIT change that enables us to get rid of the method alignment through a switch.
PR #59828 started enforcing 32-byte alignment for methods with loops on ARM64 based on the Neoverse N1 optimization guide:
https://github.com/dotnet/runtime/blob/16fe4d41ae95607f9214874ab7f22b3df5d8e561/src/coreclr/jit/emit.cpp#L6775-L6785
This goes contrary to the Apple Silicon CPU Optimization Guide, section 4.4.3 Branch Target Alignment. Apple specifically states that software alignment of branch targets is unnecessary and sometimes detrimental due to the alignment capabilities of the processor. The guidance is to not align branch targets and favor smaller code size.
Aside from performance implications, this alignment makes the size of NativeAOT code on iOS/macOS bigger due to unnecessary alignment.