@CarolEidt please help look into this.
CC @dotnet/jit-contrib
@adamsitnik - could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using Windows pathnames, and if I try this:
python scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
I get a syntax error on line 39
Assuming you have the right builds of 3.1 and 5.0 installed, dotnet is on your path, and you have cloned the perf repo (to, say, ~/repos/performance), then
cd ~/repos/performance/src/benchmarks/micro
dotnet run -c Release -f net5.0 -- -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
should work. Add -d to get the BDN-produced disassembly.
Thanks @AndyAyersMS !
Also this may be touching on the relatively new cast caching, so cc @VSadov.
If this is a simple unbox, it should be jit-inlined. Even if not inlined, it should not hit the cast cache, since unbox requires an exact type match*, but it could be calling a managed helper now, and there is a small penalty due to tiering/R2R indirection. It is not a lot, but may be noticeable on super-fast casts.
I will take a look.
*enums match with underlying types as well and thus have some special handling, but that still does not need to use cache.
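To make that footnote concrete, here is a small illustration (a hypothetical example, not code from the runtime) of the enum/underlying-type case:

```csharp
using System;

class EnumUnboxDemo
{
    enum Color : int { Red = 1 }

    static void Main()
    {
        object boxedEnum = Color.Red;
        int asInt = (int)boxedEnum;      // OK: a boxed enum unboxes as its underlying type
        object boxedInt = 42;
        Color asColor = (Color)boxedInt; // OK: a boxed int unboxes as the enum
        Console.WriteLine($"{asInt} {asColor}");
    }
}
```

Because the types here are not identical yet the unbox still succeeds, this case cannot be handled by a pure method-table-pointer comparison, which is why it gets special handling.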
could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using windows pathname
Please excuse me for that, I must have copy-pasted it from a previous Windows-specific issue. I've fixed the description.
I get a syntax error on line 39
If you append 3 to python, it should use Python 3.x and work:
python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
Add -d to get the BDN-produced disassembly.
Unfortunately, the BDN disassembler does not support ARM. Some time ago we switched to the Iced library, and it does not support ARM yet (https://github.com/0xd4d/iced/issues/79, https://github.com/0xd4d/iced/issues/80).
Looking at the code generation, the only difference is that in 5.0 several address constants have been CSE'd. The loop goes from 108 bytes down to 64 bytes, but in both cases it is small enough that alignment could make a big difference. Unless there's an objection, I would suggest we close this and reference #8108 once again. (Perhaps we should add support for a specified loop alignment, even if only for use in benchmarking, to validate or invalidate hypotheses such as this.)
BEFORE:
G_M590_IG03: ; offs=000028H, size=003CH, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
IN0008: 000028 movz x1, #0xd1ffab1e
IN0009: 00002C movk x1, #0xd1ffab1e LSL #16
IN000a: 000030 movk x1, #0xd1ffab1e LSL #32
IN000b: 000034 ldr x20, [x1]
IN000c: 000038 ldr x1, [x20]
IN000d: 00003C movz x0, #0xd1ffab1e
IN000e: 000040 movk x0, #0xd1ffab1e LSL #16
IN000f: 000044 movk x0, #0xd1ffab1e LSL #32
IN0010: 000048 cmp x1, x0
IN0011: 00004C beq G_M590_IG04
IN0012: 000050 mov x1, x20
IN0013: 000054 movz x0, #0xd1ffab1e
IN0014: 000058 movk x0, #0xd1ffab1e LSL #16
IN0015: 00005C movk x0, #0xd1ffab1e LSL #32
IN0016: 000060 bl CORINFO_HELP_UNBOX
G_M590_IG04: ; offs=000064H, size=0030H, gcrefRegs=100000 {x20}, byrefRegs=0000 {}, byref, isz
IN0017: 000064 ldr w0, [x20,#8]
IN0018: 000068 movz x1, #0xd1ffab1e
IN0019: 00006C movk x1, #0xd1ffab1e LSL #16
IN001a: 000070 movk x1, #0xd1ffab1e LSL #32
IN001b: 000074 str w0, [x1]
IN001c: 000078 add w19, w19, #1
IN001d: 00007C movz x0, #0xd1ffab1e
IN001e: 000080 movk x0, #0xd1ffab1e LSL #16
IN001f: 000084 movk x0, #0xd1ffab1e LSL #32
IN0020: 000088 ldr w0, [x0]
IN0021: 00008C cmp w19, w0
IN0022: 000090 blt G_M590_IG03
AFTER:
G_M50398_IG03: ; offs=000038H, size=001CH, bbWeight=4 PerfScore 36.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
IN000b: 000038 movz x1, #0xd1ffab1e
IN000c: 00003C movk x1, #0xd1ffab1e LSL #16
IN000d: 000040 movk x1, #0xd1ffab1e LSL #32
IN000e: 000044 ldr x22, [x1]
IN000f: 000048 ldr x1, [x22]
IN0010: 00004C cmp x1, x21
IN0011: 000050 beq G_M50398_IG05
G_M50398_IG04: ; offs=000054H, size=000CH, bbWeight=1 PerfScore 2.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref
IN0012: 000054 mov x1, x22
IN0013: 000058 mov x0, x21
IN0014: 00005C bl CORINFO_HELP_UNBOX
G_M50398_IG05: ; offs=000060H, size=0018H, bbWeight=4 PerfScore 36.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref, isz
IN0015: 000060 ldr w0, [x22,#8]
IN0016: 000064 str w0, [x20,#8]
IN0017: 000068 add w19, w19, #1
IN0018: 00006C ldr w0, [x20]
IN0019: 000070 cmp w19, w0
IN001a: 000074 blt G_M50398_IG03
Did something change in the way CORINFO_HELP_UNBOX is implemented? Do you have a differential profile?
The benchmark does a trivial unbox of an (object)1 to int:
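The shape of the benchmark is roughly as follows (a sketch, not the exact dotnet/performance source; the iteration count and field names are illustrative):

```csharp
// Rough sketch of PerfLabTests.CastingPerf2.CastingPerf.IntObj:
// unbox a boxed int in a tight loop, storing the result into a static.
public class CastingPerf
{
    public const int Iterations = 100_000;  // assumed loop count

    public static object ObjI = 1;          // boxed int
    public static int J;

    public void IntObj()
    {
        for (int i = 0; i < Iterations; i++)
            J = (int)ObjI;                  // trivial unbox: MT check + field load
    }
}
```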
I think it should not call the helper and should do the unbox completely inline. It could indeed be an effect of loop alignment on a highly sensitive benchmark.
CORINFO_HELP_UNBOX did actually change. The helper has been moved to managed code.
It should not be called in simple cases, though. It basically handles cast failures (throw) or rare cases such as unboxing an enum to its underlying type.
I can't find the comment just now, but I recall @sdmaclea saying that ARM64 didn't have the same kind of code alignment penalties that we see on xArch.
I wouldn't expect significant alignment penalties. Instructions are always 4 bytes and 4 byte aligned by definition. If a loop crosses a cache line or page boundary there might be some perf difference, but I wouldn't expect much.
The branch predictor could possibly be affected by the hash of the branch PC, but given there are only two branches in the loop, I wouldn't expect that to be the issue.
The unboxing seems more likely to be the issue.
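As a rough illustration of the cache-line point, the check below applies the offsets from the AFTER disassembly (a sketch; it assumes the method entry is cache-line aligned, which the dump does not tell us):

```csharp
using System;

class AlignmentCheck
{
    // True if code starting 'start' bytes from an aligned base and
    // 'size' bytes long crosses a cache-line boundary of width 'line'.
    static bool CrossesLine(int start, int size, int line = 64)
        => start / line != (start + size - 1) / line;

    static void Main()
    {
        // AFTER loop: starts at offset 0x38, total size 0x40 (IG03+IG04+IG05).
        Console.WriteLine(CrossesLine(0x38, 0x40));  // True: spans two lines
    }
}
```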
I've not been able to figure out how to debug the benchmark as run by the perf harness, but I've extracted the benchmark method, and verified that the helper is never called. Unless somehow different code is generated, there must be something else going on.
Based on the results here, it looks like there is some modality that seems likely to be microarchitectural. I note that the oscillation seemed to increase around the time of https://github.com/dotnet/runtime/pull/39096, which enabled the CSE of these large constants, though it doesn't seem to coincide precisely.
I would guess those branches are not well predicted. So minimizing branch mispredict recovery time would be important.
I am wondering if the uArch modality has to do with fetch group alignment... Like you first suggested...
I haven't looked at the source, but looking at the disassembly, it looks to me like a lot of G_M50398_IG03 is constant and could be hoisted out of the loop, especially if the register allocator could free another preserved register for the dst of IN000f: 000048 ldr x1, [x22].
it looks to me like a lot of G_M50398_IG03 is const and could be hoisted out of the loop.
We improved the CSE'ing of constants in .NET 5, so the "AFTER" loop has only one large constant. It's not hoisted out of the loop because it is marked as being dependent on the class constructor and therefore not hoistable. This is the address of a class static, so it's unclear to me why it's not hoistable. @briansull @AndyAyersMS - can you enlighten me why such a constant would not be hoistable?
@TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?
I should also note that, when I extracted the benchmark method and added a timer using Stopwatch, the performance was basically the same between 3.1 and 5.0.
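For reference, that standalone measurement looked roughly like this (a minimal sketch; the iteration count and names are illustrative, not the exact code used):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static object o = 1;   // boxed int, mirroring the benchmark
    static int j;

    static void Main()
    {
        const int Iterations = 100_000_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            j = (int)o;    // the unbox under test
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds * 1_000_000 / Iterations:F2} ns per iteration");
    }
}
```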
@TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?
@CarolEidt that looks to me like what you suspected: loop alignment. Most uArchs have alignment requirements for branch targets (for performance, not correctness); newer Cortex-A cores generally prefer 32-byte alignment for branch targets. See, for instance, the Neoverse-N1 optimization guide, section 4.8 "Branch instruction alignment", for some of the requirements. I believe your CIs run XGene? In GCC these are the alignment requirements we have for it.
What we've observed is that for small loops the alignment makes a big difference, and while both loops have misaligned targets, the second loop is much smaller, so it would be more sensitive.
Thanks so much @TamarChristinaArm - I propose that we either close this or mark it "Future" and take it into consideration when/if we address #8108.
Thoughts @adamsitnik ?
I can try to replicate what I did for #2249 to confirm that some of the issues we're seeing now are indeed alignment. In the meantime, you might be able to use perf stat to confirm that what you're seeing are IPC issues and not additional instructions. See https://github.com/dotnet/runtime/issues/41741#issuecomment-686816889 for one example of this.
In the short run we should figure out the proper method entry alignment -- we can't do anything about internal alignments until we fix this. Then we can at least enable the (optional) loop alignment for ARM (implement emitLoopAlign and any supporting bits). Perhaps extending #2249 to arm64 is the simplest thing?
@TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?
@AndyAyersMS - it would be interesting to try implementing #2249 for arm64. We've marked this 6.0.0, though, as it seems that modifying alignment would require significant perf analysis to ensure that we understand the impact.
Here's the output from perf stat -e "branch-misses,cache-misses,cpu-cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend" (the first run is 5.0 and the second is 3.1):
Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp5.0 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':
927,236,914 branch-misses
435,682,900 cache-misses
90,726,403,850 cpu-cycles
59,721,463,264 instructions # 0.66 insn per cycle
# 0.67 stalled cycles per insn
40,144,534,076 stalled-cycles-frontend # 44.25% frontend cycles idle
26,028,533,828 stalled-cycles-backend # 28.69% backend cycles idle
36.046207216 seconds time elapsed
Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp3.1 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':
746,009,625 branch-misses
347,766,489 cache-misses
76,174,119,876 cpu-cycles
84,227,660,637 instructions # 1.11 insn per cycle
# 0.38 stalled cycles per insn
32,418,821,468 stalled-cycles-frontend # 42.56% frontend cycles idle
14,048,337,909 stalled-cycles-backend # 18.44% backend cycles idle
@TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?
No, we just emit multiple NOPs. For the larger alignment constraints like 32, we only align if it means adding less than 16 bytes of padding. For the smaller ones we generally always align.
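Expressed as code, that heuristic looks roughly like this (a sketch of the rule as described above; the 16-byte threshold comes from the comment, and offsets are illustrative):

```csharp
using System;

class PaddingRule
{
    // Pad a branch target up to 'alignment' bytes only when the cost
    // (in NOP bytes) stays under 'maxPad'; otherwise leave it as-is.
    static int PaddingFor(int offset, int alignment = 32, int maxPad = 16)
    {
        int pad = (alignment - offset % alignment) % alignment;
        return pad < maxPad ? pad : 0;
    }

    static void Main()
    {
        Console.WriteLine(PaddingFor(0x38)); // 8: pad with two NOPs
        Console.WriteLine(PaddingFor(0x48)); // 0: 24 bytes of padding is too costly
    }
}
```

Note that for smaller alignments (4 or 8 bytes) the required padding is always under 16 bytes, which matches "for the smaller ones we generally always align".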
It appears the consensus here is the issue is loop alignment, for which we already have linked issues tracking the work, so I'm going to close this. If that's incorrect, then feel free to re-open with a clear note about what unique work/issue this will address.
On my Windows x64 machine, when I tested this benchmark with my changes in https://github.com/dotnet/runtime/pull/44370, I did see this benchmark improve:

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| PerfLabTests.CastingPerf.IntObj | 1.04 | 368774.36 | 355509.29 | |
On my Windows x64
@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?
On my Windows x64
@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?
Ah, that's true. Currently my loop alignment work is for xarch, but once I get it for ARM, I will verify this.
After running benchmarks for 3.1 vs 5.0 using the "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found out that PerfLabTests.CastingPerf2.CastingPerf.IntObj has regressed 2x. It looks like this is an ARM64-specific regression; I was not able to reproduce it for ARM (the 32-bit variant).
Repro:
python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
cc @kunalspathak
category:cq theme:needs-triage skill-level:expert cost:large