@CarolEidt please help look into this.
CC @dotnet/jit-contrib
@adamsitnik - could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using Windows pathnames, and if I try this:
python scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
I get a syntax error on line 39
Assuming you have the right builds of 3.1 and 5.0 installed, dotnet is on your path, and you have cloned the perf repo (to, say, ~/repos/performance), then
cd ~/repos/performance/src/benchmarks/micro
dotnet run -c Release -f net5.0 -- -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
should work. Add -d to get the BDN-produced disassembly.
Thanks @AndyAyersMS !
Also this may be touching on the relatively new cast caching, so cc @VSadov.
If this is a simple unbox, it should be jit-inlined. Even if not inlined, it should not hit the cast cache, since unbox requires an exact type match*, but it could be calling a managed helper now, and there is a small penalty due to tiering/R2R indirection. It is not a lot, but may be noticeable on super-fast casts.
I will take a look.
*enums match with underlying types as well and thus have some special handling, but that still does not need to use cache.
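To make that footnote concrete, here is a small illustration (a hypothetical example, not code from the runtime) of the enum/underlying-type case:

```csharp
using System;

class EnumUnboxDemo
{
    enum Color : int { Red = 1 }

    static void Main()
    {
        object boxedEnum = Color.Red;
        int asInt = (int)boxedEnum;      // OK: a boxed enum unboxes as its underlying type
        object boxedInt = 42;
        Color asColor = (Color)boxedInt; // OK: a boxed int unboxes as the enum
        Console.WriteLine($"{asInt} {asColor}");
    }
}
```

Because the types here are not identical yet the unbox still succeeds, this case cannot be handled by a pure method-table-pointer comparison, which is why it gets special handling.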
could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using windows pathname
Please excuse me for that, I must have copy-pasted it from a previous Windows-specific issue. I've fixed the description.
I get a syntax error on line 39
If you append 3 to python, it should use Python 3.x and work:
python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
Add -d to get the BDN-produced disassembly.
Unfortunately, the BDN disassembler does not support ARM. Some time ago we switched to the Iced library, and it does not support ARM yet (https://github.com/0xd4d/iced/issues/79, https://github.com/0xd4d/iced/issues/80).
Looking at the code generation, the only difference is that in 5.0 several address constants have been CSE'd. The loop goes from 108 bytes down to 64 bytes, but in both cases it is small enough that alignment could make a big difference. Unless there's an objection, I would suggest we close this and reference #8108 once again. (Perhaps we should add support for a specified loop alignment, even if only for use in benchmarking, to validate or invalidate hypotheses such as this.)
BEFORE:
G_M590_IG03: ; offs=000028H, size=003CH, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
IN0008: 000028 movz x1, #0xd1ffab1e
IN0009: 00002C movk x1, #0xd1ffab1e LSL #16
IN000a: 000030 movk x1, #0xd1ffab1e LSL #32
IN000b: 000034 ldr x20, [x1]
IN000c: 000038 ldr x1, [x20]
IN000d: 00003C movz x0, #0xd1ffab1e
IN000e: 000040 movk x0, #0xd1ffab1e LSL #16
IN000f: 000044 movk x0, #0xd1ffab1e LSL #32
IN0010: 000048 cmp x1, x0
IN0011: 00004C beq G_M590_IG04
IN0012: 000050 mov x1, x20
IN0013: 000054 movz x0, #0xd1ffab1e
IN0014: 000058 movk x0, #0xd1ffab1e LSL #16
IN0015: 00005C movk x0, #0xd1ffab1e LSL #32
IN0016: 000060 bl CORINFO_HELP_UNBOX
G_M590_IG04: ; offs=000064H, size=0030H, gcrefRegs=100000 {x20}, byrefRegs=0000 {}, byref, isz
IN0017: 000064 ldr w0, [x20,#8]
IN0018: 000068 movz x1, #0xd1ffab1e
IN0019: 00006C movk x1, #0xd1ffab1e LSL #16
IN001a: 000070 movk x1, #0xd1ffab1e LSL #32
IN001b: 000074 str w0, [x1]
IN001c: 000078 add w19, w19, #1
IN001d: 00007C movz x0, #0xd1ffab1e
IN001e: 000080 movk x0, #0xd1ffab1e LSL #16
IN001f: 000084 movk x0, #0xd1ffab1e LSL #32
IN0020: 000088 ldr w0, [x0]
IN0021: 00008C cmp w19, w0
IN0022: 000090 blt G_M590_IG03
AFTER:
G_M50398_IG03: ; offs=000038H, size=001CH, bbWeight=4 PerfScore 36.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
IN000b: 000038 movz x1, #0xd1ffab1e
IN000c: 00003C movk x1, #0xd1ffab1e LSL #16
IN000d: 000040 movk x1, #0xd1ffab1e LSL #32
IN000e: 000044 ldr x22, [x1]
IN000f: 000048 ldr x1, [x22]
IN0010: 00004C cmp x1, x21
IN0011: 000050 beq G_M50398_IG05
G_M50398_IG04: ; offs=000054H, size=000CH, bbWeight=1 PerfScore 2.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref
IN0012: 000054 mov x1, x22
IN0013: 000058 mov x0, x21
IN0014: 00005C bl CORINFO_HELP_UNBOX
G_M50398_IG05: ; offs=000060H, size=0018H, bbWeight=4 PerfScore 36.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref, isz
IN0015: 000060 ldr w0, [x22,#8]
IN0016: 000064 str w0, [x20,#8]
IN0017: 000068 add w19, w19, #1
IN0018: 00006C ldr w0, [x20]
IN0019: 000070 cmp w19, w0
IN001a: 000074 blt G_M50398_IG03
Did something change in the way CORINFO_HELP_UNBOX is implemented? Do you have a differential profile?
The benchmark does a trivial unbox of an (object)1 to int:
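The shape of the benchmark is roughly as follows (a sketch, not the exact dotnet/performance source; the iteration count and field names are illustrative):

```csharp
// Rough sketch of PerfLabTests.CastingPerf2.CastingPerf.IntObj:
// unbox a boxed int in a tight loop, storing the result into a static.
public class CastingPerf
{
    public const int Iterations = 100_000;  // assumed loop count

    public static object ObjI = 1;          // boxed int
    public static int J;

    public void IntObj()
    {
        for (int i = 0; i < Iterations; i++)
            J = (int)ObjI;                  // trivial unbox: MT check + field load
    }
}
```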
I think it should not call the helper and should do the unbox completely inline. It could indeed be an effect of loop alignment on a highly sensitive benchmark.
CORINFO_HELP_UNBOX did actually change. The helper has been moved to managed code.
It should not be called in simple cases, though. It basically handles cast failures (throw) or rare cases such as unboxing an enum to its underlying type.
I can't find the comment just now, but I recall @sdmaclea saying that ARM64 didn't have the same kind of code alignment penalties that we see on xArch.
I wouldn't expect significant alignment penalties. Instructions are always 4 bytes and 4 byte aligned by definition. If a loop crosses a cache line or page boundary there might be some perf difference, but I wouldn't expect much.
The branch predictor could possibly be affected by the hash of the branch PC, but given there are only two branches in the loop, I wouldn't expect that to be the issue.
The unboxing seems more likely to be the issue.
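As a rough illustration of the cache-line point, the check below applies the offsets from the AFTER disassembly (a sketch; it assumes the method entry is cache-line aligned, which the dump does not tell us):

```csharp
using System;

class AlignmentCheck
{
    // True if code starting 'start' bytes from an aligned base and
    // 'size' bytes long crosses a cache-line boundary of width 'line'.
    static bool CrossesLine(int start, int size, int line = 64)
        => start / line != (start + size - 1) / line;

    static void Main()
    {
        // AFTER loop: starts at offset 0x38, total size 0x40 (IG03+IG04+IG05).
        Console.WriteLine(CrossesLine(0x38, 0x40));  // True: spans two lines
    }
}
```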
I've not been able to figure out how to debug the benchmark as run by the perf harness, but I've extracted the benchmark method, and verified that the helper is never called. Unless somehow different code is generated, there must be something else going on.
Based on the results here, it looks like there is some modality that seems likely to be microarchitectural. I note that the oscillation seemed to increase around the time of https://github.com/dotnet/runtime/pull/39096, which enabled the CSE of these large constants, though it doesn't seem to coincide precisely.
I would guess those branches are not well predicted. So minimizing branch mispredict recovery time would be important.
I am wondering if the uArch modality has to do with fetch group alignment... Like you first suggested...
I haven't looked at the source, but looking at the disassembly, it looks to me like a lot of G_M50398_IG03 is constant and could be hoisted out of the loop, especially if the register allocator could free another preserved register for the dst of IN000f: 000048 ldr x1, [x22].
it looks to me like a lot of G_M50398_IG03 is const and could be hoisted out of the loop.
We improved the CSE'ing of constants in .NET 5, so the "AFTER" loop has only one large constant. It's not hoisted out of the loop because it is marked as being dependent on the class constructor and therefore not hoistable. This is the address of a class static, so it's unclear to me why it's not hoistable. @briansull @AndyAyersMS - can you enlighten me why such a constant would not be hoistable?
@TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?
I should also note that, when I extracted the benchmark method and added a timer using Stopwatch, the performance was basically the same between 3.1 and 5.0.
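For reference, that standalone measurement looked roughly like this (a minimal sketch; the iteration count and names are illustrative, not the exact code used):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static object o = 1;   // boxed int, mirroring the benchmark
    static int j;

    static void Main()
    {
        const int Iterations = 100_000_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            j = (int)o;    // the unbox under test
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds * 1_000_000 / Iterations:F2} ns per iteration");
    }
}
```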
@TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?
@CarolEidt that looks to me like what you suspected: loop alignment. Most uArchs have alignment requirements for branch targets (for performance, not correctness); newer Cortex-A cores generally prefer 32-byte alignment for branch targets. See, for instance, the Neoverse-N1 optimization guide, section 4.8 "Branch instruction alignment", for some of the requirements. I believe your CIs run XGene? In GCC these are the alignment requirements we have for it.
What we've observed is that for small loops the alignment makes a big difference, and while both loops have misaligned targets, the second loop is much smaller, so it would be more sensitive.
Thanks so much @TamarChristinaArm - I propose that we either close this or mark it "Future" and take it into consideration when/if we address #8108.
Thoughts @adamsitnik ?
I can try to replicate what I did for #2249 to confirm that some of the issues we're seeing now are indeed alignment. In the meantime, you might be able to use perf stat to confirm that what you're seeing are IPC issues and not additional instructions. See https://github.com/dotnet/runtime/issues/41741#issuecomment-686816889 for one example of this.
In the short run we should figure out the proper method entry alignment -- we can't do anything about internal alignments until we fix this. Then we can at least enable the (optional) loop alignment for ARM (implement emitLoopAlign and any supporting bits). Perhaps extending #2249 to arm64 is the simplest thing?
@TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?
@AndyAyersMS - it would be interesting to try implementing #2249 for arm64. We've marked this 6.0.0, though, as it seems that modifying alignment would require significant perf analysis to ensure that we understand the impact.
Here's the output from perf stat -e "branch-misses,cache-misses,cpu-cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend" (the first run is 5.0 and the second is 3.1):
Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp5.0 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':
927,236,914 branch-misses
435,682,900 cache-misses
90,726,403,850 cpu-cycles
59,721,463,264 instructions # 0.66 insn per cycle
# 0.67 stalled cycles per insn
40,144,534,076 stalled-cycles-frontend # 44.25% frontend cycles idle
26,028,533,828 stalled-cycles-backend # 28.69% backend cycles idle
36.046207216 seconds time elapsed
Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp3.1 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':
746,009,625 branch-misses
347,766,489 cache-misses
76,174,119,876 cpu-cycles
84,227,660,637 instructions # 1.11 insn per cycle
# 0.38 stalled cycles per insn
32,418,821,468 stalled-cycles-frontend # 42.56% frontend cycles idle
14,048,337,909 stalled-cycles-backend # 18.44% backend cycles idle
@TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?
No, we just emit multiple NOPs. For the larger alignment constraints like 32, we only align if it means adding less than 16 bytes of padding. For the smaller ones we generally always align.
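Expressed as code, that heuristic looks roughly like this (a sketch of the rule as described above; the 16-byte threshold comes from the comment, and offsets are illustrative):

```csharp
using System;

class PaddingRule
{
    // Pad a branch target up to 'alignment' bytes only when the cost
    // (in NOP bytes) stays under 'maxPad'; otherwise leave it as-is.
    static int PaddingFor(int offset, int alignment = 32, int maxPad = 16)
    {
        int pad = (alignment - offset % alignment) % alignment;
        return pad < maxPad ? pad : 0;
    }

    static void Main()
    {
        Console.WriteLine(PaddingFor(0x38)); // 8: pad with two NOPs
        Console.WriteLine(PaddingFor(0x48)); // 0: 24 bytes of padding is too costly
    }
}
```

Note that for smaller alignments (4 or 8 bytes) the required padding is always under 16 bytes, which matches "for the smaller ones we generally always align".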
It appears the consensus here is the issue is loop alignment, for which we already have linked issues tracking the work, so I'm going to close this. If that's incorrect, then feel free to re-open with a clear note about what unique work/issue this will address.
On my Windows x64 machine, when I tested this benchmark with my changes in https://github.com/dotnet/runtime/pull/44370, I did see this benchmark improve:

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| PerfLabTests.CastingPerf.IntObj | 1.04 | 368774.36 | 355509.29 | |
On my Windows x64
@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?
On my Windows x64
@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?
Ah, that's true. Currently my loop alignment work is for xarch, but once I get it for ARM, I will verify this.
After running benchmarks for 3.1 vs 5.0 using the "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found out that PerfLabTests.CastingPerf2.CastingPerf.IntObj has regressed 2x. It looks like this is an ARM64-specific regression; I was not able to reproduce it for ARM (the 32-bit variant).
Repro:
python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
cc @kunalspathak
category:cq theme:needs-triage skill-level:expert cost:large