Quuxplusone / LLVMBugzillaTest


Multiple calls to llvm.prefetch.p0i8 on aarch64 can cause apparently-unnecessary register spills #50141

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR51172
Status NEW
Importance P enhancement
Reported by Steven Johnson (srj@google.com)
Reported on 2021-07-22 15:03:46 -0700
Last modified on 2021-09-01 12:32:59 -0700
Version trunk
Hardware PC All
CC alina.sbirlea@gmail.com, bradley.smith@arm.com, dsharlet@google.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, maskray@google.com, Paul.Walker@arm.com, resistor@mac.com, sjoerd.meijer@arm.com, t.p.northover@gmail.com
Fixed by commit(s)
Attachments prefetch.ll (98694 bytes, text/plain)
prefetch_2.ll (30317 bytes, text/plain)
prfm-may-store.diff (2362 bytes, text/plain)
Blocks
Blocked by
See also

Steps to repeat:

(1) See the enclosed file prefetch.ll -- notice that there are several calls to @llvm.prefetch.p0i8, with all but the first commented out. (A sketch of the general shape of these calls appears after the listings below.)

(2) Compile to aarch64 assembly with llc -march=aarch64 ~/prefetch.ll -o - -O3 -mattr=+dotprod > ~/prefetch.s

(3) Examine the output, search for prfm, and note a block similar to:

    add x27, x26, x16
    prfm    pldl1strm, [x27]
    ldr q28, [x27]
    ldp q8, q29, [x12, #-32]
    ldr q31, [x28, x16]
    ldr q10, [x29, x16]
    ldr q11, [x30, x16]
    ldr q12, [x8, x16]
    ldp q9, q30, [x18, #-32]
    udot    v26.4s, v8.16b, v28.4b[0]
    udot    v24.4s, v8.16b, v31.4b[0]
    udot    v22.4s, v8.16b, v10.4b[0]
    udot    v20.4s, v8.16b, v11.4b[0]
    udot    v18.4s, v8.16b, v12.4b[0]

(4) Edit prefetch.ll to uncomment the call to llvm.prefetch.p0i8 on line 459, re-run llc

(5) Search for prfm again, and note that the start of the block now contains numerous vector-register spills that appear to be completely unnecessary:

    add x27, x26, x16
    prfm    pldl1strm, [x27]
    ldp q31, q30, [x18]
    ldr q0, [x12]
    ldr q1, [x27]
    str q30, [sp, #496]                 // 16-byte Folded Spill
    ldr q30, [x21]
    str q0, [sp, #400]                  // 16-byte Folded Spill
    ldr q0, [x12, #16]
    add x27, x28, x16
    stp q31, q30, [sp, #416]            // 32-byte Folded Spill
    ldr q30, [x21, #16]
    ldp q11, q29, [x12, #-32]
    str q0, [sp, #512]                  // 16-byte Folded Spill
    ldp q10, q0, [x18, #-32]
    str q30, [sp, #480]                 // 16-byte Folded Spill
    ldp q30, q31, [x14, #-32]
    ldp q8, q15, [x21, #-32]
    udot    v17.4s, v10.16b, v1.4b[0]
    udot    v17.4s, v0.16b, v1.4b[1]
    str q31, [sp, #384]                 // 16-byte Folded Spill
    ldr q31, [x14]
    udot    v16.4s, v30.16b, v1.4b[0]
    udot    v26.4s, v11.16b, v1.4b[0]
    udot    v26.4s, v29.16b, v1.4b[1]
    str q31, [sp, #448]                 // 16-byte Folded Spill
    ldr q31, [x14, #16]
    udot    v27.4s, v8.16b, v1.4b[0]
    udot    v27.4s, v15.16b, v1.4b[1]
    subs    x20, x20, #1                    // =1
    str q31, [sp, #464]                 // 16-byte Folded Spill
    prfm    pldl1strm, [x27]
    ldr q9, [x27]
    ldr q31, [x29, x16]
    ldr q12, [x30, x16]
    ldr q13, [x8, x16]
    udot    v2.4s, v10.16b, v9.4b[0]
    udot    v3.4s, v10.16b, v31.4b[0]
    udot    v14.4s, v10.16b, v12.4b[0]
    udot    v4.4s, v10.16b, v13.4b[0]
    udot    v2.4s, v0.16b, v9.4b[1]
    udot    v3.4s, v0.16b, v31.4b[1]
    udot    v14.4s, v0.16b, v12.4b[1]
    udot    v4.4s, v0.16b, v13.4b[1]

These spills don't seem to make any sense: the only instruction that should have been added here is the second prfm, and it doesn't depend on any of the vector registers being spilled and reloaded. Is something about the prefetch affecting this (e.g., confusing the live-range analysis for the registers loaded from the prefetched location)?
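For reference, the calls being toggled in prefetch.ll have roughly the following shape. This is only a minimal sketch; the function name, pointer name, and the integer operands shown here (rw = 0 for a read, locality = 0 for streaming, cache type = 1 for data) are illustrative rather than copied from the attachment:

    declare void @llvm.prefetch.p0i8(i8* nocapture readonly, i32, i32, i32)

    define void @prefetch_example(i8* %p) {
      ; read prefetch, streaming locality, data cache; on aarch64 this
      ; combination should lower to a "prfm pldl1strm" instruction
      call void @llvm.prefetch.p0i8(i8* %p, i32 0, i32 0, i32 1)
      ret void
    }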

Quuxplusone commented 3 years ago

Attached prefetch.ll (98694 bytes, text/plain): prefetch.ll

Quuxplusone commented 3 years ago

Hi all, is there any chance we could get some help with this bug? Fixing or working around it would unblock some further experiments on our end, which would really help us. I am hoping the fix might be as simple as telling LLVM that prefetch doesn't affect vector registers; if that makes sense, could someone point us in the right direction?

Quuxplusone commented 3 years ago

+arm folks who may be better suited to redirect this.

Quuxplusone commented 3 years ago

Pinging this a couple of weeks later to see if anyone cc'ed on this bug can help point us in the right direction (or point us at someone who could).

Quuxplusone commented 3 years ago

I had a very quick look a couple of weeks back. The attached test case is still pretty big; you might have more luck drumming up support if you can reduce it to something simpler.

Though it's quite possible that reducing it is just as difficult as solving it. If you've tried and failed, it'd be worth mentioning; it might get you some sympathy.

Quuxplusone commented 3 years ago

Attached prefetch_2.ll (30317 bytes, text/plain): Smaller repro case

Quuxplusone commented 3 years ago

Sorry, I didn't realize the size of the example was the issue. I've enclosed one that is about half the size of the first one. Instructions are the same as before:

(1) llc -march=aarch64 ~/prefetch_2.ll -o - -O3 -mattr=+dotprod > ~/prefetch_2.s

(2) Note that the inner loop (for convolved.s1.rdom$x) has a single prfm and no spills

(3) Uncomment the call to llvm.prefetch.p0i8 on line 206 of prefetch_2.ll, re-run llc

(4) Note that in addition to a second prfm, there are also several spills/reloads in the inner loop now, but none seem to be related to the new prfm instruction.

Quuxplusone commented 3 years ago
Looking over the disassembly of prefetch_2.ll in more detail, this makes even
less sense (see below): q0 appears to be "spilled" several times in the inner
loop but never reloaded (at least not in any way I can see).

.LBB0_2:                                // %"for convolved.s1.rdom$x"
                                        //   Parent Loop BB0_1 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
    prfm    pldl1strm, [x4]
    ldp q11, q0, [x2]
    add x5, x2, x11
    ldp q13, q31, [x5]
    ldr q24, [x4]
    str q0, [sp, #16]                   // 16-byte Folded Spill
    ldr q0, [x2, #32]
    subs    x3, x3, #1
    udot    v23.4s, v11.16b, v24.4b[0]
    udot    v21.4s, v13.16b, v24.4b[0]
    str q0, [sp, #32]                   // 16-byte Folded Spill
    ldr q0, [x2, #48]
    udot    v21.4s, v31.16b, v24.4b[1]
    add x2, x2, x13
    str q0, [sp, #48]                   // 16-byte Folded Spill
    ldp q29, q0, [x5, #32]
    add x5, x5, x11
    ldp q26, q10, [x5]
    ldp q8, q30, [x5, #32]
    add x5, x5, x11
    ldp q25, q15, [x5]
    ldp q12, q9, [x5, #32]
    add x5, x4, x12
    str q0, [sp]                        // 16-byte Folded Spill
    prfm    pldl1strm, [x5]
    ldr q14, [x5]
    ldr q0, [x4, x16]
    ldr q27, [x4, x17]
    udot    v20.4s, v25.16b, v24.4b[0]
    udot    v7.4s, v25.16b, v14.4b[0]
    udot    v2.4s, v25.16b, v0.4b[0]
    udot    v3.4s, v25.16b, v27.4b[0]
    ldr q25, [sp, #16]                  // 16-byte Folded Reload
    udot    v17.4s, v11.16b, v14.4b[0]
    udot    v16.4s, v11.16b, v0.4b[0]
    udot    v4.4s, v11.16b, v27.4b[0]
    udot    v23.4s, v25.16b, v24.4b[1]
    udot    v17.4s, v25.16b, v14.4b[1]
    udot    v16.4s, v25.16b, v0.4b[1]
    udot    v4.4s, v25.16b, v27.4b[1]
    ldr q25, [sp, #32]                  // 16-byte Folded Reload
    udot    v22.4s, v26.16b, v24.4b[0]
    udot    v19.4s, v26.16b, v14.4b[0]
    udot    v18.4s, v26.16b, v0.4b[0]
    udot    v5.4s, v26.16b, v27.4b[0]
    udot    v23.4s, v25.16b, v24.4b[2]
    udot    v17.4s, v25.16b, v14.4b[2]
    udot    v16.4s, v25.16b, v0.4b[2]
    udot    v4.4s, v25.16b, v27.4b[2]
    ldr q25, [sp, #48]                  // 16-byte Folded Reload
    ldr q26, [sp]                       // 16-byte Folded Reload
    udot    v6.4s, v13.16b, v14.4b[0]
    udot    v28.4s, v13.16b, v0.4b[0]
    udot    v1.4s, v13.16b, v27.4b[0]
    udot    v6.4s, v31.16b, v14.4b[1]
    udot    v28.4s, v31.16b, v0.4b[1]
    udot    v1.4s, v31.16b, v27.4b[1]
    udot    v22.4s, v10.16b, v24.4b[1]
    udot    v19.4s, v10.16b, v14.4b[1]
    udot    v18.4s, v10.16b, v0.4b[1]
    udot    v5.4s, v10.16b, v27.4b[1]
    udot    v20.4s, v15.16b, v24.4b[1]
    udot    v7.4s, v15.16b, v14.4b[1]
    udot    v2.4s, v15.16b, v0.4b[1]
    udot    v3.4s, v15.16b, v27.4b[1]
    udot    v21.4s, v29.16b, v24.4b[2]
    udot    v6.4s, v29.16b, v14.4b[2]
    udot    v28.4s, v29.16b, v0.4b[2]
    udot    v1.4s, v29.16b, v27.4b[2]
    udot    v22.4s, v8.16b, v24.4b[2]
    udot    v19.4s, v8.16b, v14.4b[2]
    udot    v18.4s, v8.16b, v0.4b[2]
    udot    v5.4s, v8.16b, v27.4b[2]
    udot    v20.4s, v12.16b, v24.4b[2]
    udot    v7.4s, v12.16b, v14.4b[2]
    udot    v2.4s, v12.16b, v0.4b[2]
    udot    v3.4s, v12.16b, v27.4b[2]
    udot    v23.4s, v25.16b, v24.4b[3]
    udot    v21.4s, v26.16b, v24.4b[3]
    udot    v22.4s, v30.16b, v24.4b[3]
    udot    v20.4s, v9.16b, v24.4b[3]
    udot    v17.4s, v25.16b, v14.4b[3]
    udot    v6.4s, v26.16b, v14.4b[3]
    udot    v19.4s, v30.16b, v14.4b[3]
    udot    v7.4s, v9.16b, v14.4b[3]
    udot    v16.4s, v25.16b, v0.4b[3]
    udot    v28.4s, v26.16b, v0.4b[3]
    udot    v18.4s, v30.16b, v0.4b[3]
    udot    v2.4s, v9.16b, v0.4b[3]
    udot    v4.4s, v25.16b, v27.4b[3]
    udot    v1.4s, v26.16b, v27.4b[3]
    udot    v5.4s, v30.16b, v27.4b[3]
    udot    v3.4s, v9.16b, v27.4b[3]
    mov x4, x5
    b.ne    .LBB0_2

Quuxplusone commented 3 years ago

After more debugging, I'm now wondering if this may not be a bug at all, but a side-effect of instruction scheduling.

The prfm instruction is modeled as having side effects (which is true, and necessary, since otherwise it would be optimized away entirely), and this prevents the optimizer from aggressively moving the surrounding instructions around.

I think what may be happening here is that in the no-prefetch case (or with a single prefetch at the top of the loop), the machine-instruction scheduler can shuffle instructions around effectively, whereas the extra prfm effectively splits the scheduling of the loop body in two, so we end up with suboptimal register allocation across the split, either out of necessity (we literally run out of registers) or simply because optimal register allocation is hard.

One workaround I'm looking at is pushing all of the prfm instructions to the top of the inner loop, in the hope that the rest of the loop can then avoid spills; this isn't ideal, of course, since interleaving the prefetches close to the loads that will consume the prefetched data would be preferable.
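For illustration only, here is a sketch (with invented pointer and function names, not taken from the attached files) of what that hoisted shape looks like at the IR level: every prefetch call sits at the top of the unrolled loop body, ahead of the block of vector loads, instead of being interleaved immediately before the loads it is meant to cover.

    declare void @llvm.prefetch.p0i8(i8* nocapture readonly, i32, i32, i32)

    define void @hoisted_body(i8* %p0, i8* %p1, <16 x i8>* %q0, <16 x i8>* %q1) {
      ; all prfm-producing calls grouped at the top of the body...
      call void @llvm.prefetch.p0i8(i8* %p0, i32 0, i32 0, i32 1)
      call void @llvm.prefetch.p0i8(i8* %p1, i32 0, i32 0, i32 1)
      ; ...so the vector loads (and the udot arithmetic that consumes them)
      ; are no longer separated by side-effecting instructions
      %v0 = load <16 x i8>, <16 x i8>* %q0, align 1
      %v1 = load <16 x i8>, <16 x i8>* %q1, align 1
      ret void
    }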

Despite my suspicions, I'm going to keep this issue open for the time being, as I'm still not 100% convinced that the situation can't be improved.

Quuxplusone commented 3 years ago

That does sound plausible. I suppose a "proper" fix would be to add hasPerfSideEffects (a.k.a. "please don't delete me") as another level that intrinsics might sit at.

Quuxplusone commented 3 years ago

Would it be viable to mark prefetches as mayLoad rather than hasSideEffects?

Quuxplusone commented 3 years ago

I did wonder that, but my bet is that a mayLoad instruction which doesn't actually produce any values is fair game for dead-code removal. We could change that edge case, of course.

Quuxplusone commented 3 years ago

I wouldn't bet on that. At the codegen layer, I don't think we can remove dead loads because we might be changing trap behavior.

Quuxplusone commented 3 years ago

mayStore might work as well.

Quuxplusone commented 3 years ago

Thanks for the tips, I'll try out mayLoad and/or mayStore and see how the results look.

Quuxplusone commented 3 years ago

Attached prfm-may-store.diff (2362 bytes, text/plain): Patch to mark prfm as mayStore instead of hasSideEffects

Quuxplusone commented 3 years ago
FYI: I made an experimental change to Halide's codegen so that all our calls to
the Prefetch primitive in LLVM get hoisted to the top of the enclosing loop
(rather than interleaved with other instructions in the loop).

The good news: as I suspected, this eliminates all the spills I saw before, and
I end up with nice, clean aarch64 assembly that looks something like the code
below.

The not-good news: unsurprisingly, pushing the prefetch calls further away from
the instructions that actually need the prefetched data dramatically degrades
any benefit the prefetch would give us. (This varies somewhat depending on the
type of ARM core running the code; e.g. on an A55 core I saw essentially no
speed difference with vs. without prefetches.)

Because of the limited performance improvement, I don't think this approach is
likely to be worth pursuing as a long-term fix, unfortunately.

I'd still love to find a way to insert prefetch calls in an interleaved fashion
without injecting unnecessary spills, but it's not clear to me how doable this
is without nontrivial effort.

(snippet:)

    .LBB0_103:            // %"for convolved.s1.r19$z.r124.us.us.us"
    add x4, x18, x30
    add x13, x11, x30
    add x6, x14, x30
    add x15, x8, x30
    add x26, x28, x30
    prfm    pldl1strm, [x4, #16]
    prfm    pldl1strm, [x13, #16]
    prfm    pldl1strm, [x6, #16]
    prfm    pldl1strm, [x15, #16]
    prfm    pldl1strm, [x26, #16]
    ldr q7, [x4]
    ldp q19, q16, [x3, #-32]
    ldr q18, [x13]
    ldr q2, [x6]
    ldr q4, [x15]
    ldr q0, [x26]
    ldp q1, q17, [x19, #-32]
    udot    v12.4s, v19.16b, v7.4b[0]
    udot    v8.4s, v19.16b, v18.4b[0]
    udot    v28.4s, v19.16b, v2.4b[0]
    udot    v24.4s, v19.16b, v4.4b[0]
    udot    v20.4s, v19.16b, v0.4b[0]
    ldp q3, q19, [x24, #-32]
    udot    v13.4s, v1.16b, v7.4b[0]
    udot    v9.4s, v1.16b, v18.4b[0]
    udot    v29.4s, v1.16b, v2.4b[0]
    udot    v14.4s, v3.16b, v7.4b[0]
    udot    v10.4s, v3.16b, v18.4b[0]
    udot    v30.4s, v3.16b, v2.4b[0]
    udot    v25.4s, v1.16b, v4.4b[0]
    udot    v21.4s, v1.16b, v0.4b[0]
    udot    v26.4s, v3.16b, v4.4b[0]
    udot    v22.4s, v3.16b, v0.4b[0]
        ...etc...

Quuxplusone commented 3 years ago
There are probably two things going on here: how the prefetch intrinsic affects
codegen, as has been discussed so far, and the potential performance uplift.

Software prefetching is probably completely counterproductive if the very next
instruction performs the load:

    prfm    pldl1strm, [x27]
    ldr q28, [x27]

Modern cores and prefetchers should hopefully predict this access, making the
prefetch instruction unnecessary and possibly even harmful to performance. On
smaller cores with less sophisticated prefetchers there might still be a use
case for it, though. This is all very hand-wavy (and unfortunately not described
in e.g. the Cortex software optimisation guides), but basically what I want to
say is that it is worth experimenting to check whether you actually need the
prefetches and whether they help your case.
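For what it's worth, the more usual pattern is to prefetch some distance ahead of the address the current iteration actually loads, so the prefetch has time to do useful work. A minimal sketch follows; the 256-byte distance and all names here are invented and would need tuning for a real kernel:

    declare void @llvm.prefetch.p0i8(i8* nocapture readonly, i32, i32, i32)

    define <16 x i8> @load_with_prefetch_ahead(i8* %p) {
      ; prefetch a hypothetical 256 bytes past the data used in this iteration
      %ahead = getelementptr i8, i8* %p, i64 256
      call void @llvm.prefetch.p0i8(i8* %ahead, i32 0, i32 0, i32 1)
      ; load the current iteration's data from the original address
      %vp = bitcast i8* %p to <16 x i8>*
      %v = load <16 x i8>, <16 x i8>* %vp, align 1
      ret <16 x i8> %v
    }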

Quuxplusone commented 3 years ago

Thanks for pointing that out; that was a bug that we've hopefully since fixed. When troubleshooting performance, the first thing that stood out was the stack spills; I didn't get as far as noticing that the prefetch and the load were from the same address.