Open Quuxplusone opened 3 years ago
Attached prefetch.ll
(98694 bytes, text/plain): prefetch.ll
Hi all, is there any chance we could get some help with this bug? Fixing or working around it would unblock some further experiments for us. I'm hoping the fix is something like telling LLVM that prefetch doesn't affect vector registers; if that makes sense, could someone point us in the right direction?
+arm folks who may be better suited to redirect this.
Pinging this a couple of weeks later to see if anyone cc'ed on this bug can help point us in the right direction (or point us at someone who could).
I had a very quick look a couple of weeks back. The attached test-case is still pretty big. You might have more luck drumming up support if you can reduce it to something simpler.
Though it's quite possible reducing it is just as difficult as solving it. If you've tried and failed it'd be worth mentioning, might get some sympathy.
Attached prefetch_2.ll
(30317 bytes, text/plain): Smaller repro case
Sorry, I didn't realize the size of the example was the issue. I've enclosed one that is about half the size of the first one. Instructions are the same as before:
(1) llc -march=aarch64 ~/prefetch_2.ll -o - -O3 -mattr=+dotprod > ~/prefetch_2.s
(2) Note that the inner loop (for convolved.s1.rdom$x) has a single prfm and no spills
(3) Uncomment the call to llvm.prefetch.p0i8 on line 206 of prefetch_2.ll, re-run llc
(4) Note that in addition to a second prfm, there are also several spills/reloads in the inner loop now, but none seem to be related to the new prfm instruction.
Looking over the disassembly of prefetch_2.ll in more detail, it makes even
less sense (see below); it appears that q0 is "spilled" several times in the
inner loop, but never reloaded (at least not in a way I can see).
.LBB0_2: // %"for convolved.s1.rdom$x"
// Parent Loop BB0_1 Depth=1
// => This Inner Loop Header: Depth=2
prfm pldl1strm, [x4]
ldp q11, q0, [x2]
add x5, x2, x11
ldp q13, q31, [x5]
ldr q24, [x4]
str q0, [sp, #16] // 16-byte Folded Spill
ldr q0, [x2, #32]
subs x3, x3, #1
udot v23.4s, v11.16b, v24.4b[0]
udot v21.4s, v13.16b, v24.4b[0]
str q0, [sp, #32] // 16-byte Folded Spill
ldr q0, [x2, #48]
udot v21.4s, v31.16b, v24.4b[1]
add x2, x2, x13
str q0, [sp, #48] // 16-byte Folded Spill
ldp q29, q0, [x5, #32]
add x5, x5, x11
ldp q26, q10, [x5]
ldp q8, q30, [x5, #32]
add x5, x5, x11
ldp q25, q15, [x5]
ldp q12, q9, [x5, #32]
add x5, x4, x12
str q0, [sp] // 16-byte Folded Spill
prfm pldl1strm, [x5]
ldr q14, [x5]
ldr q0, [x4, x16]
ldr q27, [x4, x17]
udot v20.4s, v25.16b, v24.4b[0]
udot v7.4s, v25.16b, v14.4b[0]
udot v2.4s, v25.16b, v0.4b[0]
udot v3.4s, v25.16b, v27.4b[0]
ldr q25, [sp, #16] // 16-byte Folded Reload
udot v17.4s, v11.16b, v14.4b[0]
udot v16.4s, v11.16b, v0.4b[0]
udot v4.4s, v11.16b, v27.4b[0]
udot v23.4s, v25.16b, v24.4b[1]
udot v17.4s, v25.16b, v14.4b[1]
udot v16.4s, v25.16b, v0.4b[1]
udot v4.4s, v25.16b, v27.4b[1]
ldr q25, [sp, #32] // 16-byte Folded Reload
udot v22.4s, v26.16b, v24.4b[0]
udot v19.4s, v26.16b, v14.4b[0]
udot v18.4s, v26.16b, v0.4b[0]
udot v5.4s, v26.16b, v27.4b[0]
udot v23.4s, v25.16b, v24.4b[2]
udot v17.4s, v25.16b, v14.4b[2]
udot v16.4s, v25.16b, v0.4b[2]
udot v4.4s, v25.16b, v27.4b[2]
ldr q25, [sp, #48] // 16-byte Folded Reload
ldr q26, [sp] // 16-byte Folded Reload
udot v6.4s, v13.16b, v14.4b[0]
udot v28.4s, v13.16b, v0.4b[0]
udot v1.4s, v13.16b, v27.4b[0]
udot v6.4s, v31.16b, v14.4b[1]
udot v28.4s, v31.16b, v0.4b[1]
udot v1.4s, v31.16b, v27.4b[1]
udot v22.4s, v10.16b, v24.4b[1]
udot v19.4s, v10.16b, v14.4b[1]
udot v18.4s, v10.16b, v0.4b[1]
udot v5.4s, v10.16b, v27.4b[1]
udot v20.4s, v15.16b, v24.4b[1]
udot v7.4s, v15.16b, v14.4b[1]
udot v2.4s, v15.16b, v0.4b[1]
udot v3.4s, v15.16b, v27.4b[1]
udot v21.4s, v29.16b, v24.4b[2]
udot v6.4s, v29.16b, v14.4b[2]
udot v28.4s, v29.16b, v0.4b[2]
udot v1.4s, v29.16b, v27.4b[2]
udot v22.4s, v8.16b, v24.4b[2]
udot v19.4s, v8.16b, v14.4b[2]
udot v18.4s, v8.16b, v0.4b[2]
udot v5.4s, v8.16b, v27.4b[2]
udot v20.4s, v12.16b, v24.4b[2]
udot v7.4s, v12.16b, v14.4b[2]
udot v2.4s, v12.16b, v0.4b[2]
udot v3.4s, v12.16b, v27.4b[2]
udot v23.4s, v25.16b, v24.4b[3]
udot v21.4s, v26.16b, v24.4b[3]
udot v22.4s, v30.16b, v24.4b[3]
udot v20.4s, v9.16b, v24.4b[3]
udot v17.4s, v25.16b, v14.4b[3]
udot v6.4s, v26.16b, v14.4b[3]
udot v19.4s, v30.16b, v14.4b[3]
udot v7.4s, v9.16b, v14.4b[3]
udot v16.4s, v25.16b, v0.4b[3]
udot v28.4s, v26.16b, v0.4b[3]
udot v18.4s, v30.16b, v0.4b[3]
udot v2.4s, v9.16b, v0.4b[3]
udot v4.4s, v25.16b, v27.4b[3]
udot v1.4s, v26.16b, v27.4b[3]
udot v5.4s, v30.16b, v27.4b[3]
udot v3.4s, v9.16b, v27.4b[3]
mov x4, x5
b.ne .LBB0_2
After more debugging, I'm now wondering if this may not be a bug at all, but a side-effect of instruction scheduling.
The prfm instruction is modeled as "has side effects" (which is true and necessary, otherwise it would get optimized away entirely), which constrains the optimizer from being aggressive about moving around surrounding instructions.
I think what may be happening here is this: in the no-prefetch case (or with a single prefetch at the top of the loop), the machine-instruction scheduler is able to shuffle instructions around effectively, whereas the extra prfm splits the loop body into two scheduling regions, so we end up with suboptimal register allocation across them, either out of necessity (we literally run out of registers) or simply because optimal register allocation is hard.
One workaround I'm looking at is pushing all of the prfm instructions to the top of the inner loop, in the hope that the rest of the loop can avoid spills. This isn't ideal, of course, as interleaving the prefetch instructions would be preferable for fairly obvious reasons.
Despite my suspicions, I'm going to keep this issue open for the time being, as I'm still not 100% convinced that the situation can't be improved.
That does sound plausible. I suppose a "proper" fix would be to add hasPerfSideEffects (a.k.a. "please don't delete me") as another level that intrinsics might sit at.
Would it be viable to mark prefetches as mayLoad rather than hasSideEffects?
I did wonder that, but my bet is that a mayLoad that doesn't actually produce any values is fair game for dead-code removal. We could change that edge case, of course.
I wouldn't bet on that. At the codegen layer, I don't think we can remove dead loads because we might be changing trap behavior.
mayStore might work as well.
Thanks for the tips, I'll try out mayLoad and/or mayStore and see how the results look.
Attached prfm-may-store.diff
(2362 bytes, text/plain): Patch to mark prfm as mayStore instead of hasSideEffects
FYI: I made an experimental change to Halide's codegen so that all our calls to
the Prefetch primitive in LLVM get hoisted to the top of the enclosing loop
(rather than interleaved with other instructions in the loop).
The good news: as I suspected, this eliminates all the spills I see before, and
I end up with nice clean aarch64 assembly that looks something like the code
below.
The not-good news: unsurprisingly, pushing the prefetch calls further away from
the instructions that actually need the prefetch result dramatically degrades
any benefit the prefetch would give us. (This varies somewhat depending on the
ARM core running the code; e.g. on a Cortex-A55, I saw essentially no speed
difference with vs. without prefetches.)
Because of the limited performance improvement, I don't think this approach is
likely to be worth pursuing as a long-term fix, unfortunately.
I'd still love to find a way to insert prefetch calls in an interleaved fashion
without injecting unnecessary spills, but it's not clear to me how doable this
is without nontrivial effort.
(snippet:)
.LBB0_103: // %"for convolved.s1.r19$z.r124.us.us.us"
add x4, x18, x30
add x13, x11, x30
add x6, x14, x30
add x15, x8, x30
add x26, x28, x30
prfm pldl1strm, [x4, #16]
prfm pldl1strm, [x13, #16]
prfm pldl1strm, [x6, #16]
prfm pldl1strm, [x15, #16]
prfm pldl1strm, [x26, #16]
ldr q7, [x4]
ldp q19, q16, [x3, #-32]
ldr q18, [x13]
ldr q2, [x6]
ldr q4, [x15]
ldr q0, [x26]
ldp q1, q17, [x19, #-32]
udot v12.4s, v19.16b, v7.4b[0]
udot v8.4s, v19.16b, v18.4b[0]
udot v28.4s, v19.16b, v2.4b[0]
udot v24.4s, v19.16b, v4.4b[0]
udot v20.4s, v19.16b, v0.4b[0]
ldp q3, q19, [x24, #-32]
udot v13.4s, v1.16b, v7.4b[0]
udot v9.4s, v1.16b, v18.4b[0]
udot v29.4s, v1.16b, v2.4b[0]
udot v14.4s, v3.16b, v7.4b[0]
udot v10.4s, v3.16b, v18.4b[0]
udot v30.4s, v3.16b, v2.4b[0]
udot v25.4s, v1.16b, v4.4b[0]
udot v21.4s, v1.16b, v0.4b[0]
udot v26.4s, v3.16b, v4.4b[0]
udot v22.4s, v3.16b, v0.4b[0]
...etc...
There are probably two things going on here: how the prefetch intrinsic affects
codegen, which has been discussed so far, and the potential performance uplift.
Software prefetching is probably completely counterproductive if the very next
instruction performs the load:
prfm pldl1strm, [x27]
ldr q28, [x27]
Modern cores and prefetchers should hopefully predict this access, making the
prefetch instruction unnecessary, and possibly even hurting performance. On
smaller cores with less sophisticated prefetchers, there might be a use case
for this, though. This is all very hand-wavy (and unfortunately not described
in e.g. the Cortex software optimisation guides), but basically what I want to
say is that it is worth experimenting to see whether you actually need this at
all and whether it helps your case.
Thanks for pointing that out; that was a bug that we've hopefully since fixed. When troubleshooting performance, the first thing that stood out was the stack spills, so I didn't get as far as noticing that the prefetch and the load were from the same address.
Steps to repeat:
(1) See enclosed file prefetch.ll -- notice that there are several calls to @llvm.prefetch.p0i8, with all but the first commented out
(2) Compile to aarch64 assembly with llc -march=aarch64 ~/prefetch.ll -o - -O3 -mattr=+dotprod > ~/prefetch.s
(3) Examine the output, search for prfm, and see a block similar to the one quoted earlier
(4) Edit prefetch.ll to uncomment the call to llvm.prefetch.p0i8 on line 459, re-run llc
(5) Search for prfm again, and note that the start of the block now contains numerous vector-register spills that appear to be completely unnecessary.
These spills don't seem to make any sense: the only instruction that should have been added here is the second prfm instruction, and it doesn't depend on any of the vector registers being spilled and reloaded. Is something about prefetch affecting this (e.g., confusing the lifetime analysis for registers loaded from the prefetch location)?