[x86, loop vectorizer] Smaller VF preferred when VFs have the same cost #34661

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link: PR35687
Status: NEW
Importance: P enhancement
Reported by: Daniel Neilson (ddneilson@ieee.org)
Reported on: 2017-12-18 10:44:45 -0800
Last modified on: 2020-04-29 02:22:37 -0700
Version: trunk
Hardware: PC All
CC: efriedma@quicinc.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk
Fixed by commit(s):
Attachments: same-cost-vf.ll (2894 bytes, text/plain)
             with-rL317576.out (29025 bytes, text/plain)
             without-rL317576.out (29543 bytes, text/plain)
             debug-diff.out (2458 bytes, text/plain)
Blocks:
Blocked by:
See also:

Created attachment 19572
IR to demonstrate

The attached IR was distilled from one of our internal tests, which degraded by
~50% when https://reviews.llvm.org/rL317576 (Fix default cost model for cast op
in X86) landed. That change had the effect of calculating the cost of a
zext/sext fed by a load as 0 (due to CodeGen/BasicTTIImpl.h lines 561-568 --
"If this is a zext/sext of a load, return 0 if the corresponding extending load
exists on target"). As a result, the vectorized loops in this IR end up
8 elements wide instead of 16, resulting in about half the throughput.
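
To make the quoted rule concrete, here is a toy Python model of it. This is only a sketch, not the code in BasicTTIImpl.h: `cast_cost`, its parameters, and `fallback_cost` are hypothetical names chosen for illustration, and the real implementation queries the target for extending-load legality rather than taking a boolean.

```python
def cast_cost(opcode, operand_is_load, ext_load_is_legal, fallback_cost):
    """Toy model of the quoted rule; not the actual LLVM logic."""
    # A zext/sext whose operand is a load can fold into an extending load
    # when the target supports one, so the cast itself is modeled as free.
    if opcode in ("zext", "sext") and operand_is_load and ext_load_is_legal:
        return 0
    # Any other cast keeps whatever cost the model would normally assign
    # (passed in here as a hypothetical fallback_cost).
    return fallback_cost
```

Under this model, a sext of a load costs 0 whenever the extending load is legal, which lowers the estimated cost of any VF whose loop body contains such a cast.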

The obvious fix -- changing the vectorizer to choose the larger VF when
costs are the same -- does fix our issue, but it causes two tests to fail:
 Transforms/LoopVectorize/X86/avx1.ll
 Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
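
The tie-breaking change can be sketched with a toy selection loop in Python. This is an illustration, not the vectorizer's code: `select_vf` and `prefer_larger_on_tie` are hypothetical names, the cost values are made up, and the sketch assumes candidates are visited in increasing VF order and replaced only on a strict improvement unless the flag is set.

```python
def select_vf(costs, prefer_larger_on_tie=False):
    """Pick a VF from a dict mapping VF -> estimated cost per scalar iteration."""
    best_vf = None
    for vf in sorted(costs):  # visit candidates from smallest to largest VF
        if best_vf is None:
            best_vf = vf
        elif costs[vf] < costs[best_vf]:
            # Strictly cheaper: always switch to the new candidate.
            best_vf = vf
        elif prefer_larger_on_tie and costs[vf] == costs[best_vf]:
            # Equal cost: switch only if ties should go to the wider VF.
            best_vf = vf
    return best_vf


# Hypothetical costs in which VF 8 and VF 16 tie.
example_costs = {1: 10.0, 8: 2.0, 16: 2.0}
```

With these made-up costs, `select_vf(example_costs)` keeps VF 8, while `select_vf(example_costs, prefer_larger_on_tie=True)` moves to VF 16. The same flag also turns a three-way tie at VFs 1, 2, and 4 from a scalar choice into VF 4, which is the kind of behavior change the two tests catch.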

I'm filing this bug so that someone more knowledgeable about loop vectorization
on x86 can chime in with a suggested way forward.

For avx1.ll, the loop in @read_mod_i64 has the same cost at VFs 2 and 4, so
the change would select VF 4 instead of 2. The test seems to indicate
that this is undesirable with slow-unaligned-mem-32.

For fp64_to_uint32-cost-model.ll, the loop again has the same cost at VFs 1, 2,
and 4. However, the test indicates a preference for a scalarized loop in this
case.

I don't know the nuances of x86 vectorization heuristics well enough to know
whether these two failing tests are invariants that should be addressed by the
cost model. It does seem sensible to me to desire the widest possible vector,
so perhaps there are deficiencies in the cost model that would have to be
addressed?

Quuxplusone commented 6 years ago

Attached same-cost-vf.ll (2894 bytes, text/plain): IR to demonstrate

Quuxplusone commented 6 years ago

> resulting in about half the throughput.

This means the cost modeling is way off. The "cost" the vectorizer prints is the cost per scalar iteration, so when two VFs report the same cost, the estimated cost of each vectorized iteration at VF 4 is twice the estimated cost at VF 2.
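
The arithmetic behind that statement can be worked through in a few lines of Python. The function names and the printed cost of 3.0 are hypothetical, chosen only to show the relationship between the per-scalar-iteration cost and the cost of one vectorized iteration.

```python
def vector_iteration_cost(cost_per_scalar_iteration, vf):
    """Estimated cost of one vectorized iteration covering vf elements."""
    return cost_per_scalar_iteration * vf


def estimated_cost_per_element(vector_cost, vf):
    """Convert a per-vector-iteration cost back to a per-element cost."""
    return vector_cost / vf


# Hypothetical case: the vectorizer prints the same cost, 3.0, at VF 2 and VF 4.
printed_cost = 3.0
cost_vf2 = vector_iteration_cost(printed_cost, 2)  # 6.0 per vector iteration
cost_vf4 = vector_iteration_cost(printed_cost, 4)  # 12.0 per vector iteration
```

In this example the VF 4 iteration is estimated at twice the cost of the VF 2 iteration but covers twice as many elements, so the model predicts identical per-element throughput. A measured 2x throughput difference between two VFs that tie therefore points at the cost estimates rather than at the tie-break alone.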

From the debug output, maybe the cost of the "sext" isn't getting computed correctly?

Quuxplusone commented 6 years ago

Attached with-rL317576.out (29025 bytes, text/plain): LV debug & output with rL317576

Quuxplusone commented 6 years ago

Attached without-rL317576.out (29543 bytes, text/plain): LV debug & output without rL317576

Quuxplusone commented 6 years ago

Attached debug-diff.out (2458 bytes, text/plain): Diff of the LV debug trace with & without rL317576

Quuxplusone commented 6 years ago

ping. Anyone looking at this?