dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

ARM64 SVE: mask conversion not always optimised away #108241

Open a74nh opened 1 month ago

a74nh commented 1 month ago

With DOTNET_TieredCompilation=0

      [MethodImpl(MethodImplOptions.NoInlining)]
      public static unsafe int foo(ref int* src, int length)
      {
          Vector<int> total = new Vector<int>(0);
          Vector<int> pred = (Vector<int>)Sve.CreateWhileLessThanMask32Bit(0, length);
          Vector<int> vec = Sve.LoadVector(pred, src);
          total = Sve.ConditionalSelect(pred, Sve.Add(total, vec), total);
          return (int)Sve.AddAcross(total).ToScalar();
      }
G_M25815_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x20]!
            mov     fp, sp
            stp     xzr, xzr, [fp, #0x10]   // [V02 loc0], [V02 loc0+0x08]
                        ;; size=12 bbWeight=1 PerfScore 2.50
G_M25815_IG02:  ;; offset=0x000C
            str     xzr, [fp, #0x10]
            str     xzr, [fp, #0x18]
            mov     w2, wzr
            whilelt p0.s, w2, w1
            mov     z16.s, p0/z, #1
            ptrue   p0.s
            cmpne   p0.s, p0/z, z16.s, #0
            ldr     q16, [fp, #0x10]    // [V02 loc0]
            ldr     x0, [x0]
            ld1w    { z17.s }, p0/z, [x0]
            ldr     q18, [fp, #0x10]    // [V02 loc0]
            add     z16.s, z16.s, z17.s
            sel     z16.s, p0, z16.s, z18.s
            str     q16, [fp, #0x10]    // [V02 loc0]
            ldr     q16, [fp, #0x10]    // [V02 loc0]
            ptrue   p0.s
            saddv   d16, p0, z16.s
            umov    x0, v16.d[0]
                        ;; size=72 bbWeight=1 PerfScore 39.50
G_M25815_IG03:  ;; offset=0x0054
            ldp     fp, lr, [sp], #0x20
            ret     lr
                        ;; size=8 bbWeight=1 PerfScore 2.00

These three lines are not required; they convert the mask to a vector and then back to a mask:

            mov     z16.s, p0/z, #1
            ptrue   p0.s
            cmpne   p0.s, p0/z, z16.s, #0

I suspect this is because there are two uses of pred: one in the conditional select and one in the load vector.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

a74nh commented 1 month ago
      [MethodImpl(MethodImplOptions.NoInlining)]
      public static unsafe int foo(ref int* src, int length)
      {
          Vector<int> pred = (Vector<int>)Sve.CreateWhileLessThanMask32Bit(0, length);
          Vector<int> vec = Sve.LoadVector(pred, src);
          return (int)Sve.AddAcross(vec).ToScalar();
      }

After reducing the test case down, the conversion vanishes.

tannergooding commented 1 month ago

This is because Vector<int> pred produces a TYP_SIMD local and so that's what gets CSE'd.

We still need to add a step to the JIT that identifies cases of STORELCL(TYP_SIMD, ConvertMaskToVector(x)) and transforms it instead to STORELCL(TYP_MASK, x) and replaces usages with ConvertMaskToVector(lcl); or something to that effect.
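In IR terms, the transform would look roughly like this (a pseudocode sketch; the exact node and type names are approximate, and the tree shapes are illustrative rather than actual JIT dumps):

```
// Before: the mask is converted to a vector at the def,
// and every use reads the TYP_SIMD local directly.
STORE_LCL_VAR<TYP_SIMD16> V02
 \--ConvertMaskToVector
     \--CreateWhileLessThanMask32Bit(0, length)   // TYP_MASK

...
LCL_VAR<TYP_SIMD16> V02        // each use site

// After: the local is retyped to TYP_MASK and the
// conversion moves to any use that still needs a vector.
STORE_LCL_VAR<TYP_MASK> V02
 \--CreateWhileLessThanMask32Bit(0, length)

...
ConvertMaskToVector
 \--LCL_VAR<TYP_MASK> V02      // each use site
```

With the def kept as TYP_MASK, uses that actually consume a mask (such as the governing predicate of a load or conditional select) can take the local directly, and the round-trip through a vector disappears.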

tannergooding commented 1 month ago

If you were to change the code to instead be:

      [MethodImpl(MethodImplOptions.NoInlining)]
      public static unsafe int foo(ref int* src, int length)
      {
          Vector<int> total = new Vector<int>(0);
          Vector<int> vec = Sve.LoadVector(Sve.CreateWhileLessThanMask32Bit(0, length).AsInt32(), src);
          total = Sve.ConditionalSelect(Sve.CreateWhileLessThanMask32Bit(0, length).AsInt32(), Sve.Add(total, vec), total);
          return (int)Sve.AddAcross(total).ToScalar();
      }

Then I expect it would also be fixed, as the CreateWhileLessThanMask32Bit would be CSE'd instead and produce a TYP_MASK local.

We should still add the right step to the JIT to ensure the right CSE happens, however.

a74nh commented 1 month ago

Also:

      i += Sve.GetActiveElementCount(pz, pz);

becomes:

            ptrue   p2.b
            cmpne   p1.b, p2/z, z16.b, #0
            ptrue   p0.b
            cmpne   p0.b, p0/z, z16.b, #0
            cntp    x0, p1, p0.b
            add     x0, x0, x1

Both args to the cntp should use the same predicate instead of doing the conversion twice.
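If the two conversions were CSE'd into one (or elided entirely when the predicate is already live in a register), the expected codegen would be something like the following sketch (register assignments are illustrative):

```asm
        ptrue   p1.b
        cmpne   p0.b, p1/z, z16.b, #0   // a single mask -> predicate conversion
        cntp    x0, p0, p0.b            // both cntp operands use the same predicate
        add     x0, x0, x1
```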

tannergooding commented 1 month ago

That's likely because SVE isn't using mask constants currently, so LSRA doesn't do the "is this constant already in a register" checks.

a74nh commented 2 weeks ago

This is because Vector<int> pred produces a TYP_SIMD local and so that's what gets CSE'd.

We still need to add a step to the JIT that identifies cases of STORELCL(TYP_SIMD, ConvertMaskToVector(x)) and transforms it instead to STORELCL(TYP_MASK, x) and replaces usages with ConvertMaskToVector(lcl); or something to that effect.

Switching the STORELCL is fairly easy.

Finding the uses in order to remove the ConvertMaskToVector is a little trickier, depending on when this is done.

tannergooding commented 2 weeks ago

I believe the general idea is that we want to do this as its own pass after local morph, as per https://github.com/dotnet/runtime/pull/99608#discussion_r1523730454

a74nh commented 2 weeks ago

See Statement::LocalsTreeList. It allows you to quickly check whether a statement contains a local you are interested in.

@jakobbotsch, picking up on your suggestion in 99608.

I'm using Statement::LocalsTreeList. This requires fgNodeThreading to be AllLocals.

When I find the ConvertVectorToMask/ConvertMaskToVector nodes, I remove them by bashing the node to GT_NOP and rewiring the parent to the child.

Then I need to fix up the next/prev pointers. In previous PRs I did this using fgSetStmtSeq(); however, that requires fgNodeThreading to be AllTrees.

Any suggestions for how I should be fixing that up?

jakobbotsch commented 2 weeks ago

fgSequenceLocals does the same as fgSetStmtSeq for locals linking.
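Putting the suggestions in this thread together, the shape of the pass might look roughly like this pseudocode sketch. Statement::LocalsTreeList, GT_NOP, and fgSequenceLocals are the members named above; every other helper here is hypothetical and exists only to illustrate the flow:

```
// Assumes fgNodeThreading == AllLocals (required for LocalsTreeList).
for (BasicBlock* block : Blocks())
{
    for (Statement* stmt : block->Statements())
    {
        bool modified = false;
        // LocalsTreeList quickly tells us whether this statement
        // touches a local that was retyped from TYP_SIMD to TYP_MASK.
        for (GenTreeLclVarCommon* lcl : stmt->LocalsTreeList())
        {
            if (!IsRetypedMaskLocal(lcl))                   // hypothetical helper
                continue;
            GenTree* convert = FindConvertAboveUse(stmt, lcl); // hypothetical helper
            if (convert != nullptr)
            {
                // Rewire the parent to consume the mask local directly,
                // then bash the now-dead conversion node to GT_NOP.
                RewireParentToChild(convert, lcl);          // hypothetical helper
                BashToNop(convert);                         // "bash to GT_NOP" from above
                modified = true;
            }
        }
        if (modified)
        {
            // Re-link the locals-only node list; unlike fgSetStmtSeq,
            // this does not require fgNodeThreading == AllTrees.
            fgSequenceLocals(stmt);
        }
    }
}
```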