Inefficient code for fp16 vectors #27221

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR27222
Status NEW
Importance P normal
Reported by Pirama Arumuga Nainar (pirama@google.com)
Reported on 2016-04-05 13:25:29 -0700
Last modified on 2020-03-21 08:56:36 -0700
Version trunk
Hardware PC Linux
CC ahmed@bougacha.org, anton@korobeynikov.info, llvm-bugs@lists.llvm.org, srhines@google.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR23531
We generate inefficient code for half vectors on some architectures.  Consider
the following IR:

define void @add_h(<4 x half>* %a, <4 x half>* %b) {
entry:
  %x = load <4 x half>, <4 x half>* %a, align 8
  %y = load <4 x half>, <4 x half>* %b, align 8
  %0 = fadd <4 x half> %x, %y
  store <4 x half> %0, <4 x half>* %a
  ret void
}

LLVM currently splits and scalarizes such vectors.  In other words, it splits the <4 x half>
into four individual half values and operates on each of them separately.  This prevents the
backend from selecting vector load and vector conversion instructions.  The generated code
has repeated 16-bit loads, conversions to fp32, additions, conversions back to fp16, and
16-bit stores.
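
Conceptually, the scalarized lowering looks roughly like the IR below (a sketch only;
the value names are hypothetical, and the splitting actually happens during type
legalization in the backend rather than as explicit IR):

  %x0 = extractelement <4 x half> %x, i32 0
  %y0 = extractelement <4 x half> %y, i32 0
  %x0.f = fpext half %x0 to float
  %y0.f = fpext half %y0 to float
  %s0 = fadd float %x0.f, %y0.f
  %s0.h = fptrunc float %s0 to half
  ; ...and likewise for lanes 1, 2 and 3, followed by four scalar stores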

Here's the code generated for ARM32:
        ldrh    r4, [r1, #6]
        ldrh    r3, [r0, #6]
        ldrh    r12, [r1]
        ldrh    r2, [r0, #4]
        ldrh    lr, [r0, #2]
        vmov    s0, r4
        ldrh    r4, [r1, #2]
        ldrh    r1, [r1, #4]
        vmov    s2, r3
        ldrh    r3, [r0]
        vmov    s6, r2
        vmov    s10, lr
        vmov    s12, r12
        vcvtb.f32.f16   s0, s0
        vcvtb.f32.f16   s2, s2
        vadd.f32        s0, s2, s0
        vmov    s4, r1
        vmov    s8, r4
        vmov    s14, r3
        vcvtb.f32.f16   s4, s4
        vcvtb.f32.f16   s6, s6
        vcvtb.f32.f16   s2, s8
        vcvtb.f32.f16   s8, s10
        vcvtb.f32.f16   s10, s12
        vcvtb.f32.f16   s12, s14
        vcvtb.f16.f32   s0, s0
        vadd.f32        s4, s6, s4
        vadd.f32        s2, s8, s2
        vadd.f32        s6, s12, s10
        vmov    r1, s0
        vcvtb.f16.f32   s4, s4
        vcvtb.f16.f32   s0, s2
        vcvtb.f16.f32   s2, s6
        strh    r1, [r0, #6]
        vmov    r1, s4
        strh    r1, [r0, #4]
        vmov    r1, s0
        strh    r1, [r0, #2]
        vmov    r1, s2
        strh    r1, [r0]

In comparison, the same code gets translated to the following for AArch64:
        ldr             d0, [x1]
        ldr             d1, [x0]
        fcvtl   v0.4s, v0.4h
        fcvtl   v1.4s, v1.4h
        fadd    v0.4s, v1.4s, v0.4s
        fcvtn   v0.4h, v0.4s
        str             d0, [x0]
        ret
.Lfunc_end0:

This happens for architectures whose LLVM backends don't natively support
half (such as x86, x86_64, and ARM32).
Quuxplusone commented 8 years ago

What would you expect to be generated on ARM32 then? fp16 is a storage-only type there.

Quuxplusone commented 8 years ago

Hi Anton, fp16 is a storage-only type and LLVM already performs operations on fp16 data by promoting it to fp32. For ARM32 with NEON and the 'half' feature, it'd be more efficient to use the vector variant of VCVT (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/Bcfjicfj.html) instead of the scalar variant. Doing so would produce code similar to the AArch64 output in my initial comment.
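
For illustration, promoting the whole vector rather than each lane maps directly onto those
vector conversions.  This is only a sketch; the NEON instructions and register choices in the
comments are what I'd expect, not actual compiler output:

  %x.f = fpext <4 x half> %x to <4 x float>      ; vcvt.f32.f16 q0, d0
  %y.f = fpext <4 x half> %y to <4 x float>      ; vcvt.f32.f16 q1, d1
  %s = fadd <4 x float> %x.f, %y.f               ; vadd.f32 q0, q1, q0
  %r = fptrunc <4 x float> %s to <4 x half>      ; vcvt.f16.f32 d0, q0

Together with a 64-bit vector load and store, that would be essentially the ARM32 counterpart
of the AArch64 sequence above.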

Quuxplusone commented 8 years ago

The link I pasted in the previous comment doesn't go directly to the instruction's reference page. See Section 5.44 in http://infocenter.arm.com/help/topic/com.arm.doc.dui0489i/DUI0489I_arm_assembler_reference.pdf.