[BUG] Missing intrinsics for AArch32 instructions VMLA.F16 and VMLS.F16

Alongside VFMA.F16/VFMS.F16, AArch32 offers the VMLA.F16/VMLS.F16 instructions, which perform a multiply-add (or multiply-subtract) with intermediate rounding of the product. Importantly, on AArch32 the vector-by-lane form (e.g. VMLA.F16 Qd, Qn, Dm[x]) is supported only for the VMLA/VMLS instructions, not for VFMA/VFMS.

The NEON intrinsics specification lacks intrinsics for the VMLA.F16/VMLS.F16 instructions. In particular, this makes it impossible to achieve peak performance on half-precision matrix-matrix multiplication in AArch32 using NEON intrinsics, because the optimal implementation would use the VMLA.F16 Qd, Qn, Dm[x] instruction.
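For context, the inner loop of such a matrix multiplication is a sequence of by-lane multiply-accumulates. The sketch below is a scalar model only: `model_vmlaq_lane` and `model_gemm_8x4` are hypothetical names, `float` stands in for `float16_t`, and the 8x4 tile shape and packed operand layout are assumptions made for illustration. On real hardware each call to `model_vmlaq_lane` would ideally map to a single VMLA.F16 Qd, Qn, Dm[x].

```c
#include <stddef.h>

enum { QLANES = 8, DLANES = 4 };  /* f16 lanes in a Q and a D register */

/* Scalar model of a by-lane multiply-accumulate:
 * acc[i] += a[i] * b[lane], with the product rounded before the add. */
void model_vmlaq_lane(float acc[QLANES], const float a[QLANES],
                      const float b[DLANES], int lane) {
    for (int i = 0; i < QLANES; i++) {
        volatile float prod = a[i] * b[lane];  /* intermediate rounding */
        acc[i] += prod;
    }
}

/* 8x4 micro-kernel: c[j][i] += sum over p of a[p*8 + i] * b[p*4 + j].
 * Each step of the k loop loads one Q-sized slice of A and one D-sized
 * slice of B, then issues one by-lane accumulate per lane of B. */
void model_gemm_8x4(size_t k, const float *a, const float *b,
                    float c[DLANES][QLANES]) {
    for (size_t p = 0; p < k; p++) {
        for (int j = 0; j < DLANES; j++) {
            model_vmlaq_lane(c[j], a + p * QLANES, b + p * DLANES, j);
        }
    }
}
```

Without a by-lane form, each lane of B would first have to be broadcast into a full vector, which costs an extra instruction per accumulate.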

I request that the NEON intrinsics specification be updated to include the following intrinsics for AArch32:

vmla_f16 (VMLA.F16 Dd, Dn, Dm)
vmls_f16 (VMLS.F16 Dd, Dn, Dm)
vmlaq_f16 (VMLA.F16 Qd, Qn, Qm)
vmlsq_f16 (VMLS.F16 Qd, Qn, Qm)
vmla_lane_f16 (VMLA.F16 Dd, Dn, Dm[x])
vmls_lane_f16 (VMLS.F16 Dd, Dn, Dm[x])
vmlaq_lane_f16/vmlaq_laneq_f16 (VMLA.F16 Qd, Qn, Dm[x])
vmlsq_lane_f16/vmlsq_laneq_f16 (VMLS.F16 Qd, Qn, Dm[x])

ARM-software / acle
