ARM-software / acle

Arm C Language Extensions (ACLE)

[BUG] Missing intrinsics for AArch32 instructions VMLA.F16 and VMLS.F16 #216

Open Maratyszcza opened 1 year ago

Maratyszcza commented 1 year ago

Alongside VFMA.F16/VFMS.F16, AArch32 offers the VMLA.F16/VMLS.F16 instructions, which perform a multiply-add operation with intermediate rounding. Importantly, on AArch32 the vector-by-vector-lane form (e.g. VMLA.F16 Qd, Qn, Dm[x]) is supported only for the VMLA/VMLS instructions, not for the VFMA/VFMS instructions.
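To illustrate what "intermediate rounding" means here, the following portable C sketch contrasts an unfused multiply-add (product rounded before the addition, as VMLA does) with a fused one (a single rounding at the end, as VFMA does). It is only an analogy: it uses single precision rather than F16, because standard C has no portable half-precision type. The function names `mla_unfused` and `mla_fused_emulated` are hypothetical and are not ACLE intrinsics, and the fused variant is emulated in double precision rather than executed as a real fused instruction.

```c
#include <stdio.h>

/* Multiply-add with intermediate rounding (VMLA-style behaviour):
   the product is rounded to float before it is added to acc. */
float mla_unfused(float acc, float a, float b) {
    /* volatile forces the product to be stored as a float, so the
       compiler cannot contract the expression into a fused operation. */
    volatile float product = a * b;
    return acc + product;
}

/* Fused multiply-add (VFMA-style behaviour), emulated for illustration:
   the product of two floats is exact in double, so rounding to float
   happens only once the sum has been formed. This is an assumption-laden
   stand-in, not a bit-exact model of a hardware FMA in every case. */
float mla_fused_emulated(float acc, float a, float b) {
    return (float)((double)a * (double)b + (double)acc);
}

/* Prints both results for inputs where the two behaviours differ. */
void print_rounding_demo(void) {
    float a = 1.0f + 0x1p-13f;   /* exactly representable in float */
    float b = 1.0f - 0x1p-13f;   /* exactly representable in float */
    float acc = -1.0f;
    /* The exact product is 1 - 2^-26, which is not a float:
       the unfused path rounds it to 1.0 and loses the low-order term. */
    printf("unfused: %a\n", mla_unfused(acc, a, b));
    printf("fused:   %a\n", mla_fused_emulated(acc, a, b));
}
```

For these inputs the unfused path returns exactly 0, while the emulated fused path returns -2^-26, so the two instruction families are not interchangeable when bit-exact results matter.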

The NEON intrinsics specification lacks intrinsics for the VMLA.F16/VMLS.F16 instructions. In particular, this makes it impossible to achieve peak performance on half-precision matrix-matrix multiplication on AArch32 using NEON intrinsics, because the optimal implementation would use the VMLA.F16 Qd, Qn, Dm[x] instruction.

I request that the NEON intrinsics specification be updated to include the following intrinsics for AArch32:

vhscampos commented 1 year ago

Hi @Maratyszcza, thanks for your issue report, and apologies for the late response.

If possible, we encourage you to contribute a pull request that addresses this issue. We will be happy to review it.