WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

Proposal to add mul 32x32=64 #175

Closed · jan-wassenberg closed this issue 3 years ago

jan-wassenberg commented 4 years ago

I haven't seen _mm_mul_epu32 = PMULUDQ mentioned here. This is important for crypto/hashing, and a quick search on GitHub turned up uses in poly1305, QNNPACK, Skia, and HighwayHash.

The proposed semantics are: u64x2 r = mul_even(u32x4 a, u32x4 b), where

r[i] = u64(a[2*i]) * b[2*i]
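
For clarity, here is a scalar reference of those semantics (an illustrative sketch, not proposed spec text):

#include <cstdint>

// Scalar reference for the proposed mul_even: each 64-bit output lane is the
// full product of the corresponding even 32-bit lanes of the inputs.
void mul_even_u32x4(const uint32_t a[4], const uint32_t b[4], uint64_t r[2]) {
  for (int i = 0; i < 2; ++i) {
    r[i] = uint64_t(a[2 * i]) * uint64_t(b[2 * i]);
  }
}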

MULE on PPC has the same semantics as PMULUDQ on x86. ARM can use vmull_u32 after two vuzp1q_u32.

Does this seem like a useful operation to add?

ngzhian commented 4 years ago

I am supportive of adding this. (V8 uses this in a bunch of places, e.g. in x64 we use pmuludq for i64x2mul, so this proposal will expose the existing instruction.)
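
(For illustration only, not V8's actual code: a rough sketch of the standard decomposition of a 64x64-bit lane multiply into PMULUDQ partial products.)

#include <emmintrin.h>

// Sketch: i64x2.mul lowered with SSE2. Split each 64-bit lane into 32-bit
// halves; a*b mod 2^64 = lo*lo + ((lo*hi + hi*lo) << 32).
__m128i I64x2Mul(__m128i a, __m128i b) {
  __m128i a_hi = _mm_srli_epi64(a, 32);           // high 32 bits of each lane
  __m128i b_hi = _mm_srli_epi64(b, 32);
  __m128i lo_lo = _mm_mul_epu32(a, b);            // a_lo * b_lo, full 64 bits
  __m128i cross = _mm_add_epi64(_mm_mul_epu32(a_hi, b),   // a_hi * b_lo
                                _mm_mul_epu32(a, b_hi));  // a_lo * b_hi
  return _mm_add_epi64(lo_lo, _mm_slli_epi64(cross, 32));
}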

jan-wassenberg commented 4 years ago

For completeness, here's a proposed ARM implementation:

// Multiplies even lanes (0, 2) and places the lower half of each double-wide
// result into the even lane and the upper half into its odd neighbor lane.
Vec128<int64_t> MulEven(const Vec128<int32_t> a,
                        const Vec128<int32_t> b) {
  int32x4_t a_packed = vuzp1q_s32(a.raw, a.raw);  // {a0, a2, a0, a2}
  int32x4_t b_packed = vuzp1q_s32(b.raw, b.raw);  // {b0, b2, b0, b2}
  return Vec128<int64_t>(
      vmull_s32(vget_low_s32(a_packed), vget_low_s32(b_packed)));
}
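
For comparison, the x64 side is a single instruction. A hypothetical counterpart in the same style (assuming the same Vec128 wrapper over an SSE2 __m128i, with unsigned lanes to match PMULUDQ):

#include <emmintrin.h>

// Hypothetical x64 counterpart: PMULUDQ already has the proposed mul_even
// semantics for unsigned lanes, so the whole operation is one instruction.
Vec128<uint64_t> MulEven(const Vec128<uint32_t> a,
                         const Vec128<uint32_t> b) {
  return Vec128<uint64_t>(_mm_mul_epu32(a.raw, b.raw));
}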
dtig commented 4 years ago

Thanks @jan-wassenberg for adding mappings to x64 and AArch64 instructions. This is possibly a useful operation, but the ARM sequence in this case is somewhat unfortunate: concretely, on AArch64 this would map to 5 instructions - 2 UZP1 Vd.4S,Vn.4S,Vm.4S instructions for the vuzp1q_s32 intrinsics, 2 DUP Vd.1D,Vn.D[0] instructions for the vget_low_s32 intrinsics, and an SMULL Vd.2D,Vn.2S,Vm.2S for the vmull_s32 intrinsic.

For ARMv7, the code sequence is somewhat fuzzy: though the DUP and SMULL instructions are supported on v7, the UZP1 instruction AFAIK doesn't have a direct mapping. Do you happen to know what the sequence for this would be on 32-bit ARM architectures?

rrwinterton commented 4 years ago

So @jan-wassenberg @dtig, what you would like to have is a register instruction, as opposed to the load-extend instruction proposed in the Introduce Load and Extend issue #98? It depends on what you are doing, I guess. If you have to load anyway, you could keep it in a register; it also depends on how many values you want and whether you start to spill. I'm probably missing something.

jan-wassenberg commented 4 years ago

@dtig You're welcome! I believe the DUPs can be elided? https://gcc.godbolt.org/z/zYJGPJ
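
For illustration, one way to gather the even lanes without the DUPs (an alternative sketch, not the code the compiler emits for the link above) is to reinterpret the vector as two 64-bit lanes and narrow with XTN:

#include <arm_neon.h>

// Sketch: XTN keeps the low 32 bits of each 64-bit lane, i.e. the even
// 32-bit lanes on little-endian, so two XTNs plus one SMULL suffice.
int64x2_t MulEvenAlt(int32x4_t a, int32x4_t b) {
  int32x2_t a_even = vreinterpret_s32_u32(vmovn_u64(vreinterpretq_u64_s32(a)));
  int32x2_t b_even = vreinterpret_s32_u32(vmovn_u64(vreinterpretq_u64_s32(b)));
  return vmull_s32(a_even, b_even);
}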

@rrwinterton The proposal is aimed at multiplication; it can be independent of load-extend. We're trying to expose pmuludq because IIRC only AVX3 has full 64x64-bit multiplication.

dtig commented 4 years ago

@jan-wassenberg Not that I can see for pre-ARMv8; am I missing something?

I would like to hear more opinions about this, and perhaps see some links to code in the wild if possible. Given the generally positive interest, I would not be opposed to prototyping and benchmarking this as well, considering most of the concerns about performance are from me.

jan-wassenberg commented 4 years ago

@dtig Ah yes, looks like those are A64 instructions; I am not very familiar with v7. Here are some users of the VMULL/PMULUDQ instructions:
https://github.com/google/skia/blob/master/include/private/SkNx_neon.h
https://github.com/google/highwayhash/blob/master/highwayhash/hh_neon.h
https://github.com/floodyberry/poly1305-opt/blob/master/app/extensions/poly1305/poly1305_sse2-64.inc

dtig commented 4 years ago

Thanks @jan-wassenberg. Originally we planned to optimize this subset of operations just for Neon support, but as more engines/applications have come up, it does look like they care about performance on some subset of older ARM devices, especially on mobile. Most of the operations till now have had fairly reasonable codegen, so this hasn't been an issue, but with some of the newer proposed operations it's less clear how much we should index on this particular aspect. To be consistent across architectures, though, it would make sense to bias this towards being future facing. I'll take an AI (action item) to get a more accurate sense for this and either open an issue or respond here.

bjacob commented 4 years ago

As a tangent, I would like to mention that for fixed-point calculations it would be useful to also have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH. I mention this here because some may think that the availability of a 32x32=64 integer multiplication would remove the need for that, but that would be sub-optimal: staying within 32 bits means doing 4 scalar operations per 128-bit vector operation, and most applications want the rounding flavor (SQRDMULH, not SQDMULH), which would require a few more instructions to emulate if the instruction is missing. In practice, that would result in applications making compromises between accuracy and performance.

(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library, see e.g. the usage of these instructions here, https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )
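
For concreteness, a scalar sketch of the per-lane SQRDMULH semantics (16-bit form, my reading of the ARM documentation, illustrative only); the doubling, rounding add, shift, and saturation are exactly the extra steps a plain 32x32=64 multiply would not cover:

#include <cstdint>

// Rounding, doubling, saturating high multiply of two Q15 values.
int16_t Sqrdmulh(int16_t a, int16_t b) {
  int64_t product = 2 * int64_t(a) * int64_t(b) + (1 << 15);  // double, round
  int64_t result = product >> 16;                             // keep high part
  // Only -32768 * -32768 overflows; saturate it to INT16_MAX.
  if (result > INT16_MAX) result = INT16_MAX;
  return static_cast<int16_t>(result);
}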

bjacob commented 4 years ago

Branched into Issue #221.

Maratyszcza commented 3 years ago

32x32->64 multiplication was added in #376; the related fixed-point multiplication instruction (Q15 format) was added in #365.

ngzhian commented 3 years ago

Closing since we have 32x32=64, as Marat pointed out.