WebAssembly / relaxed-simd

Relax the strict determinism requirements of SIMD operations.

Relaxed BFloat16 Dot Product instruction #77

Open Maratyszcza opened 2 years ago

Maratyszcza commented 2 years ago

Introduction

BFloat16 is a 16-bit floating-point format that represents an IEEE FP32 number truncated to its high 16 bits. BFloat16 numbers have the same exponent range as IEEE FP32 numbers but far fewer mantissa bits, and the format is primarily used in neural network computations, which are very tolerant of reduced precision. Despite being a non-IEEE format, BFloat16 computations are supported in recent and upcoming processors from Intel (Cooper Lake, Alder Lake, Sapphire Rapids), AMD (Zen 4), and ARM (Cortex-A510, Cortex-A710, Cortex-X2), and can be efficiently emulated on older hardware by zero-padding BFloat16 numbers to convert them to IEEE FP32.
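
For illustration, a minimal C helper for this zero-padding emulation (the function name is illustrative only, not part of any API):

#include <stdint.h>
#include <string.h>

/* Widen a BFloat16 value (raw 16-bit encoding) to FP32 by placing it in the
   high 16 bits of the FP32 encoding and zero-filling the low 16 bits. */
static inline float bf16_to_fp32(uint16_t bf16_bits) {
    uint32_t fp32_bits = (uint32_t) bf16_bits << 16;
    float result;
    memcpy(&result, &fp32_bits, sizeof(result));
    return result;
}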

Both the x86 and ARM BFloat16 extensions provide instructions that compute the FP32 dot product of two-element BFloat16 pairs and add it to an FP32 accumulator. However, there are some differences in the details:

x86 introduced support for BFloat16 computations with the AVX512-BF16 extension, which exposes the BFloat16 dot product functionality via the VDPBF16PS (Dot Product of BF16 Pairs Accumulated into Packed Single Precision) instruction. The instruction VDPBF16PS dest, src1, src2 computes

dest.fp32[i] = FMA(cast<fp32>(src1.bf16[2*i]), cast<fp32>(src2.bf16[2*i]), dest.fp32[i])
dest.fp32[i] = FMA(cast<fp32>(src1.bf16[2*i+1]), cast<fp32>(src2.bf16[2*i+1]), dest.fp32[i])

for each 32-bit lane i in the accumulator register dest. Additionally, denormalized numbers in inputs are treated as zeroes and denormalized numbers in outputs are replaced with zeroes.
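
For illustration, a thin C wrapper over the 128-bit form of this operation using the _mm_dpbf16_ps intrinsic (the wrapper name is illustrative; requires AVX512-BF16 and AVX512VL, e.g. compile with -mavx512bf16 -mavx512vl):

#include <immintrin.h>

/* One 128-bit step of VDPBF16PS: for each FP32 lane i,
   acc.fp32[i] += a.bf16[2*i] * b.bf16[2*i] + a.bf16[2*i+1] * b.bf16[2*i+1]. */
static inline __m128 dpbf16_accumulate(__m128 acc, __m128bh a, __m128bh b) {
    return _mm_dpbf16_ps(acc, a, b);
}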

ARM added support for BFloat16 computations with the BFDOT instruction, introduced as optional in ARMv8.2-A and mandatory from ARMv8.6-A, which implements the following computation:

temp.fp32[i] = cast<fp32>(src1.bf16[2*i]) * cast<fp32>(src2.bf16[2*i])
temp2.fp32[i] = cast<fp32>(src1.bf16[2*i+1]) * cast<fp32>(src2.bf16[2*i+1])
temp.fp32[i] = temp.fp32[i] + temp2.fp32[i]
dest.fp32[i] = dest.fp32[i] + temp.fp32[i]

where all multiplications and additions are unfused and use the non-IEEE Round-to-Odd rounding mode. Additionally, denormalized numbers in the inputs are treated as zeroes, and denormalized numbers in the outputs are replaced with zeroes.
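
For illustration, the corresponding ACLE intrinsic in C (the wrapper name is illustrative; requires a BF16-capable target, e.g. -march=armv8.2-a+bf16):

#include <arm_neon.h>

/* BFDOT via its ACLE intrinsic: accumulates the two-element BFloat16 dot
   product of each 32-bit lane pair into the corresponding FP32 lane of acc. */
static inline float32x4_t bfdot_accumulate(float32x4_t acc,
                                           bfloat16x8_t a, bfloat16x8_t b) {
    return vbfdotq_f32(acc, a, b);
}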

ARMv9.2-A introduced an additional "Extended BFloat16" (EBF16) mode, which modifies the behavior of BFDOT as follows:

temp.fp32[i] = cast<fp32>(src1.bf16[2*i] * src2.bf16[2*i] + src1.bf16[2*i+1] * src2.bf16[2*i+1])
dest.fp32[i] = dest.fp32[i] + temp.fp32[i]

where temp.fp32[i] is calculated as a fused sum-of-products operation with only a single (standard, Round-to-Nearest-Even) rounding at the end.

Furthermore, ARMv8.6-A introduced the BFMLALB/BFMLALT instructions, which perform a widening fused multiply-add of the even/odd BFloat16 elements into an IEEE FP32 accumulator. A pair of these instructions can similarly be used to implement the BFloat16 dot product primitive.
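
A minimal C sketch of this pairing via the vbfmlalbq_f32/vbfmlaltq_f32 intrinsics (the wrapper name is illustrative; requires a BF16-capable target, e.g. -march=armv8.2-a+bf16):

#include <arm_neon.h>

/* BFloat16 dot product built from the BFMLALB/BFMLALT pair: the even (bottom)
   elements are widened and fused-multiply-accumulated first, then the odd
   (top) elements. */
static inline float32x4_t bfmlal_dot_accumulate(float32x4_t acc,
                                                bfloat16x8_t a, bfloat16x8_t b) {
    acc = vbfmlalbq_f32(acc, a, b);   /* a.bf16[2*i]   * b.bf16[2*i]   */
    acc = vbfmlaltq_f32(acc, a, b);   /* a.bf16[2*i+1] * b.bf16[2*i+1] */
    return acc;
}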

What are the instructions being proposed?

I propose a two-element dot-product-with-accumulation instruction with BFloat16 input elements and FP32 accumulator and output elements. The instruction has relaxed semantics to allow lowering to the native VDPBF16PS (x86) and BFDOT (ARM) instructions, as well as to FMA instructions where available.

I suggest f32x4.relaxed_dot_bf16x8_add_f32x4 as a tentative name for the instruction. In a sense it is a floating-point equivalent of the i32x4.dot_i16x8_add_s instruction proposed in WebAssembly/simd#127.

What are the semantics of these instructions?

y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) computes

y.fp32[i] = c.fp32[i] + cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) + cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])
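
A scalar C model of these semantics, shown with one particular evaluation order (widen, multiply, then accumulate left to right); the helper names are illustrative, and the relaxations below permit other orderings:

#include <stdint.h>
#include <string.h>

/* Scalar model of one FP32 lane: c + a[0]*b[0] + a[1]*b[1], with each
   BFloat16 input widened to FP32 before multiplying. */
static inline float bf16_to_fp32(uint16_t bits) {
    uint32_t wide = (uint32_t) bits << 16;
    float f;
    memcpy(&f, &wide, sizeof(f));
    return f;
}

static float relaxed_dot_bf16_lane(const uint16_t a[2], const uint16_t b[2], float c) {
    return c + bf16_to_fp32(a[0]) * bf16_to_fp32(b[0])
             + bf16_to_fp32(a[1]) * bf16_to_fp32(b[1]);
}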

The relaxed nature of the instruction manifests in several allowable options:

Evaluation ordering options

We permit two evaluation orders for the computation: the two products may be accumulated into the FP32 accumulator one at a time (as VDPBF16PS does), or summed first and then added to the accumulator (as BFDOT does).

Fusion options

The multiplications and additions may be fused (rounded once per fused operation) or performed as separate, individually rounded operations.

Rounding options

Intermediate results may be rounded with the standard Round-to-Nearest-Even mode or with the non-IEEE Round-to-Odd mode used by ARM BFDOT.

How will these instructions be implemented?

x86/x86-64 processors with AVX512-BF16 instruction set

ARM64 processors with BF16 extension (using BFDOT instructions)

ARM64 processors with BF16 extension (using BFMLALB/BFMLALT instructions)

Reference lowering through the WAsm Relaxed SIMD instruction set
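
As a rough illustration of such a lowering (not the proposal's actual listing), a portable scalar C model that widens each BFloat16 element to FP32 and then applies two FMA steps per lane, in the spirit of f32x4.relaxed_fma; the function name and operand layout are assumptions of this sketch:

#include <math.h>
#include <stdint.h>
#include <string.h>

static inline float bf16_to_fp32(uint16_t bits) {
    uint32_t wide = (uint32_t) bits << 16;
    float f;
    memcpy(&f, &wide, sizeof(f));
    return f;
}

/* Per lane: widen the two BFloat16 pairs to FP32 and chain two fused
   multiply-adds into the accumulator, one possible relaxed evaluation order. */
static void relaxed_dot_bf16x8_add_f32x4(float y[4], const uint16_t a[8],
                                         const uint16_t b[8], const float c[4]) {
    for (int i = 0; i < 4; i++) {
        float acc = fmaf(bf16_to_fp32(a[2 * i]),     bf16_to_fp32(b[2 * i]),     c[i]);
        y[i]      = fmaf(bf16_to_fp32(a[2 * i + 1]), bf16_to_fp32(b[2 * i + 1]), acc);
    }
}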

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

The use of the AVX512-BF16 VDPBF16PS instruction can be detected by testing the behavior on denormal inputs and outputs. The use of the ARM BFDOT instruction can be detected by testing the behavior on denormal inputs and outputs, or by testing for Round-to-Odd rounding. However, an ARM implementation using a pair of BFMLALB/BFMLALT instructions would be indistinguishable from a generic lowering using FMA instructions (though probably slower than a BFDOT lowering).

What use cases are there?

ngzhian commented 2 years ago

@yurydelendik @dtig please add some engine implementor feedback on this instruction. specifically, this instruction requires AVX512-BF16 extension on the x86 side, and ARMv8.6-A with BF16 extension on the ARM side. Please comment on possible support for these instructions.

yurydelendik commented 2 years ago

(Typo to fix: in reference lowering code y = f32x4.relaxed_fma(a_lo, b_lo, y) -> y = f32x4.relaxed_fma(a_lo, b_lo, c))

abrown commented 2 years ago

I wanted to also comment on what we discussed about bf16 conversions to and from fp32 (correct me if anything is incorrect): @Maratyszcza is proposing that the native conversion instructions (e.g., VCVTNE2PS2BF16) not be included in relaxed SIMD because the conversion can be done by padding the bf16 with zeroes to fill the bottom 16 bits of the fp32 (or truncating, for the opposite-direction conversion), since this has the same semantics as round-to-zero. @Maratyszcza made the point that these conversions, even though lossy, are rare in the domain he works in (?).

Maratyszcza commented 2 years ago

BFloat16 numbers are usually used only for storage and are up-cast to IEEE FP32 for computations. The proposed f32x4.relaxed_dot_bf16x8_add_f32x4 instruction likewise performs its internal computations in FP32 and produces an FP32 result. Only the final result needs to be down-cast to BFloat16 for storing to memory; this downcast can be done by truncating the FP32 number to its high 16 bits, which is sufficiently accurate if it happens relatively rarely, e.g. at operator boundaries in NN computations.
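
A minimal C sketch of this truncating downcast (function name illustrative):

#include <stdint.h>
#include <string.h>

/* Truncating FP32 -> BFloat16 downcast: keep the high 16 bits of the FP32
   encoding and drop the low 16 (round-toward-zero behavior). */
static inline uint16_t fp32_to_bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t) (bits >> 16);
}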

justinmichaud commented 2 years ago

I would like to add some implementer feedback from JSC/WebKit.

There is a lot of interesting feedback in this thread, and I have a few questions.

Some of this was mentioned in the meeting, but do we have any data about:

In general, we are concerned about instructions that might make deterministic fingerprinting much easier, and also about how they might cause compatibility/interoperability issues for users who browse the web on less-common CPU architectures.

An important part of the Relaxed-SIMD proposal for us in general is the fact that it justifies this risk with very compelling performance data. We would not consider implementing BFloat16 unless we have proof that this would speed up an entire class of reasonably popular programs.

Overall, since Relaxed-SIMD is in stage 3, we think that this change should be put in a separate proposal.

Maratyszcza commented 2 years ago

Thanks for the feedback @justinmichaud (and great to see JSC/WebKit being more active in WAsm SIMD)!

The performance/speedup of using this instruction, particularly on non-AVX512 intel chips?

Performance on x86 CPUs without AVX512-BF16 extension would be similar to software emulation of the f32x4.relaxed_dot_bf16x8_add_f32x4 with Relaxed SIMD instructions. It would still be faster than end-to-end FP32 processing if the computation is bandwidth-bound on the machine. Note that when BF16 format was originally introduced, it was handled purely in software; hardware support arrived years later.

On a Galaxy S22 with a Snapdragon 8 Gen 1 I see a throughput of 4 BFDOT/cycle (Cortex-X2 cores) / 2 BFDOT/cycle (Cortex-A710 cores) / 1 BFDOT/cycle (Cortex-A510 cores). This is the same throughput as for single-precision FMLA instructions, but each BFDOT instruction produces twice as many results, leading to twice the peak performance.

How often we expect this instruction to be useful? As in, how general is it?

This instruction is known to be helpful for neural network computations (both for inference and training). Background Effects in Google Meet and Teachable Machine are examples of applications using such functionality.

Are there any other mainstream programming languages or libraries that support it?

The native VDPBF16PS and BFDOT instructions are exposed in C/C++ as intrinsic functions _mm512_dpbf16_ps and vbfdotq_f32 (+variants of these). Among machine learning frameworks, both TensorFlow and PyTorch support BFloat16 data type (with emulation in lieu of hardware support).

we are concerned about instructions that might make deterministic fingerprinting much easier

Note that the implementation using BFMLALB/BFMLALT instructions on ARM is bitwise-compatible with the pre-BF16 emulation path.

Overall, since Relaxed-SIMD is in stage 3, we think that this change should be put in a separate proposal.

Unlike native platforms, WebAssembly doesn't have the capability to detect supported WebAssembly ISA extensions at run-time: if a WAsm engine doesn't support any instruction in a WAsm module, the whole module will be rejected, even if the instruction is never executed at runtime. Unfortunately, this means that WebAssembly extensions can't be as granular as native ISA extensions: developers have to rebuild the whole webapp for each set of WAsm extensions they target, and an extension with a single instruction in it is a burden on developers.

Please note that it is permissible to add new instructions at Stage 3 (WebAssembly SIMD added dozens of instructions while at Stage 3). Stage 4 is where the spec is frozen.

justinmichaud commented 2 years ago

Thanks for the fast response!

On a Galaxy S22 with a Snapdragon 8 Gen 1 I see a throughput of 4 BFDOT/cycle (Cortex-X2 cores) / 2 BFDOT/cycle (Cortex-A710 cores) / 1 BFDOT/cycle (Cortex-A510 cores). This is the same throughput as for single-precision FMLA instructions, but each BFDOT instruction produces twice as many results, leading to twice the peak performance.

Thanks! Is there a real benchmark or library that is available to be compiled with this instruction vs just the normal set of relaxed-simd instructions? The two examples that you linked seem interesting, but I don't see how they could be ported today.

I am also wondering if there is an implementation using BFMLALB/BFMLALT that is complete enough to collect real-world performance numbers, since that is the only implementation we would consider on ARM.

We really need to see data in the context of a full benchmark or application here, because on our CPUs instruction throughput doesn't usually indicate general performance.

Note that the implementation using BFMLALB/BFMLALT instructions on ARM is bitwise-compatible with pre-BF16 emulation path.

This is good to know. So if I am understanding things correctly, the current state of the world is that this is software emulated on every single chip that is available today except for ARMv8+, and ARMv8+ can only be distinguished from emulation by performance (which is fine)?

Please note that it is permissible to add new instructions at Stage 3 (WebAssembly SIMD added dozens of instructions while at Stage 3). Stage 4 is where the spec is frozen.

This makes sense, but I would just like to note that this instruction seems like a pretty big departure from the others in the proposal. In addition, it seems like there are almost no desktop CPUs that natively support this instruction yet, so maybe it is too soon to standardize?

Maratyszcza commented 2 years ago

Is there a real benchmark or library that is available to be compiled with this instruction vs just the normal set of relaxed-simd instructions?

Alibaba's MNN and Tencent's NCNN support BF16 computations (via software conversions to/from FP32), but AFAICT neither uses the ARM BF16 extension yet. It is also worth noting that native software targeting ARM BF16 is unlikely to use BFDOT or the BFMLALB + BFMLALT pair, because the BF16 extension also includes a more powerful BFloat16 matrix multiplication instruction, BFMMLA. The latter, however, is too exotic and hard to emulate to be exposed in WebAssembly.

My plan is to implement BF16 GEMM (matrix-matrix multiplication) kernels in XNNPACK using generic NEON, BFDOT, and the BFMLALB + BFMLALT pair to evaluate the potential of the f32x4.relaxed_dot_bf16x8_add_f32x4 instruction. GEMM is responsible for 50-80% of the runtime in convolutional neural networks (now common for Computer Vision tasks) and often 90%+ of the runtime in transformer-type neural networks (now common in Natural Language Processing and replacing CNNs for Computer Vision), so it is a good proxy for overall neural network inference performance.

So if I am understanding things correctly, the current state of the world is that this is software emulated on every single chip that is available today except for ARMv8+, and ARMv8+ can only be distinguished from emulation by performance (which is fine)?

This is correct. To be specific, BF16 extension on ARM was introduced as optional in ARMv8.2-A and became mandatory in ARMv8.6-A.

This makes sense, but I would just like to note that this instruction seems like a pretty big departure from the others in the proposal. In addition, it seems like there are almost no desktop CPUs that natively support this instruction yet, so maybe it is too soon to standardize?

I see little risk in adding this instruction, as BF16 extension is mandatory in ARMv8.6-A (and ARMv9), and thus eventually all ARM-based computers will have it. The latest Windows-based laptops (e.g. Lenovo ThinkPad X13s) already support BF16 extension. In the desktop x86 space, AMD Zen 4 is reported to support AVX512-BF16, so Intel will have to match to compete.

However, if we don't add this instruction, WebAssembly SIMD will miss out on a nearly 2X speedup for AI models, which would put privacy-focused technologies like Federated Learning at risk.

penzn commented 2 years ago

I see little risk in adding this instruction, as BF16 extension is mandatory in ARMv8.6-A (and ARMv9), and thus eventually all ARM-based computers will have it. The latest Windows-based laptops (e.g. Lenovo ThinkPad X13s) already support BF16 extension. In the desktop x86 space, AMD Zen 4 is reported to support AVX512-BF16, so Intel will have to match to compete.

To reiterate an important point to make sure it doesn't get lost in a long discussion: we need to hear back from engine maintainers before we conclude those instructions are available to the engines. Implementing support for a new ISA extension requires implementing new decode logic, operands, etc., which comes with maintenance costs. I have brought this up before and it did not look like there was interest (I personally would welcome this changing 😉).

yurydelendik commented 2 years ago

please add some engine implementor feedback on this instruction. specifically, this instruction requires AVX512-BF16 extension on the x86 side, and ARMv8.6-A with BF16 extension on the ARM side. Please comment on possible support for these instructions.

I opened the issue at https://bugzilla.mozilla.org/show_bug.cgi?id=1778751 with intent to implement the reference lowering in SpiderMonkey. We feel that there is a need to support BFloat16 for matrix multiplication / ML use cases. In the future we will adjust the solution for ARM (and AVX) BF16 extensions as they become popular.

yurydelendik commented 2 years ago

There is an inconsistency in argument ordering. For fma(a, b, c) we are doing result = a + b * c, but for relaxed_dot_bf16x8_add_f32(a, b, c) we changed the accumulator placement: result = dot(a, b) + c. Shall we keep a as the running sum?

(Also reference lowering has typo related to this)

Maratyszcza commented 2 years ago

@yurydelendik Good point! However, it is fma(a, b, c) that behaves inconsistently:

Let's have this discussion in the FMA proposal #27

Maratyszcza commented 2 years ago

I implemented BF16 GEMM microkernels using FMA, ARMv8.2 BF16 BFDOT, and ARMv8.2 BF16 BFMLAL[B/T] instructions in google/XNNPACK#3326 and benchmarked them on the Galaxy S22 Ultra phone with the Snapdragon 8 Gen 1 SoC. Below I present results on the M=192, N=512, K=512 GEMM from the MobileNet v1 architecture (this GEMM case typically exhibits the best achievable performance among MobileNet v1 GEMMs):

Version        Cortex-A510    Cortex-A710    Cortex-X2
NEON FMA       5.79 GFLOPS    17.5 GFLOPS    42.2 GFLOPS
BF16 BFMLALx   8.22 GFLOPS    30.5 GFLOPS    76.9 GFLOPS
BF16 BFDOT     13.1 GFLOPS    59.4 GFLOPS    122 GFLOPS

As expected, BFDOT delivers an enormous speedup over the version using FP32 NEON FMA instructions. Less expectedly, the BFMLALx instructions, which produce results bitwise-identical to the FP32 NEON FMA microkernels, also deliver a meaningful speedup. Below is a summary of the performance improvements, with the FP32 NEON FMA version as the baseline:

Version        Cortex-A510      Cortex-A710      Cortex-X2
BF16 BFMLALx   1.42X speedup    1.74X speedup    1.82X speedup
BF16 BFDOT     2.26X speedup    3.39X speedup    2.89X speedup

Maratyszcza commented 2 years ago

@justinmichaud @yurydelendik @dtig @penzn @abrown Do you have further concerns about the benefits of these instructions? If no, I'd like to move on to voting on the proposal.

dtig commented 2 years ago

Voting sgtm, no concerns specifically but I don't expect that we'd be supporting the BFDOT/AVX512-BF16 (but will support BFMLALB/BFMLALT) lowering right now.

penzn commented 2 years ago

I don't expect that we'd be supporting the BFDOT/AVX512-BF16 (but will support BFMLALB/BFMLALT) lowering right now.

From the bug that @yurydelendik filed, they are not planning to support either of the extensions, which doesn't make for good native support, particularly on x86. What is maybe more concerning to me, though, is that there is no consensus that it is feasible to add AVX512 support to engines in the near future: even if hardware support does appear in a few years, it would still require at least the basics of AVX512 to be implemented in engines.

I would like to second @justinmichaud's point about "future proofing" and maybe expand on it a bit. There is a danger that we won't anticipate semantics that will be introduced in the future, and there is also inertia in changing existing instruction sets; that's why it is probably a better strategy to target what is already shipping (one can argue that AVX512-BF16 is already established, but that runs into the issue described above).

If we do go forward with this, we have to make it explicit somewhere in the proposal which instruction sets this would possibly target. I think there is a discussion of using FMA as the baseline; maybe also mention that the BF16 extensions can be targeted if available.

yurydelendik commented 2 years ago

From the bug that @yurydelendik filed they are not planning to support either of the extensions,

Just to note: the initial implementation does not reflect future intent. Optimizations for the ARM or AVX512 BF16 extensions will be added as soon as there is confidence that they will become popular on client hardware.

Maratyszcza commented 2 years ago

What is maybe more concerning to me, though, is that there is no consensus that it is feasible to add AVX512 support to engines in the near future: even if hardware support does appear in a few years, it would still require at least the basics of AVX512 to be implemented in engines.

@dtig may correct me, but AFAIU the only roadblock to AVX512 support in the engines is the currently low market penetration of the technology, and it would be re-evaluated as AVX512 becomes more widely available.

If we do go forward with this, we have to make it explicit somewhere in the proposal which instruction sets this would possibly target. I think there is a discussion of using FMA as the baseline; maybe also mention that the BF16 extensions can be targeted if available.

On x86 this instruction can be lowered to SSE2 (separate FP MUL + FP ADD), AVX (the same, but with 3-operand forms), FMA3/FMA4 (using FMA instructions instead of FP MUL + FP ADD), or AVX512-BF16. Of course, the further down this list, the better the performance.

penzn commented 2 years ago

On x86 this instruction can be lowered to SSE2 (separate FP MUL + FP ADD), AVX (the same, but with 3-operand forms), FMA3/FMA4 (using FMA instructions instead of FP MUL + FP ADD), or AVX512-BF16. Of course, the further down this list, the better the performance.

What I mean is that we would need the BF16 extension in hardware to get a speedup with this instruction, and also that the extension enables more variability (a single instruction on Arm vs. the other lowerings).

By the way, thank you for the rundown of the proposal. I have two naive follow-up questions. Would we need other bfloat16 operations at some point? Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?

Maratyszcza commented 2 years ago

Would we need other bfloat16 operations at some point?

The only other commonly supported bfloat16 operation is conversion from FP32 to BF16 with rounding to nearest-even. However, this operation is potentially expensive to emulate when not supported in hardware.
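
For reference, a common software emulation of this round-to-nearest-even conversion, sketched in C with NaN handling omitted (function name illustrative):

#include <stdint.h>
#include <string.h>

/* FP32 -> BFloat16 with round-to-nearest-even, emulated with integer
   arithmetic. NaN inputs are not handled here; a complete implementation
   would detect them and return a quiet-NaN encoding. */
static inline uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t) ((bits + rounding_bias) >> 16);
}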

Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?

The two formats are unrelated. Native instruction sets don't even offer conversion operations between these two floating-point formats (although all conversions can be done through an FP32 intermediate without loss of accuracy).

penzn commented 2 years ago

Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?

The two formats are unrelated. Native instruction sets don't even offer conversion operations between these two floating-point formats (although all conversions can be done through an FP32 intermediate without loss of accuracy).

Do you think we would need to support IEEE half precision at some point? If I understand this correctly, like with bfloat16, there are fp16 extensions available or proposed, at least technically: an AVX512 extension on x86, an Arm v7 extension, and Arm v8.

Maratyszcza commented 2 years ago

Do you think we would need to support IEEE half precision at some point?

I think we'd need to support IEEE half-precision conversions, as hardware coverage for that is very good and efficient software emulation is available for SSE4/AVX1 processors. I don't expect support for half-precision arithmetic in WAsm any time soon because it is well supported only on recent ARM64 processors (x86 has the AVX512-FP16 specification but no implementations, and RISC-V doesn't even have a specification), and software emulation is very inefficient.

Maratyszcza commented 2 years ago

Please vote on the inclusion of the f32x4.relaxed_dot_bf16x8_add_f32 BFloat16 Dot Product instruction into the Relaxed SIMD proposal below:

:+1: For including the BFloat16 Dot Product instruction
:-1: Against including the BFloat16 Dot Product instruction

dtig commented 2 years ago

A more nuanced vote, as this option isn't presented in the vote: I vote neutral. If merged into the spec, we'd support the BFMLALB/BFMLALT lowering on Arm at this time, because it exposes the same level of entropy as exposing FMA. I'm not ruling out supporting the BFDOT or the AVX512 lowering in the future, but there are no plans to do so at this time.

ngzhian commented 2 years ago

Given the votes on https://github.com/WebAssembly/relaxed-simd/issues/77#issuecomment-1226191806 and https://github.com/WebAssembly/relaxed-simd/issues/77#issuecomment-1227820909 (I take the "thumbs-up" on Deepti's comment as a neutral vote), we have a majority in support of adding this instruction.

@justinmichaud if you wish, please expand more (beyond what you have mentioned above) on your vote against.

82 will be canceled given the heads-up on holidays. At the next sync I can present a summary of the instruction set and its current status, in preparation for presenting to the CG on phase advancement.

abrown commented 2 years ago

I thought I would have a chance to vote tomorrow and talk about this, but since that is cancelled I would like to discuss it here. My vote depends on whether the x86 version will ever be implementable in major engines like V8. IIRC, there is no way to encode EVEX instructions (the ones specified in AVX512-related features) in V8, and I sensed that adding the encoding was not a real possibility. If that is actually the case, then I would vote no 👎 on this instruction, since there would be no feasible path to implementing it in V8, a major engine. If, in fact, the lack of support for encoding EVEX instructions is only a temporary circumstance and V8 would accept a patch with the EVEX encoding for this instruction, then I would vote yes 👍. Can someone from the V8 team help me understand what the situation is with EVEX support?

dtig commented 2 years ago

@abrown There are two separate issues here: the entropy exposed by the native dot product instructions (both VDPBF16PS and BFDOT), and the fact that these instructions are only available on a small subset of newer hardware.

@dtig may correct me, but AFAIU the only roadblock to AVX512 support in the engines is the currently low market penetration of the technology, and it would be re-evaluated as AVX512 becomes more widely available.

I'd like to explicitly second @Maratyszcza's point above: when there are enough devices in the wild that use them, that would be a good reason to support generating these instructions in V8. From an engine perspective, though, I'd like to make sure we do this consistently instead of adding ad hoc encodings for some subset of SIMD operations. So the tl;dr is: though we might not add these encodings at this time (both AVX512-BF16 and ARM BF16), I wouldn't rule out supporting them in the future.

tlively commented 2 years ago

This is now implemented in LLVM/Clang.

Maratyszcza commented 2 years ago

Regarding the ARM BF16 extension on Apple processors: according to apple/llvm-project@677da09d0259d7530d32e85cb561bee15f0066e2, the BF16 extension is already supported on Apple A15, A16, and M2 processors.