Maratyszcza opened 2 years ago
@yurydelendik @dtig please add some engine implementor feedback on this instruction. Specifically, this instruction requires the AVX512-BF16 extension on the x86 side, and ARMv8.6-A with the BF16 extension on the ARM side. Please comment on possible support for these instructions.
(Typo to fix: in the reference lowering code, y = f32x4.relaxed_fma(a_lo, b_lo, y) -> y = f32x4.relaxed_fma(a_lo, b_lo, c))
I wanted to also comment on what we discussed about bf16 conversions to and from fp32 (correct me if anything is incorrect): @Maratyszcza is proposing that the native conversion instructions (e.g., VCVTNE2PS2BF16) not be included in Relaxed SIMD, because the conversion can be done by placing the bf16 bits in the top 16 bits of the fp32 and zero-filling the bottom 16 bits (or, in the opposite direction, by truncating the fp32 to its top 16 bits, which has the same semantics as round-toward-zero). @Maratyszcza made the point that these conversions, even though lossy, are rare in the domain he works in (?).
BFloat16 numbers are usually only used for storage and are up-cast to IEEE FP32 for computations. The proposed f32x4.relaxed_dot_bf16x8_add_f32x4 instruction likewise does its internal computations in FP32 and produces an FP32 result. Only the final result needs to be down-cast to BFloat16 for storing to memory; this downcast can be done by keeping only the high 16 bits of the FP32 number, which is sufficiently accurate if it happens relatively rarely, e.g., at operator boundaries in NN computations.
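For concreteness, here is a minimal scalar sketch (plain C; the helper names are mine, not from this thread) of the bit-level conversions described above: BF16 to FP32 by placing the BFloat16 bits into the upper half of a 32-bit word, and FP32 to BF16 by keeping only the upper 16 bits, which behaves like round-toward-zero.

```c
#include <stdint.h>
#include <string.h>

/* Widen BFloat16 (stored as uint16_t) to FP32: the BF16 bits become the
   high 16 bits of the FP32 encoding, the low 16 bits are zero-filled. */
static float bf16_to_f32(uint16_t bf16_bits) {
    uint32_t f32_bits = (uint32_t)bf16_bits << 16;
    float result;
    memcpy(&result, &f32_bits, sizeof(result));  /* bit-cast, no value conversion */
    return result;
}

/* Narrow FP32 to BFloat16 by dropping the low 16 bits of the encoding;
   this is the truncating (round-toward-zero) conversion discussed above. */
static uint16_t f32_to_bf16_trunc(float value) {
    uint32_t f32_bits;
    memcpy(&f32_bits, &value, sizeof(f32_bits));
    return (uint16_t)(f32_bits >> 16);
}
```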
I would like to add some implementer feedback from JSC/WebKit.
There is a lot of interesting feedback in this thread, and I have a few questions.
Some of this was mentioned in the meeting, but do we have any data about:
The performance/speedup of using this instruction, particularly on non-AVX512 intel chips?
How often we expect this instruction to be useful? As in, how general is it?
Are there any other mainstream programming languages or libraries that support it?
In general, we are concerned about instructions that might make deterministic fingerprinting much easier, and also about how they might cause compatibility/interoperability issues for users who browse the web on less-common CPU architectures.
An important part of the Relaxed-SIMD proposal for us in general is the fact that it justifies this risk with very compelling performance data. We would not consider implementing BFloat16 unless we have proof that this would speed up an entire class of reasonably popular programs.
Overall, since Relaxed-SIMD is in stage 3, we think that this change should be put in a separate proposal.
Thanks for the feedback @justinmichaud (and great to see JSC/WebKit being more active in WAsm SIMD)!
The performance/speedup of using this instruction, particularly on non-AVX512 intel chips?
Performance on x86 CPUs without the AVX512-BF16 extension would be similar to software emulation of f32x4.relaxed_dot_bf16x8_add_f32x4 with Relaxed SIMD instructions. It would still be faster than end-to-end FP32 processing if the computation is bandwidth-bound on the machine. Note that when the BF16 format was originally introduced, it was handled purely in software; hardware support arrived years later.
On a Galaxy S22 with Snapdragon 8 Gen 1 I see a throughput of 4 BFDOT/cycle (Cortex-X2 cores), 2 BFDOT/cycle (Cortex-A710 cores), and 1 BFDOT/cycle (Cortex-A510 cores). This is the same throughput as for single-precision FMLA instructions, but each BFDOT instruction produces twice as many results, leading to twice the peak performance.
How often we expect this instruction to be useful? As in, how general is it?
This instruction is known to be helpful for neural network computations (both for inference and training). Background Effects in Google Meet and Teachable Machine are examples of applications using such functionality.
Are there any other mainstream programming languages or libraries that support it?
The native VDPBF16PS and BFDOT instructions are exposed in C/C++ as the intrinsic functions _mm512_dpbf16_ps and vbfdotq_f32 (plus variants of these). Among machine learning frameworks, both TensorFlow and PyTorch support the BFloat16 data type (with emulation in lieu of hardware support).
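As a small illustration (my sketch, not from the thread), the ARM intrinsic mentioned above takes an FP32 accumulator and two BFloat16 vectors; this assumes a compiler and target with the BF16 extension enabled (e.g. -march=armv8.2-a+bf16). The x86 _mm512_dpbf16_ps intrinsic has the analogous shape on 512-bit vectors.

```c
#include <arm_neon.h>  /* requires a compiler/target with the BF16 extension */

/* Accumulate pairwise BF16 dot products into an FP32 accumulator:
   acc.f32[i] += a.bf16[2*i]*b.bf16[2*i] + a.bf16[2*i+1]*b.bf16[2*i+1] */
static float32x4_t dot_bf16_accumulate(float32x4_t acc,
                                       bfloat16x8_t a,
                                       bfloat16x8_t b) {
    return vbfdotq_f32(acc, a, b);
}
```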
we are concerned about instructions that might make deterministic fingerprinting much easier
Note that the implementation using BFMLALB/BFMLALT instructions on ARM is bitwise-compatible with the pre-BF16 emulation path.
Overall, since Relaxed-SIMD is in stage 3, we think that this change should be put in a separate proposal.
Unlike native platforms, WebAssembly doesn't have the capability to detect supported WebAssembly ISA extensions at run time: if a WAsm engine doesn't support even a single instruction in a WAsm module, the whole module will be rejected, even if the instruction is never executed at run time. Unfortunately, this means that WebAssembly extensions can't be as granular as native ISA extensions: developers have to rebuild the whole webapp for each set of WAsm extensions they target, and having an extension with a single instruction in it is an undue burden on developers.
Please note that it is permissible to add new instructions at Stage 3 (WebAssembly SIMD added dozens of instructions while at Stage 3). It is Stage 4 where the spec is frozen.
Thanks for the fast response!
On a Galaxy S22 with Snapdragon 8 Gen 1 I see a throughput of 4 BFDOT/cycle (Cortex-X2 cores), 2 BFDOT/cycle (Cortex-A710 cores), and 1 BFDOT/cycle (Cortex-A510 cores). This is the same throughput as for single-precision FMLA instructions, but each BFDOT instruction produces twice as many results, leading to twice the peak performance.
Thanks! Is there a real benchmark or library that is available to be compiled with this instruction vs just the normal set of relaxed-simd instructions? The two examples that you linked seem interesting, but I don't see how they could be ported today.
I am also wondering if there is an implementation using BFMLALB/BFMLALT that is complete enough to collect real-world performance numbers, since that is the only implementation we would consider on ARM.
We really need to see data in the context of a full benchmark or application here, because on our CPUs instruction throughput doesn't usually indicate general performance.
Note that the implementation using BFMLALB/BFMLALT instructions on ARM is bitwise-compatible with the pre-BF16 emulation path.
This is good to know. So if I am understanding things correctly, the current state of the world is that this is software emulated on every single chip that is available today except for ARMv8+, and ARMv8+ can only be distinguished from emulation by performance (which is fine)?
Please note that it is permissible to add new instructions at Stage 3 (WebAssembly SIMD added dozens of instructions while at Stage 3). It is Stage 4 where the spec is frozen.
This makes sense, but I would just like to note that this instruction seems like a pretty big departure from the others in the proposal. In addition, it seems like there are almost no desktop CPUs that natively support this instruction yet, so maybe it is too soon to standardize?
Is there a real benchmark or library that is available to be compiled with this instruction vs just the normal set of relaxed-simd instructions?
Alibaba's MNN and Tencent's NCNN support BF16 computations (via software conversions to/from FP32), but AFAICT neither uses the ARM BF16 extension yet. It is also worth noting that native software targeting ARM BF16 is unlikely to use BFDOT or the BFMLALB + BFMLALT pair, because the BF16 extension also includes a more powerful BFloat16 matrix multiplication instruction, BFMMLA. The latter, however, is too exotic and hard to emulate to be exposed in WebAssembly.
My plan is to implement BF16 GEMM (matrix-matrix multiplication) kernels in XNNPACK using generic NEON, BFDOT, and the BFMLALB + BFMLALT pair to evaluate the potential of the f32x4.relaxed_dot_bf16x8_add_f32x4 instruction. GEMM is responsible for 50-80% of runtime in convolutional neural networks (now common for Computer Vision tasks) and often 90%+ of runtime in transformer-type neural networks (now common in Natural Language Processing and replacing CNNs for Computer Vision), so it is a good proxy for overall neural network inference performance.
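To make the GEMM connection concrete, here is a rough scalar sketch (my own, not the XNNPACK microkernel) of a BF16 GEMM inner loop: the accumulator stays in FP32 and each step consumes a pair of BF16 elements, which is exactly the shape one lane of the proposed instruction vectorizes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static float bf16_load(const uint16_t* p) {  /* widen one BF16 element to FP32 */
    uint32_t bits = (uint32_t)(*p) << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* C[m][n] += sum_k A[m][k] * B[k][n], with A and B stored as BF16 and the
   accumulator kept in FP32. K is assumed even; each inner-loop iteration
   consumes a pair of BF16 elements, matching one lane of the proposed
   f32x4.relaxed_dot_bf16x8_add_f32x4 instruction. */
static void gemm_bf16_ref(size_t M, size_t N, size_t K,
                          const uint16_t* A, const uint16_t* B, float* C) {
    for (size_t m = 0; m < M; m++) {
        for (size_t n = 0; n < N; n++) {
            float acc = C[m * N + n];
            for (size_t k = 0; k < K; k += 2) {
                acc += bf16_load(&A[m * K + k])     * bf16_load(&B[k * N + n]);
                acc += bf16_load(&A[m * K + k + 1]) * bf16_load(&B[(k + 1) * N + n]);
            }
            C[m * N + n] = acc;
        }
    }
}
```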
So if I am understanding things correctly, the current state of the world is that this is software emulated on every single chip that is available today except for ARMv8+, and ARMv8+ can only be distinguished from emulation by performance (which is fine)?
This is correct. To be specific, BF16 extension on ARM was introduced as optional in ARMv8.2-A and became mandatory in ARMv8.6-A.
This makes sense, but I would just like to note that this instruction seems like a pretty big departure from the others in the proposal. In addition, it seems like there are almost no desktop CPUs that natively support this instruction yet, so maybe it is too soon to standardize?
I see little risk in adding this instruction, as BF16 extension is mandatory in ARMv8.6-A (and ARMv9), and thus eventually all ARM-based computers will have it. The latest Windows-based laptops (e.g. Lenovo ThinkPad X13s) already support BF16 extension. In the desktop x86 space, AMD Zen 4 is reported to support AVX512-BF16, so Intel will have to match to compete.
However, if we don't add this instruction, WebAssembly SIMD will miss out on a nearly 2X speedup for AI models, which would put privacy-focused technologies like Federated Learning at risk.
I see little risk in adding this instruction, as BF16 extension is mandatory in ARMv8.6-A (and ARMv9), and thus eventually all ARM-based computers will have it. The latest Windows-based laptops (e.g. Lenovo ThinkPad X13s) already support BF16 extension. In the desktop x86 space, AMD Zen 4 is reported to support AVX512-BF16, so Intel will have to match to compete.
To reiterate an important point so it doesn't get lost in a long discussion: we need to hear back from engine maintainers before we conclude that those instructions are available to the engines. Implementing support for a new ISA extension requires implementing new decoding, operands, etc., which comes with maintenance costs. I have brought this up before and it did not look like there was interest (I personally would welcome this changing 😉).
please add some engine implementor feedback on this instruction. Specifically, this instruction requires the AVX512-BF16 extension on the x86 side, and ARMv8.6-A with the BF16 extension on the ARM side. Please comment on possible support for these instructions.
I opened the issue at https://bugzilla.mozilla.org/show_bug.cgi?id=1778751 with intent to implement the reference lowering in SpiderMonkey. We feel that there is a need to support BFloat16 for matrix multiplication / ML use cases. In the future we will adjust the solution for ARM (and AVX) BF16 extensions as they become popular.
There is an inconsistency in argument ordering. For fma(a, b, c) we are doing result = a + b * c, but for relaxed_dot_bf16x8_add_f32(a, b, c) we changed the accumulator placement: result = dot(a, b) + c. Shall we keep a as the running sum? (Also, the reference lowering has a typo related to this.)
@yurydelendik Good point! However, it is fma(a, b, c) that behaves inconsistently: fma(a, b, c) is inconsistent with the same-named C/C++ functions. Let's have this discussion in the FMA proposal #27.
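For reference (my addition, not part of the original comments), the C/C++ library convention puts the addend last, whereas the relaxed_fma ordering discussed above places the accumulator first:

```c
#include <math.h>

/* C standard library: fma(x, y, z) computes x * y + z with a single rounding,
   i.e. the addend/accumulator is the last argument. The ordering discussed
   above, f32x4.relaxed_fma(a, b, c) = a + b * c, puts the accumulator first. */
double fused_example(double x, double y, double z) {
    return fma(x, y, z);  /* x * y + z */
}
```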
I implemented BF16 GEMM microkernels using FMA, ARMv8.2 BF16 BFDOT, and ARMv8.2 BF16 BFMLAL[B/T] instructions in google/XNNPACK#3326 and benchmarked them on the Galaxy S22 Ultra phone with the Snapdragon 8 Gen 1 SoC. Below I present results on the M=192, N=512, K=512 GEMM from the MobileNet v1 architecture (this GEMM case typically exhibits the best achievable performance among MobileNet v1 GEMMs):
Version | Cortex-A510 | Cortex-A710 | Cortex-X2 |
---|---|---|---|
NEON FMA | 5.79 GFLOPS | 17.5 GFLOPS | 42.2 GFLOPS |
BF16 BFMLALx | 8.22 GFLOPS | 30.5 GFLOPS | 76.9 GFLOPS |
BF16 BFDOT | 13.1 GFLOPS | 59.4 GFLOPS | 122 GFLOPS |
As expected, BFDOT delivers an enormous speedup over the version using FP32 NEON FMA instructions. Less expected, the BFMLALx instructions, which produce results bitwise-identical to the FP32 NEON FMA microkernels, also deliver a meaningful speedup. Below is a summary of the performance improvements, taking the FP32 NEON FMA version as the baseline:
Version | Cortex-A510 | Cortex-A710 | Cortex-X2 |
---|---|---|---|
BF16 BFMLALx | 1.42X speedup | 1.74X speedup | 1.82X speedup |
BF16 BFDOT | 2.26X speedup | 3.39X speedup | 2.89X speedup |
@justinmichaud @yurydelendik @dtig @penzn @abrown Do you have further concerns about the benefits of these instructions? If no, I'd like to move on to voting on the proposal.
Voting sgtm, no concerns specifically but I don't expect that we'd be supporting the BFDOT/AVX512-BF16 (but will support BFMLALB/BFMLALT) lowering right now.
I don't expect that we'd be supporting the BFDOT/AVX512-BF16 (but will support BFMLALB/BFMLALT) lowering right now.
From the bug that @yurydelendik filed they are not planning to support either of the extensions; this doesn't make for good native support, particularly on x86. What is maybe more concerning to me, though, is that there is no consensus that it is feasible to add AVX512 support to engines in the near future. Let's say hardware support does appear in a few years; it would still require at least the basics of AVX512 implemented in engines.
I would like to second @justinmichaud's point about "future proofing" and maybe expand on it a bit. There is the danger that we won't capture semantics that will be introduced in the future, and there is also inertia in changing existing instruction sets; that's why it is probably a better strategy to target what is already shipping (one can argue that AVX512_BF16 is already established, but that hits the issue described above).
If we do go forward with this, we have to make explicit somewhere in the proposal which instruction sets this would possibly target. I think there is a discussion of using FMA as a baseline; maybe also mention that BF16 extensions can be targeted if available.
From the bug that @yurydelendik filed they are not planning to support either of the extensions,
Just to note: the initial implementation does not reflect the future intent. Optimizations for the ARM or AVX512 extensions will be added as soon as there is confidence that they will become popular on client hardware.
What is maybe more concerning to me, though, is that there is no consensus that it is feasible to add AVX512 support to engines in the near future. Let's say hardware support does appear in a few years; it would still require at least the basics of AVX512 implemented in engines.
@dtig may correct me, but AFAIU the only roadblock to AVX512 support in the engines is the currently low market penetration of the technology, and it would be re-evaluated as AVX512 becomes more widely available.
If we do go forward with this, we have to make explicit somewhere in the proposal which instruction sets this would possibly target. I think there is a discussion of using FMA as a baseline; maybe also mention that BF16 extensions can be targeted if available.
On x86 this instruction can be lowered to SSE2 (separate FP MUL + FP ADD), AVX (the same, but with 3-operand forms), FMA3/FMA4 (using FMA instructions instead of FP MUL + FP ADD), or AVX512-BF16. Of course, the further down this list, the better the performance.
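As an illustration of the SSE2 end of that list (a sketch under my own function name, not actual engine code), the low and high BF16 halves of each 32-bit lane can be isolated with a shift and a mask, then multiplied and added as ordinary FP32 vectors:

```c
#include <emmintrin.h>  /* SSE2 */

/* y = dot_bf16x8_add_f32(a, b, c) lowered with plain SSE2: separate FP32
   multiplies and adds, no FMA, matching the shape of the reference lowering. */
static __m128 dot_bf16x8_add_f32_sse2(__m128i a, __m128i b, __m128 c) {
    const __m128i hi_mask = _mm_set1_epi32((int)0xFFFF0000);
    __m128 a_lo = _mm_castsi128_ps(_mm_slli_epi32(a, 16));     /* low BF16 half of each lane */
    __m128 b_lo = _mm_castsi128_ps(_mm_slli_epi32(b, 16));
    __m128 a_hi = _mm_castsi128_ps(_mm_and_si128(a, hi_mask)); /* high BF16 half of each lane */
    __m128 b_hi = _mm_castsi128_ps(_mm_and_si128(b, hi_mask));
    __m128 y = _mm_add_ps(c, _mm_mul_ps(a_lo, b_lo));
    return _mm_add_ps(y, _mm_mul_ps(a_hi, b_hi));
}
```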
On x86 this instruction can be lowered to SSE2 (separate FP MUL + FP ADD), AVX (the same, but with 3-operand forms), FMA3/FMA4 (using FMA instructions instead of FP MUL + FP ADD), or AVX512-BF16. Of course, the further down this list, the better the performance.
What I mean is that we would need the BF16 extension in hardware to get a speedup with this instruction, and also that the extension enables more variability (a single instruction on Arm vs. the other lowerings).
By the way, thank you for the rundown of the proposal. I have two naive follow-up questions. Would we need other bfloat16 operations at some point? Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?
Would we need other bfloat16 operations at some point?
The only other commonly supported bfloat16 operation is conversion from FP32 to BF16 with rounding to nearest-even. However, this operation is potentially expensive to emulate when not supported in hardware.
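For illustration, a common scalar emulation of that round-to-nearest-even narrowing looks roughly like the sketch below (NaN inputs would need an extra check in a production version):

```c
#include <stdint.h>
#include <string.h>

/* Convert FP32 to BFloat16 with round-to-nearest-even by adding a bias that
   depends on the bit just above the truncation point. NaN inputs are not
   special-cased here; a production version should check for them first. */
static uint16_t f32_to_bf16_rne(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof(bits));
    uint32_t lsb = (bits >> 16) & 1;        /* last bit that will be kept */
    uint32_t rounding_bias = 0x7FFF + lsb;  /* ties round to even */
    return (uint16_t)((bits + rounding_bias) >> 16);
}
```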
Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?
The two formats are unrelated. Native instruction sets don't even offer conversion operations between these two floating-point formats (although all conversions can be done through an FP32 intermediate without loss of accuracy).
Also, what is the relationship between BF16 and FP16 and do we expect to go to FP16 at some point?
The two formats are unrelated. Native instruction sets don't even offer conversion operations between these two floating-point formats (although all conversions can be done through an FP32 intermediate without loss of accuracy).
Do you think we would need to support IEEE half precision at some point? If I understand this correctly, like with the bfloat16, there are fp16 extensions available or proposed, at least technically: AVX512 extension on x86, Arm v7 extension, and Arm v8.
Do you think we would need to support IEEE half precision at some point?
I think we'd need to support IEEE half-precision conversion, as hardware coverage for that is very good and efficient software emulation is available for SSE4/AVX1 processors. I don't expect support for half-precision arithmetic in WAsm any time soon, because it is well supported only on recent ARM64 processors (x86 has an AVX512-FP16 specification, but no implementations; RISC-V doesn't even have a specification), and software emulation is very inefficient.
Please vote on the inclusion of the f32x4.relaxed_dot_bf16x8_add_f32x4 BFloat16 Dot Product instruction into the Relaxed SIMD proposal below:
:+1: For including the BFloat16 Dot Product instruction
:-1: Against including the BFloat16 Dot Product instruction
A more nuanced vote, as that option isn't presented in the poll: I vote neutral. If this is merged into the spec, we'd support the BFMLALB/BFMLALT lowering on Arm at this time, because it exposes the same level of entropy as exposing FMA. Not ruling out supporting the BFDOT or the AVX512 lowering in the future, but no plans to do so at this time.
Given the votes on https://github.com/WebAssembly/relaxed-simd/issues/77#issuecomment-1226191806 and https://github.com/WebAssembly/relaxed-simd/issues/77#issuecomment-1227820909 (I take the "thumbs-up" on Deepti's comment as a neutral vote), we have a majority in support of adding this instruction.
@justinmichaud if you wish, please expand more (beyond what you have mentioned above) on your vote against.
I thought I would have a chance to vote tomorrow and talk about this but since that is cancelled I would like to discuss it here. My vote depends on whether the x86 version will ever be implementable on major engines like V8. IIRC, there is no way to encode EVEX instructions (the ones specified in AVX512-related features) in V8 and I sensed that encoding was not a real possibility. If that is actually the case, then I would vote no 👎 to this instruction since there is no feasible path to implement this in V8, a major engine. If in fact the lack of support to encode EVEX instructions was only a temporary circumstance and V8 would accept a patch with the EVEX encoding for this instruction, then I would vote yes 👍. Can someone from the V8 team help me understand what the situation is with EVEX support?
@abrown There are two separate issues: the entropy exposed by the native dot product extensions (both VDPBF16PS and BFDOT), and the fact that these instructions are only available on a small subset of newer hardware.
@dtig may correct me, but AFAIU the only roadblock to AVX512 support in the engines is the currently low market penetration of the technology, and it would be re-evaluated as AVX512 becomes more widely available.
I'd like to explicitly second @Maratyszcza's point above: when there are enough devices in the wild that use them, that would be a good reason to support these instructions being generated in V8. Though from an engine perspective, I'd like to make sure we do this consistently instead of adding ad hoc encodings for some subset of SIMD operations. So the tl;dr is that though we might not add these encodings at this time (both AVX512-BF16 and BF16), I wouldn't rule out supporting them in the future.
This is now implemented in LLVM/Clang.
Regarding ARM BF16 extension on Apple processors: according to apple/llvm-project@677da09d0259d7530d32e85cb561bee15f0066e2, BF16 extension is already supported on Apple A15, A16, and M2 processors.
Introduction
BFloat16 is a 16-bit floating-point format that represents IEEE FP32 numbers truncated to their high 16 bits. BFloat16 numbers have the same exponent range as IEEE FP32 numbers but far fewer mantissa bits, and the format is primarily used in neural network computations, which are very tolerant of reduced precision. Despite being a non-IEEE format, BFloat16 computations are supported in recent and upcoming processors from Intel (Cooper Lake, Alder Lake, Sapphire Rapids), AMD (Zen 4), and ARM (Cortex-A510, Cortex-A710, Cortex-X2), and can be efficiently emulated on older hardware by zero-padding BFloat16 numbers to convert them to IEEE FP32.
Both the x86 and ARM BFloat16 extensions provide instructions to compute an FP32 dot product of two-element BFloat16 pairs with addition to an FP32 accumulator. However, there are some differences in the details:

x86 introduced support for BFloat16 computations with the AVX512-BF16 extension, and the BFloat16 dot product functionality is exposed via the VDPBF16PS (Dot Product of BF16 Pairs Accumulated into Packed Single Precision) instruction. The instruction VDPBF16PS dest, src1, src2 computes

dest.fp32[i] = dest.fp32[i] + cast<fp32>(src1.bf16[2*i]) * cast<fp32>(src2.bf16[2*i]) + cast<fp32>(src1.bf16[2*i+1]) * cast<fp32>(src2.bf16[2*i+1])

for each 32-bit lane i in the accumulator register dest. Additionally, denormalized numbers in inputs are treated as zeroes and denormalized numbers in outputs are replaced with zeroes.

ARM added support for BFloat16 computations with the ARMv8.2-A BFDOT instruction (mandatory with ARMv8.6-A), which implements the same computation:

dest.fp32[i] = dest.fp32[i] + cast<fp32>(src1.bf16[2*i]) * cast<fp32>(src2.bf16[2*i]) + cast<fp32>(src1.bf16[2*i+1]) * cast<fp32>(src2.bf16[2*i+1])

where all multiplications and additions are unfused and use the non-IEEE Round-to-Odd rounding mode. Additionally, denormalized numbers in inputs are treated as zeroes, and denormalized numbers in outputs are replaced with zeroes.

ARMv9.2-A introduced an additional "Extended BFloat16" (EBF16) mode, which modifies the behavior of BFDOT as follows:

dest.fp32[i] = dest.fp32[i] + temp.fp32[i]

where temp.fp32[i] is the two-element sum of products cast<fp32>(src1.bf16[2*i]) * cast<fp32>(src2.bf16[2*i]) + cast<fp32>(src1.bf16[2*i+1]) * cast<fp32>(src2.bf16[2*i+1]), calculated as a fused sum-of-products operation with only a single (standard, Round-to-Nearest-Even) rounding at the end.

Furthermore, ARMv8.6-A introduced the BFMLALB/BFMLALT instructions, which perform a widening fused multiply-add of the even/odd BFloat16 elements into an IEEE FP32 accumulator. A pair of these instructions can similarly be used to implement a BFloat16 dot product primitive.

What are the instructions being proposed?
I propose a 2-element dot product with accumulation instruction with BFloat16 input elements and FP32 accumulator and output elements. The instruction has relaxed semantics to allow lowering to the native VDPBF16PS (x86) and BFDOT (ARM) instructions, as well as to FMA instructions where available.

I suggest f32x4.relaxed_dot_bf16x8_add_f32x4 as a tentative name for the instruction. In a sense it is a floating-point equivalent of the i32x4.dot_i16x8_add_s instruction proposed in WebAssembly/simd#127.

What are the semantics of these instructions?
y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) computes

y.fp32[i] = c.fp32[i] + cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) + cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])
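Spelled out as scalar C over raw lane values (a sketch of mine to make the pseudocode above concrete; the helper names are not part of the proposal):

```c
#include <stdint.h>
#include <string.h>

static float bf16_as_f32(uint16_t bf16_bits) {  /* exact widening cast */
    uint32_t f32_bits = (uint32_t)bf16_bits << 16;
    float f;
    memcpy(&f, &f32_bits, sizeof(f));
    return f;
}

/* One 32-bit lane i of y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c):
   a and b hold 8 BF16 elements each, c and y hold 4 FP32 elements each. */
static float dot_bf16_add_f32_lane(const uint16_t a[8], const uint16_t b[8],
                                   const float c[4], int i) {
    return c[i]
         + bf16_as_f32(a[2 * i])     * bf16_as_f32(b[2 * i])
         + bf16_as_f32(a[2 * i + 1]) * bf16_as_f32(b[2 * i + 1]);
}
```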
The relaxed nature of the instruction manifests in several allowable options:
Evaluation ordering options

We permit two evaluation orders for the computation (see the sketch after this list):

1. Compute cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) + cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1]) in the first step, then add the result to c.fp32[i] in the second step.
2. Compute y.fp32[i] = c.fp32[i] + cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) in the first step, then compute y.fp32[i] += cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1]) in the second step.

Fusion options

Each multiply-add step above may be computed either as an unfused multiplication followed by an addition, or as a single fused multiply-add.

Rounding options

Intermediate results may be rounded with the standard Round-to-Nearest-Even mode or with the non-IEEE Round-to-Odd mode used by ARM BFDOT, and denormal inputs and outputs may be flushed to zero.
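A scalar sketch of the two permitted evaluation orders, where a0, b0 stand for the widened even elements cast<fp32>(a.bf16[2*i]), cast<fp32>(b.bf16[2*i]) and a1, b1 for the odd ones; whether each step is additionally fused, and how intermediates are rounded, is what the fusion and rounding options above leave open:

```c
/* Order 1: sum the two products first, then add the accumulator. */
static float lane_order1(float c_i, float a0, float a1, float b0, float b1) {
    float dot = a0 * b0 + a1 * b1;  /* may itself be fused or unfused */
    return c_i + dot;
}

/* Order 2: fold the products into the accumulator one at a time,
   as a BFMLALB/BFMLALT or FMA-based lowering would. */
static float lane_order2(float c_i, float a0, float a1, float b0, float b1) {
    float y = c_i + a0 * b0;
    return y + a1 * b1;
}
```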
How will these instructions be implemented?
x86/x86-64 processors with AVX512-BF16 instruction set

c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to VDPBF16PS xmm_c, xmm_a, xmm_b
y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to VMOVDQA xmm_y, xmm_c + VDPBF16PS xmm_y, xmm_a, xmm_b
ARM64 processors with BF16 extension (using BFDOT instructions)

y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to MOV Vy.16B, Vc.16B + BFDOT Vy.4S, Va.8H, Vb.8H
c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to BFDOT Vc.4S, Va.8H, Vb.8H
ARM64 processors with BF16 extension (using BFMLALB/BFMLALT instructions)

y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to MOV Vy.16B, Vc.16B + BFMLALB Vy.4S, Va.4H, Vb.4H + BFMLALT Vy.4S, Va.8H, Vb.8H
c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to BFMLALB Vc.4S, Va.4H, Vb.4H + BFMLALT Vc.4S, Va.8H, Vb.8H
Reference lowering through the WAsm Relaxed SIMD instruction set
y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c) is lowered to:

const v128_t a_lo = wasm_i32x4_shl(a, 16)
const v128_t b_lo = wasm_i32x4_shl(b, 16)
const v128_t a_hi = wasm_v128_and(a, wasm_i32x4_const_splat(0xFFFF0000))
const v128_t b_hi = wasm_v128_and(b, wasm_i32x4_const_splat(0xFFFF0000))
y = f32x4.relaxed_fma(a_lo, b_lo, c)
y = f32x4.relaxed_fma(a_hi, b_hi, y)
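A compilable version of that reference lowering might look like the sketch below. I'm assuming clang's wasm_simd128.h with relaxed SIMD enabled and a wasm_f32x4_relaxed_madd(a, b, c) intrinsic computing a * b + c; the exact intrinsic name and argument order may differ between toolchain versions.

```c
#include <wasm_simd128.h>  /* clang, assuming relaxed SIMD intrinsics are available */

/* Reference lowering of y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c):
   widen the low/high BF16 half of each 32-bit lane to FP32, then use two
   relaxed multiply-adds to accumulate into c. */
static v128_t dot_bf16x8_add_f32x4_ref(v128_t a, v128_t b, v128_t c) {
    const v128_t a_lo = wasm_i32x4_shl(a, 16);
    const v128_t b_lo = wasm_i32x4_shl(b, 16);
    const v128_t a_hi = wasm_v128_and(a, wasm_i32x4_const_splat((int32_t)0xFFFF0000));
    const v128_t b_hi = wasm_v128_and(b, wasm_i32x4_const_splat((int32_t)0xFFFF0000));
    v128_t y = wasm_f32x4_relaxed_madd(a_lo, b_lo, c);  /* assumed intrinsic name */
    return wasm_f32x4_relaxed_madd(a_hi, b_hi, y);
}
```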
How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
The use of the AVX512-BF16 VDPBF16PS instruction can be detected through testing for behavior on denormal inputs and outputs. The use of the ARM BFDOT instruction can be detected through testing for behavior on denormal inputs and outputs, or testing for Round-to-Odd rounding. However, an ARM implementation using a pair of BFMLALB/BFMLALT instructions would be indistinguishable from a generic lowering using FMA instructions (but probably slower than the BFDOT lowering).

What use cases are there?