Artoria2e5 opened 4 years ago
Only clang recognizes __fp16 on non-NEON platforms. It can't really vectorize the loop, but that's still better than a lot of branches spent on conversion.
GCC seems to have a __gnu_f2h_ieee, but it's a lot more cautious about NaN.
There are a number of extensions on x86: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=f16 and https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=fp16 , some in AVX512-BF16, some in the separate F16C extension (which is also available on AMD CPUs).
_mm256_cvtps_ph and _mm256_cvtph_ps would be the two most useful ones. They use the standard IEEE half-float format. F16C only provides conversions between single precision and half floats, which is plenty for many use cases.
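A sketch of how those two intrinsics might be used in a conversion loop (function name is mine, not from any library; the scalar fallback is a crude truncating conversion that only handles normal values, so the sketch still builds when F16C isn't enabled):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#if defined(__F16C__)
#include <immintrin.h>
#endif

/* Convert n floats to IEEE half (n assumed a multiple of 8 for brevity).
   Uses F16C when the compiler enables it; otherwise a crude scalar
   fallback that only handles normal, exactly representable values. */
static void floats_to_halves(const float *src, uint16_t *dst, size_t n) {
#if defined(__F16C__)
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        __m128i h = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        _mm_storeu_si128((__m128i *)(dst + i), h);
    }
#else
    for (size_t i = 0; i < n; i++) {
        uint32_t x;
        memcpy(&x, &src[i], sizeof x);
        uint32_t sign = (x >> 16) & 0x8000u;
        int32_t  e    = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;
        uint32_t m    = x & 0x7FFFFFu;
        if (e <= 0)       dst[i] = (uint16_t)sign;            /* underflow -> signed zero */
        else if (e >= 31) dst[i] = (uint16_t)(sign | 0x7C00); /* overflow -> infinity */
        else              dst[i] = (uint16_t)(sign | ((uint32_t)e << 10) | (m >> 13));
    }
#endif
}
```

The vector path converts eight values per iteration, which is exactly the "multiple values at once" shape this thread is asking for.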
AVX512-BF16 uses a different format (bfloat16). Not only does it provide conversions, it also supports arithmetic, dot products, and such, including bfloat16 or fp32 accumulators AFAIK.
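Conversion-wise, bfloat16 is much simpler than IEEE half: it is literally the top 16 bits of a float32. A portable sketch (function names mine; this truncates, whereas hardware typically rounds to nearest even):

```c
#include <stdint.h>
#include <string.h>

/* bfloat16 keeps float32's sign and full 8-bit exponent but only
   7 mantissa bits, so conversion is just taking/extending the top
   16 bits of the float32 bit pattern. */
static uint16_t f32_to_bf16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    return (uint16_t)(x >> 16); /* truncation; HW usually rounds to nearest even */
}

static float bf16_to_f32(uint16_t b) {
    uint32_t x = (uint32_t)b << 16;
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

This is also why bf16 has the same range as float32 but far less precision: the exponent survives intact and only mantissa bits are dropped.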
I heard that in clang you can pass -fnative-half-type -fallow-half-arguments-and-returns and it will support __fp16 (as well as the more standardized _Float16) on x86, but I didn't have success with that.
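For what it's worth, where _Float16 does work the code is very plain; compilers that support it define the __FLT16_MANT_DIG__ feature macro, so a sketch like this (names mine) degrades to plain float elsewhere and the compiler emits the widening conversions itself:

```c
/* Where the compiler supports _Float16 it defines __FLT16_MANT_DIG__;
   otherwise fall back to float so the sketch still builds everywhere. */
#if defined(__FLT16_MANT_DIG__)
typedef _Float16 half_t;
#else
typedef float half_t;
#endif

static float sum_halves(const half_t *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += (float)a[i]; /* per-element widen; compiler picks the conversion */
    return s;
}
```

When the native type is available, this is the case where you get to "hope for the compiler to do it for you": the conversions become instructions rather than library calls.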
Here is a bit more optimized code for manual conversion: https://gist.github.com/martin-kallman/5049614
The produced code is really decent: https://godbolt.org/z/ez5a6T. Still, it's not the best thing to inline, and it has a branch in it (to handle denormals) when using GCC.
It doesn't handle NaN or Infinity properly, though.
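Handling Inf/NaN in the half-to-float direction only costs one extra case on the exponent field. A hedged sketch (name mine; denormal halves are flushed to zero for brevity, a full version would normalize them):

```c
#include <stdint.h>
#include <string.h>

/* Half -> float, propagating Inf and NaN. The max half exponent
   (e == 31) maps to the max float exponent, so Inf stays Inf and
   any nonzero mantissa stays a NaN payload. */
static float half_to_float_full(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t e    = (h >> 10) & 0x1Fu;
    uint32_t m    = h & 0x3FFu;
    uint32_t x;
    if (e == 31)     x = sign | 0x7F800000u | (m << 13); /* Inf (m==0) or NaN */
    else if (e == 0) x = sign;                           /* zero/denormal -> zero */
    else             x = sign | ((e - 15 + 127) << 23) | (m << 13);
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

The Inf/NaN case is branchy but rare, which is presumably why the linked gist skipped it.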
AVX-512 can be more of a problem, since it requires yet another subarch. Given how architectures are currently handled -- with AVX2 needing special accommodation in cmake instead of just a normal -march -- I am a bit pessimistic. Granted, all that's really required is replacing the AVX2 macro with the standard __AVX2__.
My issue with the current API is that it fits poorly with the multiple-values-at-once intrinsics, so I would really hope for a compiler to do it for me. The two functions each convert a single value, but both are heavily used in loops. Converting one value at a time might still be faster than all the branching, though.
As for NaN and infinity in bf16, I doubt we are doing it correctly right now. It can't be that much worse.
The float32/float16 conversion routine in ncnn is written using a ton of bit-meddling magic. This compiles to some very long code, as the compiler is not able to recognize something of this complexity as a "simple" conversion (_cvtsh_ss). On the other hand, the source code does mention f16c a few times with the understanding that it should be part of AVX2. In that case, someone should probably write a version that wraps around the intrinsics. (The ARM side is properly covered with vcvt_f16_f32.)
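A sketch of what such a wrapper might look like, dispatching at compile time between F16C, NEON, and a portable fallback (function name mine; the NEON guard macros are my assumption of the right feature test, and the fallback flushes denormals for brevity):

```c
#include <stdint.h>
#include <string.h>
#if defined(__F16C__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Single-element half -> float: use the native conversion where the
   target has one, bit manipulation otherwise. */
static float h2f(uint16_t h) {
#if defined(__F16C__)
    return _cvtsh_ss(h);
#elif defined(__ARM_NEON) && defined(__ARM_FP16_FORMAT_IEEE)
    /* broadcast into a half lane vector, convert, take lane 0 */
    float16x4_t v = vreinterpret_f16_u16(vdup_n_u16(h));
    return vget_lane_f32(vcvt_f32_f16(v), 0);
#else
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t e    = (h >> 10) & 0x1Fu;
    uint32_t m    = h & 0x3FFu;
    uint32_t x;
    if (e == 31)     x = sign | 0x7F800000u | (m << 13); /* Inf / NaN */
    else if (e == 0) x = sign;                           /* flush denormals */
    else             x = sign | ((e - 15 + 127) << 23) | (m << 13);
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
#endif
}
```

With the intrinsic paths the function should inline to a single conversion instruction, which is exactly what the hand-rolled bit magic fails to become.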