Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

F16C is not actually used #2037

Open Artoria2e5 opened 4 years ago

Artoria2e5 commented 4 years ago

The float32/float16 conversion routine in ncnn is written using a ton of bit-twiddling magic. This compiles to some very long code, as the compiler is not able to recognize something of this complexity as a "simple" conversion (_cvtsh_ss).

On the other hand, the source code does mention f16c a few times with the understanding that it should be part of AVX2. In that case, someone should probably write a version that wraps around the intrinsics. (The ARM side is properly covered with vcvt_f16_f32.)
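
Something like this is roughly what I mean (untested sketch, function names made up). Note that GCC/clang gate these intrinsics behind `-mf16c`, which `-mavx2` alone does not enable, even though every AVX2-era CPU also has F16C:

```cpp
#include <immintrin.h>

#if defined(__F16C__)
// Scalar fp16 <-> fp32 via the F16C instructions (vcvtph2ps / vcvtps2ph).
static inline float half2float(unsigned short h)
{
    return _cvtsh_ss(h);
}

static inline unsigned short float2half(float f)
{
    // Round to nearest even, matching the usual IEEE default.
    return _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
#endif
```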

Artoria2e5 commented 4 years ago

Only clang recognizes __fp16 on non-NEON platforms. It can't really vectorize the loop, but that's still better than a lot of branches spent on conversion.
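
For concreteness, the kind of loop I mean (a rough sketch that builds with clang on x86; without F16C each cast becomes a runtime helper call, I believe, and with `-mf16c` it lowers to vcvtph2ps):

```cpp
// clang-only on x86: __fp16 is a storage-only type, so the cast below
// makes the compiler emit the conversion itself.
void half2float_loop(const __fp16* src, float* dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (float)src[i];
}
```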

GCC seems to have a __gnu_f2h_ieee, but it's a lot more cautious about NaN.

baryluk commented 4 years ago

There are a number of extensions on x86: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=f16 and https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=fp16 , some in AVX512-BF16, some in the separate F16C extension (which is also available on AMD CPUs).

_mm256_cvtps_ph and _mm256_cvtph_ps would be the two most useful ones. They use the standard IEEE half float format. F16C only provides conversions between single precision and half floats, which is plenty for many use cases.
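
For example, converting 8 values per iteration (a quick sketch assuming n is a multiple of 8; compile with -mf16c):

```cpp
#include <immintrin.h>

void half2float_x8(const unsigned short* src, float* dst, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m128i h = _mm_loadu_si128((const __m128i*)(src + i)); // 8 x fp16
        _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(h));          // 8 x fp32
    }
}
```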

AVX512-BF16 uses a different format (bfloat16). Not only does it provide conversions, it also supports arithmetic, dot products, and such, with bfloat16 or fp32 accumulators AFAIK.
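
Roughly like this, if I read the intrinsics guide right (sketch, requires -mavx512bf16):

```cpp
#include <immintrin.h>

// One step of a bf16 dot product with an fp32 accumulator:
// pack 2 x 16 fp32 lanes into 32 bf16 lanes, then acc += a * b.
__m512 bf16_dot_step(__m512 acc, __m512 a_lo, __m512 a_hi,
                     __m512 b_lo, __m512 b_hi)
{
    __m512bh a = _mm512_cvtne2ps_pbh(a_hi, a_lo); // low lanes come from a_lo
    __m512bh b = _mm512_cvtne2ps_pbh(b_hi, b_lo);
    return _mm512_dpbf16_ps(acc, a, b);
}
```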

I heard that in clang you can pass -fnative-half-type -fallow-half-arguments-and-returns and it will support __fp16 (as well as the more standardized _Float16) on x86, but I didn't have success with that.

Here is a bit more optimized code for manual conversion: https://gist.github.com/martin-kallman/5049614

The produced code is really decent: https://godbolt.org/z/ez5a6T. It is still not the best thing to inline, though, and has a branch in it (to handle denormals) when using GCC.

It doesn't handle NaN or Infinity properly, though.
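
For reference, a scalar fp16 -> fp32 that does handle Inf/NaN (and denormals) correctly, at the cost of branching (my own sketch, not from the gist):

```cpp
#include <cstdint>
#include <cstring>

// fp16 layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;

    uint32_t bits;
    if (exp == 0x1f)                // Inf/NaN: fp32 exponent all-ones
        bits = sign | 0x7f800000 | (man << 13);
    else if (exp == 0 && man == 0)  // signed zero
        bits = sign;
    else if (exp == 0)              // denormal: shift until normalized
    {
        int e = -1;
        do { e++; man <<= 1; } while ((man & 0x400) == 0);
        bits = sign | (uint32_t)(127 - 15 - e) << 23 | ((man & 0x3ff) << 13);
    }
    else                            // normal: rebias exponent 15 -> 127
        bits = sign | (exp + 127 - 15) << 23 | (man << 13);

    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```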

Artoria2e5 commented 4 years ago

AVX-512 can be more of a problem, since it requires yet another subarch. Given how architectures are currently handled -- with AVX2 needing special accommodation in cmake instead of just a normal -march -- I am a bit pessimistic. Granted, all that is really required is replacing the AVX2 macro with the standard __AVX2__.
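
E.g. something along these lines (sketch; the NCNN_ macro name is made up):

```cpp
// GCC/clang define __F16C__ with -mf16c; MSVC has no F16C macro, so
// __AVX2__ (set by /arch:AVX2) serves as a proxy there (my assumption).
#if defined(__F16C__) || (defined(_MSC_VER) && defined(__AVX2__))
#define NCNN_F16C 1
#endif
```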

My issue with the current API is that it meshes poorly with the multiple-values-at-once intrinsics, so I would really hope for the compiler to do it for me. The two functions convert a single value each, but they are both heavily used in loops! Doing it one value at a time might still be faster than the current branchy conversion, though.

As for NaN and infinity in bf16, I doubt we are doing it correctly right now. It can't be that much worse.