Open bassmang opened 10 months ago
google benchmark should also show similar results, but it needs to prevent optimizations carefully.
however, in general we could just rely on compiler auto-vectorization to optimize this. Using -O3 -ffast-math -march=native
can get much closer to the hand-written intrinsic performance.
#include <benchmark/benchmark.h>
static void BM_inv_sqrt(benchmark::State& state)
{
volatile float x = 4.f;
for (auto _ : state)
{
benchmark::DoNotOptimize(inv_sqrt(x));
benchmark::ClobberMemory();
}
}
static void BM_rsqrt(benchmark::State& state)
{
volatile float x = 4.f;
for (auto _ : state)
{
benchmark::DoNotOptimize(rsqrt(x));
benchmark::ClobberMemory();
}
}
BENCHMARK(BM_inv_sqrt);
BENCHMARK(BM_rsqrt);
BENCHMARK_MAIN();
from GCC doc: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options
-mrecip
This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -ffinite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).
Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.
Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.
1.f / std::sqrt(x) has become the default for inverse square root in gd. Tracking issue to add flag to enable sse2 optimization here: #4666
Min repro:
include
include
include
include
include
inline float inv_sqrt(const float x) { return 1.f / std::sqrt(x); }
inline float rsqrt(const float f) { __m128 temp = _mm_set_ss(f); temp = _mm_rsqrt_ss(temp); return _mm_cvtss_f32(temp); }
int main(int argc, char* argv[]) { constexpr size_t num_repeats = 100000000; volatile float in = 4.f; volatile float out;
{ const auto start = std::chrono::high_resolution_clock::now(); for (size_t i = 0; i < num_repeats; ++i) { out = inv_sqrt(in); } const auto stop = std::chrono::high_resolution_clock::now();
}
{ const auto start = std::chrono::high_resolution_clock::now(); for (size_t i = 0; i < num_repeats; ++i) { out = rsqrt(in); } const auto stop = std::chrono::high_resolution_clock::now();
}
return EXIT_SUCCESS; }
on my laptop, the results are: zwd@zwd-msft:~$ g++ rsqrt.cc -O2 zwd@zwd-msft:~$ ./a.out inv_sqrt: 209601 us 0.5 rsqrt: 34340 us 0.499878
Google benchmarks: https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/6894788120/job/18757367848