VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Option to enable SSE2 optimization #4667

Open bassmang opened 10 months ago

bassmang commented 10 months ago

1.f / std::sqrt(x) has become the default for inverse square root in gd. Tracking issue to add flag to enable sse2 optimization here: #4666

Min repro:

```cpp
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <immintrin.h>
#include <iostream>

inline float inv_sqrt(const float x) { return 1.f / std::sqrt(x); }

inline float rsqrt(const float f)
{
  __m128 temp = _mm_set_ss(f);
  temp = _mm_rsqrt_ss(temp);
  return _mm_cvtss_f32(temp);
}

int main(int argc, char* argv[])
{
  constexpr size_t num_repeats = 100000000;
  volatile float in = 4.f;
  volatile float out;

  {
    const auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i) { out = inv_sqrt(in); }
    const auto stop = std::chrono::high_resolution_clock::now();

    const auto diff =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "inv_sqrt: " << diff.count() << " us\n";
    std::cout << out << std::endl;
  }

  {
    const auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i) { out = rsqrt(in); }
    const auto stop = std::chrono::high_resolution_clock::now();

    const auto diff =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "rsqrt: " << diff.count() << " us\n";
    std::cout << out << std::endl;
  }

  return EXIT_SUCCESS;
}
```

On my laptop, the results are:

```
zwd@zwd-msft:~$ g++ rsqrt.cc -O2
zwd@zwd-msft:~$ ./a.out
inv_sqrt: 209601 us
0.5
rsqrt: 34340 us
0.499878
```

Google benchmarks: https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/6894788120/job/18757367848

zwd-ms commented 10 months ago

A Google Benchmark harness should show similar results, but care is needed to keep the compiler from optimizing the measured work away.

In general, however, we could just rely on compiler auto-vectorization to optimize this. Building with -O3 -ffast-math -march=native gets much closer to the hand-written intrinsic's performance.

```cpp
#include <benchmark/benchmark.h>

// inv_sqrt and rsqrt as defined in the repro above.

static void BM_inv_sqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(inv_sqrt(x));
    benchmark::ClobberMemory();
  }
}

static void BM_rsqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(rsqrt(x));
    benchmark::ClobberMemory();
  }
}

BENCHMARK(BM_inv_sqrt);
BENCHMARK(BM_rsqrt);

BENCHMARK_MAIN();
```
zwd-ms commented 10 months ago

From the GCC docs: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options

> `-mrecip`
>
> This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -ffinite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).
>
> Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip.
>
> Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip.