Ivorforce / NumDot

Tensor math and scientific computation for the Godot game engine.
https://numdot.readthedocs.io
MIT License

Look into optimization compiler flags #56

Closed Ivorforce closed 1 month ago

Ivorforce commented 1 month ago

We should look into what flags we can pass to the compiler to speed up operations.

-ffast-math is a good place to start looking. However, it also enables -ffinite-math-only, which doesn't make sense for us.

I don't really have a background in this kind of thing, so I'm happy to hand the issue over to someone who does. The only other thing I've read is that for releases we probably want an -O optimization flag, though that may already be passed for the release target in SConstruct.

switch-blade-stuff commented 1 month ago

I would suggest using -O3 and -flto if these are not already enabled. I recommend against -ffast-math, since it enables aggressive optimizations that can sacrifice precision or even introduce bugs due to the finite-math-only assumption.
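To illustrate the finite-math-only concern, here is a minimal sketch of the kind of code that can break. The helper `mean_ignoring_nan` is hypothetical and not part of NumDot. It relies on `std::isnan`, which a compiler is allowed to treat as always false once -ffinite-math-only (implied by -ffast-math) is in effect.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from NumDot): mean of all non-NaN entries.
// Returns NaN when there is no valid entry at all.
double mean_ignoring_nan(const std::vector<double> &values) {
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values) {
        // Under -ffinite-math-only the compiler may assume this is never true
        // and drop the check, so NaNs would leak into the sum.
        if (std::isnan(v)) continue;
        sum += v;
        ++count;
    }
    return count == 0 ? std::numeric_limits<double>::quiet_NaN()
                      : sum / static_cast<double>(count);
}
```

Built without -ffast-math, `mean_ignoring_nan({1.0, NaN, 3.0})` skips the NaN entry. With -ffast-math the result is no longer guaranteed, which is the kind of silent bug described above.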

switch-blade-stuff commented 1 month ago

Another thing to look at is the supported architecture level. Enabling more modern instruction sets, for example with the -mavx2 and -mavx10.2 flags on x86, allows better optimizations but raises the minimum hardware requirements of the resulting binaries. That means either compiling multiple binaries for different architectures and picking one at install time, or using runtime architecture dispatch like so:

// The default
const char *my_func_fallback() { return "fallback"; }
// SSE2 variant
__attribute__((target("sse2"))) const char *my_func_sse2() { return "sse2"; }
// SSE4.2 variant
__attribute__((target("sse4.2"))) const char *my_func_sse4_2() { return "sse4.2"; }
// AVX2 variant
__attribute__((target("avx2"))) const char *my_func_avx2() { return "avx2"; }
// AVX10.2 variant
__attribute__((target("avx10.2"))) const char *my_func_avx10_2() { return "avx10.2"; }

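using my_func_t = const char *(*)();
// Resolver: called by the dynamic loader when my_func is resolved; it returns
// the best variant for the current CPU. It is extern "C" so that its unmangled
// name matches the string passed to the ifunc attribute below.
extern "C" my_func_t resolve_my_func() noexcept
{
    // Required before __builtin_cpu_supports when running inside an ifunc resolver.
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx10.2"))
        return my_func_avx10_2;
    else if (__builtin_cpu_supports("avx2"))
        return my_func_avx2;
    else if (__builtin_cpu_supports("sse4.2"))
        return my_func_sse4_2;
    else if (__builtin_cpu_supports("sse2"))
        return my_func_sse2;
    else
        return my_func_fallback;
}
// GNU ifunc: calls to my_func() are bound to whatever the resolver returned.
__attribute__ ((ifunc("resolve_my_func"))) const char *my_func();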
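
Note that the ifunc attribute is a GNU extension aimed at ELF targets such as Linux, so it may not be usable on every platform Godot ships to. As a rough sketch of a more portable variant, the same selection can be done once through a plain function pointer. All names here (`kernel_fallback`, `pick_kernel`, `my_kernel`, `HAVE_X86_DISPATCH`) are made up for illustration, and the CPU checks still assume GCC or Clang on x86; everything else falls back to the default.

```cpp
#include <cassert>

// Hypothetical kernels; each variant may be compiled for a different ISA.
static const char *kernel_fallback() { return "fallback"; }

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
__attribute__((target("sse4.2"))) static const char *kernel_sse4_2() { return "sse4.2"; }
__attribute__((target("avx2"))) static const char *kernel_avx2() { return "avx2"; }
#else
#define HAVE_X86_DISPATCH 0
#endif

using kernel_t = const char *(*)();

// Picks the best available variant for the CPU we are running on.
static kernel_t pick_kernel() noexcept {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return kernel_avx2;
    if (__builtin_cpu_supports("sse4.2")) return kernel_sse4_2;
#endif
    return kernel_fallback;
}

// Public entry point: the selection runs once, on first call.
// Initialization of a function-local static is thread-safe since C++11.
const char *my_kernel() {
    static const kernel_t impl = pick_kernel();
    assert(impl != nullptr);
    return impl();
}
```

The trade-off compared to ifunc is one extra indirect call per invocation, which should be negligible for kernels that process whole buffers.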

Ivorforce commented 1 month ago

Thank you for the tips!

I looked into the flags and tested them. Here's what I found:

Regarding runtime SIMD checks, I'm afraid xsimd or xtensor would need to support this, and I don't think they do. From the docs:

"Notice that this option prevents building on a machine and distributing the resulting binary on another machine with a different architecture (i.e. not supporting the same instruction set)."

Finally, I also ran some benchmarks for -O2 vs -Os vs -O3. Let's use -O2 as the baseline:

Considering each platform's use case, let's default to -O3 for downloadable builds and to -Os for web. Custom builds can of course always adjust these defaults.
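
For anyone who wants to repeat the -O2 / -Os / -O3 comparison on their own machine, a minimal timing harness could look like the sketch below. It is not the benchmark used above; the workload `multiply_add` and the helper `time_multiply_add` are hypothetical stand-ins. The idea is to compile the same file once per optimization level and compare the reported time per run.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical workload: dot product of two equally sized buffers.
double multiply_add(const std::vector<double> &a, const std::vector<double> &b) {
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

struct BenchResult {
    double checksum;         // returned so the work cannot be discarded outright
    double seconds_per_run;  // average wall-clock time of one multiply_add call
};

// Runs the workload `repeats` times on buffers of length n and averages the time.
BenchResult time_multiply_add(std::size_t n, int repeats) {
    assert(repeats > 0);
    const std::vector<double> a(n, 1.5), b(n, 2.0);
    double checksum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) checksum += multiply_add(a, b);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {checksum, elapsed.count() / repeats};
}
```

A harness this small is easy for the optimizer to partially see through, so the numbers are only a rough guide. A dedicated benchmarking library and realistic tensor sizes would give more trustworthy results.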