Ivorforce closed 1 month ago
I would suggest using -O3 and -flto if these are not already enabled. I recommend against -ffast-math, since it enables aggressive optimizations that can sacrifice precision or even create bugs due to the finite-math-only assumption.
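To make the finite-math-only concern concrete, here is a minimal sketch of the kind of code it puts at risk. The helper nan_mean is hypothetical and not taken from the project. Built with default flags it skips NaN entries. With -ffinite-math-only the compiler is allowed to assume NaNs never occur, so the std::isnan check may be optimized away.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: mean of all non-NaN entries, or NaN if there are none.
double nan_mean(const std::vector<double> &values) {
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values) {
        // Under -ffinite-math-only the compiler may treat this check as always
        // false and drop the branch, letting NaNs leak into the sum.
        if (std::isnan(v)) continue;
        sum += v;
        ++count;
    }
    if (count == 0) return std::numeric_limits<double>::quiet_NaN();
    return sum / static_cast<double>(count);
}
```

Whether the check actually disappears depends on the compiler and version, so treat this as an illustration of the risk rather than a guaranteed reproduction.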
Another thing to look at is the supported architecture level. Enabling more modern instruction sets, for example with the -mavx2 and -mavx10.2 flags (for x86), allows for better optimizations but raises the minimum hardware requirements of the resulting binaries. This may require either compiling multiple binaries for different architectures and picking one on install, or a runtime architecture dispatch like so:
// The default
const char *my_func_fallback() { return "fallback"; }

// SSE2 variant
__attribute__((target("sse2"))) const char *my_func_sse2() { return "sse2"; }

// SSE4.2 variant
__attribute__((target("sse4.2"))) const char *my_func_sse4_2() { return "sse4.2"; }

// AVX2 variant
__attribute__((target("avx2"))) const char *my_func_avx2() { return "avx2"; }

// AVX10.2 variant
__attribute__((target("avx10.2"))) const char *my_func_avx10_2() { return "avx10.2"; }

using my_func_t = const char *(*)();

// Resolver: picks the best variant the running CPU supports, newest first.
extern "C" my_func_t resolve_my_func() noexcept
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx10.2"))
        return my_func_avx10_2;
    else if (__builtin_cpu_supports("avx2"))
        return my_func_avx2;
    else if (__builtin_cpu_supports("sse4.2"))
        return my_func_sse4_2;
    else if (__builtin_cpu_supports("sse2"))
        return my_func_sse2;
    else
        return my_func_fallback;
}

// Calls to my_func() are bound to whatever resolve_my_func() returns.
__attribute__((ifunc("resolve_my_func"))) const char *my_func();
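The ifunc attribute relies on toolchain and platform support, so it may not be usable on every target. As a hedged alternative, the same idea can be sketched with a plain function pointer that is resolved on first call. The names impl_fallback, impl_sse4_2, impl_avx2, pick_impl and my_func_portable are hypothetical, and the x86 variants are only compiled when GCC or Clang targets x86.

```cpp
#include <cstring>

// Hypothetical variants, mirroring the naming used above.
static const char *impl_fallback() { return "fallback"; }

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
__attribute__((target("sse4.2"))) static const char *impl_sse4_2() { return "sse4.2"; }
__attribute__((target("avx2"))) static const char *impl_avx2() { return "avx2"; }
#endif

using impl_t = const char *(*)();

// Picks the best variant the running CPU supports, newest first.
static impl_t pick_impl() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return impl_avx2;
    if (__builtin_cpu_supports("sse4.2")) return impl_sse4_2;
#endif
    return impl_fallback;
}

// Public entry point: resolves once (thread-safe since C++11), then forwards.
const char *my_func_portable() {
    static const impl_t impl = pick_impl();
    return impl();
}
```

The trade-off is one extra indirect call per invocation, which is usually negligible when the dispatched function does a meaningful amount of work.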
Thank you for the tips!
I looked into the flags and tested them. Here's what I found:
-O3, using the optimize=speed option.
(debug_symbols=false)
-flto reduces the binary size by a bit (~0.5 MB).
Regarding runtime SIMD checks, I'm afraid xsimd or xtensor would need to support this, and I don't think they do. From the docs:
Notice that this option prevents building on a machine and distributing the resulting binary on another machine with a different architecture (i.e. not supporting the same instruction set).
Finally, I also ran some benchmarks for -O2 vs -Os vs -O3. Let's use -O2 as the baseline:
-Os increases runtimes of functions by 2% to 25%. It decreases the binary size by ~35% (from 24 MB to 15.6 MB on macOS).
-O3 decreases runtimes of functions by -3% to 5% (that is, anywhere from 3% slower to 5% faster). It decreases the binary size by 9% (from 24 MB to 22 MB on macOS).
Considering the platforms' use cases, let's default to -O3 for downloadable mediums, and to -Os for web. Custom builds can of course always adjust these defaults.
We should look into what flags we can pass to the compiler to speed up operations.
-ffast-math is a good starting point. However, it enables -ffinite-math-only, which doesn't make sense for us.
I don't really have a background with this kind of stuff, so I'm happy to give the issue away to someone who does. The only other thing I read is that for releases, we probably want -O. Though that may be passed for the release target in SConstruct anyway.
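If NaN and infinity semantics matter to the code, one hedged option is to fail the build when finite-math-only is switched on by accident. GCC and Clang predefine the macro __FINITE_MATH_ONLY__, and the sketch below uses it as a guard. The helper is_missing is hypothetical.

```cpp
#include <cmath>

// GCC and Clang define __FINITE_MATH_ONLY__ as 1 when -ffinite-math-only
// (implied by -ffast-math) is active, and as 0 otherwise.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
#error "This code relies on NaN/Inf semantics; do not build with -ffinite-math-only or -ffast-math."
#endif

// Hypothetical helper whose result is only trustworthy when the guard above passes.
inline bool is_missing(double value) { return std::isnan(value); }
```

Other compilers may not define this macro, in which case the guard simply does nothing.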