Open innir opened 3 years ago
That would require rewriting all the manually vectorized code, because currently all CPU feature detection (some SSE4 and some FMA features) is done at compile time at the instruction level, and on x86_64 __SSE2__ is always defined.
For auto-vectorized code this approach would not give a speed boost, because the compiler only emits instructions for the target chosen at build time.
Users who want maximum speed should compile a native build themselves.
Well, from what I understand
#ifdef __SSE2__
/* some fancy SSE2 code here */
#else
/* some slow code here */
#endif
would become something like
#include "cpuinfo_x86.h"
using namespace CpuFeatures;
static const X86Features features = GetX86Info().features;
if (features.sse2) {
    /* some fancy SSE2 code here */
} else {
    /* some slow code here */
}
or do I just misunderstand the concept?
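For reference, here is a minimal sketch of what the runtime variant could look like without pulling in another library. It assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` builtin instead of cpu_features. The kernel (`scale`), its variants, and the `HAVE_X86_RUNTIME_DISPATCH` macro are made-up names for illustration, not anything from the rawtherapee code base.

```cpp
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_RUNTIME_DISPATCH 1  // hypothetical macro name, for this sketch only
#endif

// Signature shared by all variants of the (made-up) kernel.
using ScaleFn = void (*)(float* data, std::size_t n, float factor);

// Portable fallback that runs on any CPU.
static void scale_generic(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}

#ifdef HAVE_X86_RUNTIME_DISPATCH
// The target attribute lets this one function use AVX even though the rest
// of the file is built for the baseline instruction set.
__attribute__((target("avx")))
static void scale_avx(float* data, std::size_t n, float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), f));
    }
    for (; i < n; ++i) data[i] *= factor;  // scalar tail
}
#endif

// Picks the best variant the running CPU supports.
static ScaleFn pick_scale() {
#ifdef HAVE_X86_RUNTIME_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return scale_avx;
#endif
    return scale_generic;
}

// Public entry point: the CPU check runs once, on the first call.
void scale(float* data, std::size_t n, float factor) {
    static const ScaleFn impl = pick_scale();
    impl(data, n, factor);
}
```

Callers just use `scale(buffer, count, 2.0f)`. The check is done once per kernel rather than once per `#ifdef` block, so every hand-vectorized section would first have to be split out into its own function like this.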
And sure, this sleef stuff might be a little tricky - so not sure if it's worth it ...
The point is that for x86_64 builds __SSE2__ is always defined. Why waste time changing code for CPU detection?
True, SSE2 was just an example - for SSE4, AVX, ... it could actually make a difference. But if you don't like the idea, fine - feel free to close the ticket.
It could be useful if we ever need to squeeze out more speed in a more general way, but who would really benefit from that? Apart from the code that would have to be rewritten, it would also make us depend on another library, which may not be ideal either.
@innir
About SSE4: afaik most modern CPUs support SSE4, so shipping pre-compiled SSE4 builds in addition to the generic builds would be an option.
About AVX: there it gets complicated. For example, on my AMD FX8350, AVX code using all cores is slower than SSE code using all cores, because the 8-core FX8350 has 8 SSE units (one per core) but only 4 AVX units (one per 2 cores).
x265 is something like 70% assembly and detects all available extensions at runtime, then allows disabling whichever ones you want via command line options for edge cases like the horridly reduced speed of Skylake Xeons running AVX512 code. Encoding H.265 is far more processor intensive than processing raw files though, so I'd question its use here.
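A rough sketch of that "detect everything, then let the user switch things off" pattern, assuming GCC or Clang on x86. The flag constants, function names, and the idea of a disable list are all hypothetical; this is not how x265 implements it and not an existing rawtherapee option.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bit flags, one per instruction-set extension.
constexpr std::uint32_t kSSE41   = 1u << 0;
constexpr std::uint32_t kAVX     = 1u << 1;
constexpr std::uint32_t kAVX2    = 1u << 2;
constexpr std::uint32_t kAVX512F = 1u << 3;

// Asks the running CPU which extensions it supports.
// Returns 0 on compilers or architectures without the builtin.
std::uint32_t detect_flags() {
    std::uint32_t flags = 0;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))  flags |= kSSE41;
    if (__builtin_cpu_supports("avx"))     flags |= kAVX;
    if (__builtin_cpu_supports("avx2"))    flags |= kAVX2;
    if (__builtin_cpu_supports("avx512f")) flags |= kAVX512F;
#endif
    return flags;
}

// Clears every flag named in `disabled` (e.g. taken from a user setting).
// Unknown names are ignored; each name clears only its own flag.
std::uint32_t apply_overrides(std::uint32_t detected,
                              const std::vector<std::string>& disabled) {
    for (const std::string& name : disabled) {
        if (name == "sse4.1")       detected &= ~kSSE41;
        else if (name == "avx")     detected &= ~kAVX;
        else if (name == "avx2")    detected &= ~kAVX2;
        else if (name == "avx512f") detected &= ~kAVX512F;
    }
    return detected;
}
```

A dispatcher would then consult `apply_overrides(detect_flags(), user_disabled)` instead of the raw detection result, which covers cases like the FX8350 or Skylake Xeon ones where the widest extension is not the fastest.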
Hi,
I stumbled over https://github.com/google/cpu_features and thought it might be a good idea to replace (all) the #ifdef __CPUFEATURE__s with runtime checks. This could give a speed boost for users who run pre-compiled versions of rawtherapee on modern CPUs. I haven't checked whether it's actually feasible, but I would if people think it's worth a try.
Best Philip