Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

Detect CPU features at runtime instead of at compile time #6028

Open innir opened 3 years ago

innir commented 3 years ago

Hi,

I stumbled across https://github.com/google/cpu_features and thought it might be a good idea to replace (all) the #ifdef __CPUFEATURE__ checks with runtime checks. This could give a speed boost for users who run pre-compiled versions of RawTherapee on modern CPUs.

I haven't checked whether it's actually feasible, but I would if people think it's worth a try.

Best Philip

heckflosse commented 3 years ago

That would require rewriting all manually vectorized code, because currently all CPU feature detection (some SSE4 and some FMA features) is done at compile time, at the level of individual instructions, and on x86_64 __SSE2__ is always defined. For auto-vectorized code this approach would not give a speed boost. Users who want maximum speed should compile a native build themselves.
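To illustrate what "compile time, at instruction level" can look like, here is a minimal sketch. It is not taken from the RawTherapee sources: the function name `mulAdd` and its layout are hypothetical, and it assumes a GCC or Clang style compiler that defines `__SSE2__` and `__FMA__` from its target flags.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Hypothetical kernel: out[i] = a[i] * b[i] + c[i].
// Which instructions end up in the binary is fixed when it is compiled.
static void mulAdd(const float* a, const float* b, const float* c,
                   float* out, std::size_t n) {
    assert(a && b && c && out);
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        const __m128 vc = _mm_loadu_ps(c + i);
#if defined(__FMA__)
        // Only compiled in when the build targets FMA (e.g. -mfma or -march=native).
        const __m128 r = _mm_fmadd_ps(va, vb, vc);
#else
        // Generic x86_64 build: separate multiply and add.
        const __m128 r = _mm_add_ps(_mm_mul_ps(va, vb), vc);
#endif
        _mm_storeu_ps(out + i, r);
    }
#endif
    // Scalar tail, and the whole loop on targets without SSE2.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

Because the choice is made per instruction inside the loop body, switching to a runtime check would mean compiling each such kernel in several variants and selecting between whole functions, not just swapping the `#ifdef` for an `if`.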

innir commented 3 years ago

Well, from what I understand

#ifdef __SSE2__
  /* some fancy SSE2 code here */
#else
  /* some slow code here */
#endif

would become something like

#include "cpuinfo_x86.h"
using namespace cpu_features;  // the library's C++ namespace is cpu_features
static const X86Features features = GetX86Info().features;

if (features.sse2) {
  /* some fancy SSE2 code here */
} else {
  /* some slow code here */
}

or do I just misunderstand the concept?

And sure, this sleef stuff might be a little tricky - so not sure if it's worth it ...
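One detail the `if (features.sse2)` sketch above leaves open is that the fast branch has to be compiled with the newer instructions enabled while the rest of the binary stays generic. A rough sketch of one way to do that with GCC or Clang follows. It uses the compiler builtin `__builtin_cpu_supports` as a stand-in for cpu_features so that it stays self-contained, and the names `maxScalar`, `maxSse41`, `pickMax` and `maxValue` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Portable fallback. Requires n >= 1.
static std::int32_t maxScalar(const std::int32_t* v, std::size_t n) {
    assert(n > 0);
    std::int32_t m = v[0];
    for (std::size_t i = 1; i < n; ++i) m = std::max(m, v[i]);
    return m;
}

#ifdef HAVE_X86_DISPATCH
// Only this function is compiled with SSE4.1 enabled.
__attribute__((target("sse4.1")))
static std::int32_t maxSse41(const std::int32_t* v, std::size_t n) {
    if (n < 4) return maxScalar(v, n);
    __m128i acc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    std::size_t i = 4;
    for (; i + 4 <= n; i += 4) {
        const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v + i));
        acc = _mm_max_epi32(acc, x);  // SSE4.1 instruction
    }
    alignas(16) std::int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    std::int32_t m = std::max(std::max(lanes[0], lanes[1]),
                              std::max(lanes[2], lanes[3]));
    for (; i < n; ++i) m = std::max(m, v[i]);  // leftover elements
    return m;
}
#endif

using MaxFn = std::int32_t (*)(const std::int32_t*, std::size_t);

// Ask the CPU once, at runtime, which variant it can execute.
static MaxFn pickMax() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1")) return maxSse41;
#endif
    return maxScalar;
}

static std::int32_t maxValue(const std::int32_t* v, std::size_t n) {
    static const MaxFn fn = pickMax();  // resolved on first call
    return fn(v, n);
}
```

The dispatch happens once per process and per kernel, so the check itself costs little; the real effort is in maintaining one variant of every hand-vectorized function per instruction set.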

heckflosse commented 3 years ago

> Well, from what I understand

The point is that for x86_64 builds __SSE2__ is always defined. Why waste time changing the code for CPU detection?

innir commented 3 years ago

> The point is that for x86_64 builds __SSE2__ is always defined. Why waste time changing the code for CPU detection?

True, SSE2 was just an example - for SSE4, AVX, ... it could actually make a difference. But if you don't like the idea, fine - feel free to close the ticket.

Thanatomanic commented 3 years ago

It could be useful if we ever need to squeeze out more speed in a more general way, but who would really benefit from that? Apart from the code that would have to be rewritten, it would also make us depend on another library, which may not be ideal either.

heckflosse commented 3 years ago

@innir

About SSE4: afaik most modern CPUs support SSE4. So providing pre-compiled SSE4 versions in addition to the generic versions would be an option.

About AVX: there it gets complicated. For example, on my AMD FX8350, using all cores with AVX code is slower than using all cores with SSE code, because the 8-core FX8350 has 8 SSE units (one per core) but only 4 AVX units (one per 2 cores).

ghost commented 2 years ago

x265 is something like 70% assembly and detects all available extensions at runtime, then allows disabling whichever ones you want via command line options for edge cases like the horridly reduced speed of Skylake Xeons running AVX512 code. Encoding H.265 is far more processor intensive than processing raw files though, so I'd question its use here.
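The "detect everything, then let the user switch extensions off" idea can be sketched roughly as below. This is not x265's actual code or option syntax. The flag names, `detectCpuFlags` and `selectLevel` are hypothetical, and detection again relies on the GCC/Clang builtin `__builtin_cpu_supports`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature bits, lowest to highest.
enum CpuFlag : std::uint32_t {
    kSse2  = 1u << 0,
    kSse41 = 1u << 1,
    kAvx   = 1u << 2,
    kAvx2  = 1u << 3
};
constexpr std::uint32_t kAllFlags = kSse2 | kSse41 | kAvx | kAvx2;

enum class SimdLevel { Scalar, Sse2, Sse41, Avx, Avx2 };

// Runtime detection; returns 0 on compilers or targets without the builtin.
static std::uint32_t detectCpuFlags() {
    std::uint32_t flags = 0;
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))   flags |= kSse2;
    if (__builtin_cpu_supports("sse4.1")) flags |= kSse41;
    if (__builtin_cpu_supports("avx"))    flags |= kAvx;
    if (__builtin_cpu_supports("avx2"))   flags |= kAvx2;
#endif
    return flags;
}

// Pick the highest level that the CPU reports and the user has not disabled.
static SimdLevel selectLevel(std::uint32_t detected, std::uint32_t disabledByUser) {
    assert((disabledByUser & ~kAllFlags) == 0);  // only known flags may be disabled
    const std::uint32_t usable = detected & ~disabledByUser;
    if (usable & kAvx2)  return SimdLevel::Avx2;
    if (usable & kAvx)   return SimdLevel::Avx;
    if (usable & kSse41) return SimdLevel::Sse41;
    if (usable & kSse2)  return SimdLevel::Sse2;
    return SimdLevel::Scalar;
}
```

With a mask like this, a user on a CPU where AVX runs slower than SSE, such as the FX8350 case mentioned earlier in the thread, could call `selectLevel(detectCpuFlags(), kAvx | kAvx2)` and fall back to the SSE path without needing a different build.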