Benchmark AVX2 instructions in the render cache function

kcgen commented 4 years ago

https://github.com/joncampbell123/dosbox-x/pull/1880 by @linkmauve.

Discussion and performance improvement in the link.

dreamer commented 4 years ago

For a moment I thought someone wants to force AVX2 - that would require cutting of 20% of users and would be an instant "nope" from me, but that's not the case :)

As for this PR specifically - yeah I would love to see it ported to dosbox-staging to do performance testing, but I doubt it yields the performance gains.

I doubt it will provide speed boost, because the loop in question is most likely already auto-vectorized by C++ compiler.
I did a similar experiment some time ago (testing Carmageddon and Quake), the performance gain was so close to 0, that I couldn't decide if it's really there (I think exactly because of auto-vectorization). But I might've botched something, so who knows?
The proper solution would utilize SDL2 (e.g. SDL_HasAVX2 to detect available instruction sets at runtime and dispatch to the functions compiled with GCC target attribute or Intel's _allow_cpu_features intrinsic or Intel's cpu_dispatch declspec (no idea how to do this on MSVC; I read somewhere, that clang supports Intel's intrinsics, but I'm not sure). edit Or Rust's built-in target_feature option.

So, the feature: yes, but not in simplistic way being submitted to DOSBox-X (which is manually doing the same job, that compiler does automatically already during auto-vectorization).

Based on flame graphs I did some time ago, I say it could result in max ~1-2% of performance uptick (but an uptick nonetheless) - that's what I was pursuing when testing with Carmageddon and Quake, but I simply couldn't measure any difference.

kcgen commented 4 years ago

I doubt it will provide speed boost, because the loop in question is most likely already auto-vectorized by C++ compiler.

I thought the same here, and author responded:

https://github.com/joncampbell123/dosbox-x/issues/1883#issuecomment-699518260

dreamer commented 4 years ago

This reduces runtime usage of this function from 8.3% to 6.2% overall, when displaying the DOS prompt on my Kaby Lake.

Author says he tested it with perf(1) - which is the same tool I used to generate flamegraphs when testing Quake and Carmageddon. DOS prompt utilizes almost no CPU time - our runtime execution time is spent ~85% in CPU emulation code. That's not saying micro-optimizations are not welcome, just that it might be really, really not worth it… (we should do GCC report to see if this loop is already auto-vectorized or not - there's no reason to speculate about it).

As for micro-benchmark tools, I heard good things about google/benchmark, but we don't have it integrated with dosbox-staging buildsystem yet. And doing micro-benchmarks is not as easy as it seems.

Anyway, if author is interested in discussing it more, then sending PR to our project would be appreciated so we could do some measurements and start working on runtime dispatch method.

kcgen commented 4 years ago

we should do GCC report to see if this loop is already auto-vectorized or not - there's no reason to speculate about it

Attached is a report from a release build plus -fopt-info-all -ftree-vectorizer-verbose=6, which includes both optimized and non-optimized code. (-t optinfo target with the build script).

File is ZStandard compressed, but with .txt extension to pacify github.. unpacks to 42x the size.

opt.zst.txt

I unfortunately was unable to figure this out; possibly too many levels of inlining maybe has eliminated the need to carry this function as a distinct entity?

kcgen commented 3 years ago

Closing.

Torinde commented 4 months ago

Related: Project card, planned features - Added by dreamer

Write documentation about performance tuning: how to create flame graphs and icicle graphs.

dosbox-staging / dosbox-staging

Benchmark AVX2 instructions in the render cache function #627