Performance regression with default Meson build settings

libcg commented 6 years ago

It looks like the default build settings of Meson for the Release variant are causing a performance regression since commit b6e40bc that enabled unity builds. The combination of unity or LTO with -O3 reduces the framerate in Trackmania by about 20% compared to any other combination (-O3/unity off, -O2/unity off, -O2/unity on).

Here are some benchmarks when building DXVK using package-release.sh:

                -O3         -O2         info
trackmania      80.4        102.8       900p, maxed out
heaven          104.1       107.9       1080p, AA off, tess off
valley          76          76.3        1080p, AA off
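
For reference, the relative change between the two columns can be computed with a small helper. This is only a sketch: `pct_change` is a made-up name, awk is assumed to be installed, and the inputs are the figures from the table above with -O2 as the baseline.

```shell
#!/bin/sh
# pct_change BASE CANDIDATE: relative change of CANDIDATE versus BASE, in percent.
# Hypothetical helper, not part of DXVK or Meson.
pct_change() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.1f\n", (cand - base) / base * 100 }'
}

# -O2 is the baseline, -O3 the candidate (values copied from the table).
echo "trackmania: $(pct_change 102.8 80.4)%"
echo "heaven:     $(pct_change 107.9 104.1)%"
echo "valley:     $(pct_change 76.3 76)%"
```

With these inputs Trackmania comes out roughly 22% lower with -O3, while Heaven and Valley stay within a few percent.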

Until meson is using saner defaults (issue tracked here https://github.com/mesonbuild/meson/issues/3566), we should override them and build with -O2 to avoid the performance penalty. ~-Wl,-O1 is also passed to the linker by default, which doesn't look optimal.~

System information

GPU: R9 Fury
Driver: mesa-git 0748383a
Wine version: wine-staging 3.8
DXVK version: 7f619d9
Compiler: GCC 8.1.0

ssorgatem commented 6 years ago

Relevant discussion: https://stackoverflow.com/questions/11546075/is-optimisation-level-o3-dangerous-in-g

From there:

In my somewhat checkered experience, applying -O3 to an entire program almost always makes it slower (relative to -O2), because it turns on aggressive loop unrolling and inlining that make the program no longer fit in the instruction cache. For larger programs, this can also be true for -O2 relative to -Os!

The intended use pattern for -O3 is, after profiling your program, you manually apply it to a small handful of files containing critical inner loops that actually benefit from these aggressive space-for-speed tradeoffs. With very recent GCC, I think the shiny new link-time profile-guided optimization mode can selectively apply the -O3 optimizations to hot functions -- effectively automating this process.

Also:

-O3 and especially additional flags like -funroll-loops (not enabled by -O3) can sometimes lead to more machine code being generated. Under certain circumstances (e.g. on a cpu with exceptionally small L1 instruction cache) this can cause a slowdown due to all the code of e.g. some inner loop now not fitting anymore into L1I. Generally gcc tries quite hard to not to generate so much code, but since it usually optimizes the generic case, this can happen. Options especially prone to this (like loop unrolling) are normally not included in -O3 and are marked accordingly in the manpage. As such it is generally a good idea to use -O3 for generating fast code, and only fall back to -O2 or -Os (which tries to optimize for code size) when appropriate (e.g. when a profiler indicates L1I misses).

If you want to take optimization into the extreme, you can tweak in gcc via --param the costs associated with certain optimizations. Additionally note that gcc now has the ability to put attributes at functions that control optimization settings just for these functions, so when you find you have a problem with -O3 in one function (or want to try out special flags for just that function), you don't need to compile the whole file or even whole project with O2.

ssorgatem commented 6 years ago

Maybe it could default to the "plain" build type, instead of "release"? Can you test building with "plain" build type to see how it compares?

pchome commented 6 years ago

@libcg

> -Wl,-O1 is also passed to the linker by default, which doesn't look optimal.

man ld :

-O level
           If level is a numeric values greater than zero ld optimizes the output.  This might take significantly longer and therefore
           probably should only be enabled for the final binary.  At the moment this option only affects ELF shared library generation.
           Future releases of the linker may make more use of this option.  Also currently there is no difference in the linker's
           behaviour for different non-zero values of this option.  Again this may change with future releases.

You can use -march=native to enable architecture-specific optimizations on your side; releases on GitHub should stay generic.

It also depends on the GCC version: recent GCC releases are more stable with -O3 enabled than they were 5 years ago, when the discussion mentioned by @ssorgatem took place. In other words, you should provide more information (uname -a / gcc -v / env | grep FLAGS / flags from your build.ninja or compile_commands.json / test against current github release / ...) to be sure this regression is caused by -O3.

pchome commented 6 years ago

@libcg

> The combination of unity or LTO with -O3

LTO doesn't mean faster. It can be faster on projects with a lot of duplicate code and modules (for example), but it is also known to slow down some projects. You also need to set it up properly, e.g. pass the compiler flags to both C*FLAGS and LDFLAGS, otherwise you'll get unoptimized binaries.

I'm not sure Meson's LTO/PGO options work as expected, so it's better to control this manually by creating a build-win64-lto.txt that contains something like:

[properties]
c_args = ['-march=native', '-O3', '-flto', '-fuse-linker-plugin']
c_link_args = ['-march=native', '-O3', '-flto', '-fuse-linker-plugin', '-static', '-static-libgcc']

cpp_args = ['-march=native', '-O3', '-flto', '-fuse-linker-plugin']
cpp_link_args = ['-march=native', '-O3', '-flto', '-fuse-linker-plugin', '-static', '-static-libgcc', '-static-libstdc++']

This should also work with, e.g.:

export CFLAGS="-march=native -O3 -flto -fuse-linker-plugin -pipe"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-Wl,--as-needed  -Wl,-O1 ${CFLAGS}"

libcg commented 6 years ago

@pchome

This isn't a discussion about squeezing the last bits of performance using build flags, but rather about the perf regression when combining -O3 and whole program optimizations (whether unity build or LTO).

I know that LTO is not a magic bullet, and I could confirm it during my testing. I didn't see any meaningful performance difference in combination with -O2, but the regression was there with -O3. The best performing build I've seen was -O2 + unity, but it was within margin of error from the rest.

I added compiler info in the first post. Thanks for finding that linker doc, it looks like -Wl,-O1 is a sane default.

I'd also like to point out that all testing was done with 32-bit games; 64-bit builds may not be affected.

pchome commented 6 years ago

@libcg

> This isn't a discussion about squeezing the last bits of performance using build flags

So I'm asking you to provide more info, to be sure the problem is not on your side.

I'm NOT asking you to build it optimized, but rather to NOT use ANY additional compile flags in your test. But if you do, DO IT RIGHT (that's what the provided information is for)!

SveSop commented 6 years ago

@libcg I am not really able to replicate the issue here tbh. Did some testing with https://github.com/doitsujin/dxvk/commit/fb11acbc9139ea2567a4b60ff1ea5ff9330c9e8c, and some flag tips from @pchome.

Build system:
Ubuntu 18.04
Meson: 0.45.1
Ninja: 1.8.2
Mingw-w64: 7.3.0-11ubuntu1+20.2build1
gcc/g++: 8.0.1
All tests done with unity enabled

Default meson build with no extra flags:

Valley: 88.3 (3693) 35.7/162.7
Heaven: 82.2 (2072) 29.6/156.8

Build with:

export CFLAGS="-march=native -O2 -pipe"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-Wl,--as-needed -Wl,-O1 ${CFLAGS}"

For build-win32.txt:

[properties]
c_args = ['-march=native', '-O2']
c_link_args = ['-march=native', '-O2', '-static', '-static-libgcc']
cpp_args = ['-march=native', '-O2']
cpp_link_args = ['-march=native', '-O2', '-static', '-static-libgcc', '-static-libstdc++', '-Wl,--add-stdcall-alias,--enable-stdcall-fixup']

For build-win64.txt

[properties]
c_args = ['-march=native', '-O2']
c_link_args = ['-march=native', '-O2', '-static', '-static-libgcc']
cpp_args = ['-march=native', '-O2']
cpp_link_args = ['-march=native', '-O2', '-static', '-static-libgcc', '-static-libstdc++']

Valley: 88.1 (3686) 35.6/161.0
Heaven: 82.1 (2067) 29.7/155.8

Same as above, but with -O2 changed to -O3.

Valley: 88.2 (3688) 35.7/162.4
Heaven: 82.0 (2065) 29.6/155.8

I might have done something wrong tho, but tried my best :) Let me know if other stuff should have been tested tho?
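
To make the comparison explicit, the spread between the three builds can be expressed as a percentage of the best result. A minimal sketch, assuming awk is available; `spread_pct` is a hypothetical helper, and the inputs are the first number of each result line above.

```shell
#!/bin/sh
# spread_pct V1 V2 ...: (max - min) / max across the given results, in percent.
# Hypothetical helper, not part of any benchmark tool.
spread_pct() {
    printf '%s\n' "$@" | awk '
        NR == 1 || $1 > max { max = $1 }
        NR == 1 || $1 < min { min = $1 }
        END { printf "%.2f\n", (max - min) / max * 100 }'
}

# Order: default Meson build, -O2 build, -O3 build.
echo "Valley: $(spread_pct 88.3 88.1 88.2)%"
echo "Heaven: $(spread_pct 82.2 82.1 82.0)%"
```

Both benchmarks end up within about a quarter of a percent across the three builds.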

libcg commented 6 years ago

@SveSop Can you test with and without add_project_arguments('-O2', language : 'cpp') added in the else block at line 10 of meson.build? That's all you should need to change. Then build using the package-release.sh script. Testing at lower resolutions should help highlight the CPU bottleneck.

pchome commented 6 years ago

OK, let me clarify a few points:

I mentioned -march=native because the L1 cache was mentioned. Compare $ /usr/bin/x86_64-w64-mingw32-gcc -E -v - </dev/null 2>&1 | grep cc1 vs $ /usr/bin/x86_64-w64-mingw32-gcc -march=native -E -v - </dev/null 2>&1 | grep cc1

The second variant passes params like --param l1-cache-size=64, so (I think) GCC can take the cache size into account if the mentioned problems occur with more aggressive optimizations.

Yes, Mingw-w64: 7.3.0-11ubuntu1+20.2build1 is used for the build, not the system gcc/g++: 8.0.1. I meant the MinGW GCC when asking for gcc -v. /usr/bin/x86_64-w64-mingw32-gcc -v should show detailed information about the MinGW GCC used for the build, which sometimes matters, but let's forget it for now.

But the version matters, because some flags changed for -O2/-O3 in 8.0.1 compared to 7.3.0. There can also be regressions in the new 8.0.1, while 7.3.0 can be OK with the same flags.

Let's assume Meson's LTO and PGO options don't work as expected: depending on the system, they may require additional configuration and/or produce unexpected results ( https://github.com/InBetweenNames/gentooLTO#a-note-about-the-gcc-lto-plugin for reference ). By "unexpected results" I mean that they can produce wrongly optimized binaries even if everything compiles, so assume such builds are wrong unless one can prove they aren't.

Unity builds ( http://mesonbuild.com/Unity-builds.html#unity-builds ):

> Unity builds can also lead to faster code, because the compiler can do more aggressive optimizations (e.g. inlining).

It looks like -finline-functions is the only "*inline*" difference between -O2 and -O3.

For more information on which flags are applied to your build, you can use:
```
cd /tmp
touch empty.c && /usr/bin/x86_64-w64-mingw32-gcc -O2 -S -fverbose-asm empty.c && cat empty.s | less
touch empty.c && /usr/bin/x86_64-w64-mingw32-gcc -O3 -S -fverbose-asm empty.c && cat empty.s | less
```
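
GCC can also report the optimizer flags directly through -Q --help=optimizers, which avoids reading the assembly header. The sketch below dumps the flag states at -O2 and -O3 and prints those that differ; `flag_diff` and the `MINGW_CC` variable are names invented for this example, and the compiler path is the one used in the commands above.

```shell
#!/bin/sh
# flag_diff FILE_A FILE_B: print flags whose state differs between two
# `gcc -Q --help=optimizers` dumps. Hypothetical helper name.
flag_diff() {
    awk 'NR == FNR { state[$1] = $2; next }
         ($1 in state) && state[$1] != $2 { print $1, state[$1], "->", $2 }' "$1" "$2"
}

# Compiler path taken from the commands above; set MINGW_CC to use another one.
MINGW_CC="${MINGW_CC:-/usr/bin/x86_64-w64-mingw32-gcc}"
if command -v "$MINGW_CC" >/dev/null 2>&1; then
    tmp="$(mktemp -d)"
    "$MINGW_CC" -O2 -Q --help=optimizers > "$tmp/O2.txt"
    "$MINGW_CC" -O3 -Q --help=optimizers > "$tmp/O3.txt"
    flag_diff "$tmp/O2.txt" "$tmp/O3.txt"
    rm -rf "$tmp"
fi
```

Each output line names a flag followed by its state at -O2 and at -O3, so the list depends on the GCC version in use.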

pchome commented 6 years ago

I did some testing on my own, and I can't say there is a 20% difference between the different option combinations.

But I noticed that Meson ignores the *FLAGS env vars in favour of the cross-file's c_args and cpp_args, which are empty.

If there is no big difference between -O2 and -O3 in tests, the best decision would be to bring the compiler options back to the cross-files, e.g.

[properties]
; -falign-functions (-O2 and higher) required for Overwatch
; -O3 can cause 20% fps-drop for Trackmania
c_args = ['-O2']
c_link_args = ['-Wl,-O1', ...]

and set buildtype=plain.

Those who want to play with optimization levels can create their own cross-files in $XDG_DATA_HOME/meson/cross ( ~/.local/share/meson/cross ) ( http://mesonbuild.com/Cross-compilation.html#cross-file-locations ), e.g. dxvk-win64-appname

[properties]
c_args = ['-march=native', '-O3', ... ]
c_link_args = ['-Wl,-O1', ...]

SveSop commented 6 years ago

/usr/bin/x86_64-w64-mingw32-gcc -v

Target: x86_64-w64-mingw32
Thread model: posix
gcc version 7.3-posix 20180312 (GCC)

Don't really have too much time for testing compiler options atm tho.

ssorgatem commented 6 years ago

Meson is (correctly) ignoring the CFLAGS environment variables because it's cross-building, and in such cases the cross-build and host environments should be isolated from each other (so the cross-building doesn't pick up any host flags that may not be applicable to the target, and the host environment doesn't get polluted by foreign flags).

So the only way to change the flags used is either adding them as global project parameters or specifying them in the cross-build file, which exists specifically for that.
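
Since environment flags are ignored in a cross build, one way to confirm which optimization level actually reached the compiler is to count the -O flags in the compile_commands.json that Meson writes to the build directory. A rough sketch: `opt_levels` is a hypothetical helper, and the `build` directory name is a placeholder for wherever the build was configured.

```shell
#!/bin/sh
# opt_levels FILE: count each -O0..-O3/-Os/-Og flag found in a compile_commands.json.
# Hypothetical helper; -Ofast is not matched by this simple pattern.
opt_levels() {
    grep -o -- '-O[0-3sg]' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# "build" is a placeholder; set BUILD_DIR to the real build directory.
db="${BUILD_DIR:-build}/compile_commands.json"
if [ -f "$db" ]; then
    opt_levels "$db"
fi
```

If a command line carries more than one -O flag, GCC applies the last one, so the counts alone do not tell which level won.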

jarrard commented 6 years ago

Tried this. Maybe Fallout4 gets a slight boost, but I'm not noticing much change with my medium+4k settings in KCD, still 35-38 fps. I think I get about 60fps or so under Windows. KCD has a crash bug atm at start, so until that's sorted there isn't much point worrying about it.

add_project_arguments('-O2', language : 'cpp')

doitsujin / dxvk

Performance regression with default Meson build settings #386
