InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations
GNU General Public License v2.0

Python optimization #20

Closed ThGravo closed 6 years ago

ThGravo commented 7 years ago

When you look at the configure output of Python, there is potential:

checking for --enable-optimizations... no

and further down:

If you want a release build with all optimizations active (LTO, PGO, etc), please run ./configure --enable-optimizations

Do you see a way to incorporate that in your overlay?

pchome commented 7 years ago

https://bugs.gentoo.org/615412

solution: https://forums.gentoo.org/viewtopic-p-8030692.html#8030692 BTW, very interesting thread

darkbasic commented 7 years ago

"Speaking about python, can be build with pgo by default. Well 3.5 version take A LOT more time, but lto, graphite and pgo speed gain is ~25%."

By the way, what about profile-guided optimization? It would be interesting to enable it for as many packages as possible.

Edit:

mkdir -p /etc/portage/env/sys-devel
echo 'BOOT_CFLAGS="-mtune=native -O3 -pipe"' >> /etc/portage/env/sys-devel/gcc
echo 'GCC_MAKE_TARGET="profiledbootstrap"' >> /etc/portage/env/sys-devel/gcc
emerge -1v sys-devel/gcc

According to this PGO should also speed up gcc up to 15% when compiling other packages. That sounds awesome.

ThGravo commented 7 years ago

Does ./configure --enable-optimizations induce make profile-opt?

pchome commented 7 years ago

Python-2.7.13/configure.ac:

# Enable optimization flags
AC_SUBST(DEF_MAKE_ALL_RULE)
AC_SUBST(DEF_MAKE_RULE)
Py_OPT='false'
AC_MSG_CHECKING(for --enable-optimizations)
AC_ARG_ENABLE(optimizations, AS_HELP_STRING([--enable-optimizations], [Enable expensive optimizations (PGO, maybe LTO, etc).  Disabled by default.]),
[
if test "$withval" != no
then
  Py_OPT='true'
  AC_MSG_RESULT(yes);
else
  Py_OPT='false'
  AC_MSG_RESULT(no);
fi],
[AC_MSG_RESULT(no)])
if test "$Py_OPT" = 'true' ; then
  # Intentionally not forcing Py_LTO='true' here.  Too many toolchains do not
  # compile working code using it and both test_distutils and test_gdb are
  # broken when you do managed to get a toolchain that works with it.  People
  # who want LTO need to use --with-lto themselves.
  DEF_MAKE_ALL_RULE="profile-opt"
  REQUIRE_PGO="yes"
  DEF_MAKE_RULE="build_all"
else
  DEF_MAKE_ALL_RULE="build_all"
  REQUIRE_PGO="no"
  DEF_MAKE_RULE="all"
fi
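One way to confirm this from a configured build tree is sketched below. It assumes the generated Makefile contains a line of the form "all: <rule>" carrying the DEF_MAKE_ALL_RULE value; the excerpt above only shows the substitution, not the Makefile itself. The helper name default_all_rule is made up for illustration.

```shell
#!/bin/sh
# Print the first prerequisite of the "all" target in a generated Makefile.
# Assumption: the Makefile has a line of the form "all: <rule>".
default_all_rule() {
    awk '/^all:/ { print $2; exit }' "$1"
}
```

Run as default_all_rule Makefile in the build directory. With --enable-optimizations it should print profile-opt, otherwise build_all, if the assumption about the Makefile holds.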

pchome commented 7 years ago

to avoid question about --with-lto:

# Enable LTO flags
AC_SUBST(LTOFLAGS)
AC_MSG_CHECKING(for --with-lto)
AC_ARG_WITH(lto, AS_HELP_STRING([--with-lto], [Enable Link Time Optimization in PGO builds. Disabled by default.]),
[
if test "$withval" != no
then
  Py_LTO='true'
  AC_MSG_RESULT(yes);
else
  Py_LTO='false'
  AC_MSG_RESULT(no);
fi],
[AC_MSG_RESULT(no)])
if test "$Py_LTO" = 'true' ; then
  case $CC in
    *clang*)
      # Any changes made here should be reflected in the GCC+Darwin case below
      LTOFLAGS="-flto"
      ;;
    *gcc*)
      case $ac_sys_system in
        Darwin*)
          LTOFLAGS="-flto"
          ;;
        *)
          LTOFLAGS="-flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none"
          ;;
      esac
      ;;
  esac
fi

InBetweenNames commented 7 years ago

A 25% speedup with Python PGO! Wow! Would certainly help my emerge dependency resolution times :)

PGO is something I have been keeping my eye on and I would like to support it if at all possible. I'm going to guess that only packages with test suites are eligible as the profiling information has to come from somewhere--either that or we would require users to use whichever PGOed package for a day to collect such info. I know Firefox also supports PGO, although I haven't had much luck in building with it.

What do you think about doing PGO on a per-package basis until we iron out the details about how best to do it? I'm also not opposed to including modified ebuilds in the repo that offer PGO as a USE flag for certain packages. I expect we may end up having to do such anyways to fix broken ebuilds before they are merged upstream.

Perhaps to start with we could look at GCC and Python, per the comments in this issue?

darkbasic commented 7 years ago

A 25% speedup with Python PGO! Wow! Would certainly help my emerge dependency resolution times :)

That's exactly what I thought: portage is sooooo slow!

What do you think about doing PGO on a per-package basis until we iron out the details about how best to do it?

Perhaps to start with we could look at GCC and Python, per the comments in this issue?

I completely agree.

I'm also not opposed to including modified ebuilds in the repo that offer PGO as a USE flag for certain packages. I expect we may end up having to do such anyways to fix broken ebuilds before they are merged upstream.

When possible I would rather live-patch the ebuilds instead. I used to do it when I needed to carry additional patches, but I'm not sure how extensible this approach is.

InBetweenNames commented 7 years ago

Yeah I just switched my own Portage over to git to make it easier to upstream patches. Working on this one right now: https://github.com/gentoo/gentoo/pull/5741

I want to get a "pgo" USE in GCC as I had a successful build today with PGO per your comment above

InBetweenNames commented 7 years ago

I'll be able to take a look at Python more in depth over the next couple of days. It doesn't look super involved thankfully. Unfortunately for me, Glibc 2.26-r1 is actually preventing my rebuild of Python (any version) because it doesn't have the "rpc" stuff in it anymore (it was deprecated and isn't in the USE anymore). I may file a bug upstream about it, as Python seems to hard depend on it. So I would not be able to test any ebuild modifications I make locally.

If anyone does manage to patch the ebuild and submit it upstream, would you mind linking the PR here?

darkbasic commented 7 years ago

Yeah I just switched my own Portage over to git to make it easier to upstream patches. Working on this one right now: gentoo/gentoo#5741

Great! Can you please benchmark it against a couple of packages, measuring how much time they need to be compiled with both versions of gcc (gcc has been compiled with and without PGO)?

InBetweenNames commented 7 years ago

Sure! Do you have any packages in mind? I was thinking Firefox would be a good one.

Althorion commented 7 years ago

Firefox nowadays compiles a lot of Rust, so that’s perhaps not the best idea. I was thinking about the GCC itself.

InBetweenNames commented 7 years ago

True--I hadn't considered Rust. I'll do a simple time emerge sys-devel/gcc with and without PGO and see what happens. I already have the PGOed version installed, so testing will be easy.

darkbasic commented 7 years ago

Possibly a package which scales very well with the number of cores, meaning that it spends most of its time doing real compilation instead of linking etc. For example, libreoffice will take a long time to compile no matter how many cores you throw at it: the vast majority of the cores will be semi-idle most of the time. gcc itself would probably be a good candidate, but be sure to compile it with the very same flags each time (I suggest building it without PGO and LTO). Also be sure to avoid the install phase, possibly by using the ebuild binary directly to bypass it.

darkbasic commented 7 years ago

Even better: the linux kernel. Just run time make -j8 in the source directory.

InBetweenNames commented 7 years ago

Wow--barely any difference in GCC. Method:

1) Build GCC with USE="pgo" (use my patch on the PR)
2) time emerge sys-devel/gcc without USE="pgo" (to get the PGO time to build GCC)
3) time emerge sys-devel/gcc without USE="pgo" (to get non-PGO time to build GCC)

PGO:

real 1356.00
user 7310.50
sys 194.84

22.6m

non-PGO:

real 1362.53
user 7369.27
sys 196.55

22.7m

I'll try the linux kernel next

darkbasic commented 7 years ago

Disappointing, this is just 0.5%, very far from the 15% stated in that thread :( Let's see if things improve with the kernel; otherwise let's hope that Python with PGO gives us greater benefits.
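For reference, the percentage can be reproduced from the two "real" times reported above. This is a small sketch: pct_faster is a hypothetical helper name, and awk is assumed to be available.

```shell
#!/bin/sh
# pct_faster BASE NEW
# Percent reduction in wall time when going from BASE seconds to NEW seconds.
pct_faster() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# non-PGO real time vs. PGO real time from the GCC test
pct_faster 1362.53 1356.00   # prints 0.5
```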

InBetweenNames commented 7 years ago

Indeed! I'm wondering if perhaps GCC is pathological due to the bootstrapping it does. I think the linux kernel will be a better test for sure. Anyone want to try that out?

darkbasic commented 7 years ago

I can't, my dual core laptop isn't powerful enough for my tastes to be able to run Gentoo so I switched to Arch. I plan to buy a Threadripper in a couple of months, it will be fun then. My goal is to be able to rebuild all the packages (with O3, graphite, LTO and PGO) during the night (so in less than 7-8 hours). Probably a Ryzen 7 would be enough for that, but I will make good use of more cores anyway. I like to rebuild @world each night to be able to spot build failures early, to easily pinpoint them to a specific package upgrade. This, in conjunction with snapper's btrfs snapshots, would help me a lot to file good bug reports with ease.

pchome commented 7 years ago

python2.7 PGO preview

Not sure if it built correctly, since it was too fast for a profiled build, and I still need to check the output for errors. Also please note this is a single-pass benchmark and the conservative CPU governor was active while benchmarking (yes, a quick lazy check).

$ cat py2pgo.txt

+------------------------+----------+------------------------------+
| Benchmark              | py2      | py2pgo                       |
+========================+==========+==============================+
| 2to3                   | 1.22 sec | 1.31 sec: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| chaos                  | 345 ms   | 370 ms: 1.07x slower (+7%)   |
+------------------------+----------+------------------------------+
| crypto_pyaes           | 218 ms   | 230 ms: 1.05x slower (+5%)   |
+------------------------+----------+------------------------------+
| deltablue              | 24.4 ms  | 26.6 ms: 1.09x slower (+9%)  |
+------------------------+----------+------------------------------+
| django_template        | 498 ms   | 518 ms: 1.04x slower (+4%)   |
+------------------------+----------+------------------------------+
| dulwich_log            | 230 ms   | 255 ms: 1.11x slower (+11%)  |
+------------------------+----------+------------------------------+
| fannkuch               | 1.27 sec | 1.31 sec: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| float                  | 359 ms   | 383 ms: 1.07x slower (+7%)   |
+------------------------+----------+------------------------------+
| genshi_text            | 114 ms   | 120 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| genshi_xml             | 240 ms   | 251 ms: 1.04x slower (+4%)   |
+------------------------+----------+------------------------------+
| go                     | 656 ms   | 695 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| hg_startup             | 186 ms   | 200 ms: 1.08x slower (+8%)   |
+------------------------+----------+------------------------------+
| html5lib               | 382 ms   | 401 ms: 1.05x slower (+5%)   |
+------------------------+----------+------------------------------+
| json_dumps             | 36.7 ms  | 38.9 ms: 1.06x slower (+6%)  |
+------------------------+----------+------------------------------+
| logging_format         | 37.8 us  | 38.8 us: 1.03x slower (+3%)  |
+------------------------+----------+------------------------------+
| logging_silent         | 1.09 us  | 1.00 us: 1.09x faster (-8%)  |
+------------------------+----------+------------------------------+
| logging_simple         | 32.4 us  | 33.3 us: 1.03x slower (+3%)  |
+------------------------+----------+------------------------------+
| meteor_contest         | 251 ms   | 261 ms: 1.04x slower (+4%)   |
+------------------------+----------+------------------------------+
| nbody                  | 376 ms   | 420 ms: 1.12x slower (+12%)  |
+------------------------+----------+------------------------------+
| nqueens                | 323 ms   | 343 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| pickle_pure_python     | 1.29 ms  | 1.47 ms: 1.13x slower (+13%) |
+------------------------+----------+------------------------------+
| pidigits               | 297 ms   | 315 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| pyflate                | 1.93 sec | 1.98 sec: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| python_startup         | 26.1 ms  | 22.6 ms: 1.16x faster (-14%) |
+------------------------+----------+------------------------------+
| python_startup_no_site | 12.8 ms  | 10.3 ms: 1.24x faster (-20%) |
+------------------------+----------+------------------------------+
| raytrace               | 1.71 sec | 1.74 sec: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| regex_compile          | 542 ms   | 560 ms: 1.03x slower (+3%)   |
+------------------------+----------+------------------------------+
| regex_dna              | 377 ms   | 400 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| regex_effbot           | 8.18 ms  | 8.38 ms: 1.02x slower (+2%)  |
+------------------------+----------+------------------------------+
| regex_v8               | 87.1 ms  | 92.1 ms: 1.06x slower (+6%)  |
+------------------------+----------+------------------------------+
| richards               | 239 ms   | 258 ms: 1.08x slower (+8%)   |
+------------------------+----------+------------------------------+
| scimark_fft            | 959 ms   | 1.03 sec: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| scimark_lu             | 966 ms   | 922 ms: 1.05x faster (-5%)   |
+------------------------+----------+------------------------------+
| scimark_monte_carlo    | 363 ms   | 386 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| scimark_sor            | 702 ms   | 686 ms: 1.02x faster (-2%)   |
+------------------------+----------+------------------------------+
| spambayes              | 229 ms   | 233 ms: 1.02x slower (+2%)   |
+------------------------+----------+------------------------------+
| sqlalchemy_declarative | 585 ms   | 622 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| sqlalchemy_imperative  | 119 ms   | 126 ms: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| sqlite_synth           | 8.10 us  | 8.61 us: 1.06x slower (+6%)  |
+------------------------+----------+------------------------------+
| sympy_expand           | 2.26 sec | 2.31 sec: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| sympy_sum              | 499 ms   | 520 ms: 1.04x slower (+4%)   |
+------------------------+----------+------------------------------+
| sympy_str              | 950 ms   | 999 ms: 1.05x slower (+5%)   |
+------------------------+----------+------------------------------+
| telco                  | 907 ms   | 933 ms: 1.03x slower (+3%)   |
+------------------------+----------+------------------------------+
| tornado_http           | 710 ms   | 730 ms: 1.03x slower (+3%)   |
+------------------------+----------+------------------------------+
| unpickle_list          | 19.3 us  | 20.8 us: 1.08x slower (+8%)  |
+------------------------+----------+------------------------------+
| unpickle_pure_python   | 596 us   | 631 us: 1.06x slower (+6%)   |
+------------------------+----------+------------------------------+
| xml_etree_parse        | 316 ms   | 300 ms: 1.05x faster (-5%)   |
+------------------------+----------+------------------------------+
| xml_etree_generate     | 471 ms   | 496 ms: 1.05x slower (+5%)   |
+------------------------+----------+------------------------------+
| xml_etree_process      | 333 ms   | 349 ms: 1.05x slower (+5%)   |
+------------------------+----------+------------------------------+

tested w/ http://pyperformance.readthedocs.io/usage.html

pchome commented 7 years ago

A 25% speedup with Python PGO! Wow!

| python_startup_no_site | 12.8 ms | 10.3 ms: 1.24x faster (-20%) |

🐎

ThGravo commented 7 years ago

I'll be able to take a look at Python more in depth over the next couple of days. It doesn't look super involved thankfully. Unfortunately for me, Glibc 2.26-r1 is actually preventing my rebuild of Python (any version) because it doesn't have the "rpc" stuff in it anymore (it was deprecated and isn't in the USE anymore). I may file a bug upstream about it, as Python seems to hard depend on it. So I would not be able to test any ebuild modifications I make locally.

If anyone does manage to patch the ebuild and submit it upstream, would you mind linking the PR here?

https://bugs.gentoo.org/631488

ThGravo commented 7 years ago

Work-around is simply adding: append-cflags "-I/usr/include/tirpc" to the ebuild (see: https://github.com/ThGravo/THG_gentoo_overlay/blob/master/dev-lang/python/python-3.6.1-r1.ebuild)

InBetweenNames commented 7 years ago

@pchome Nice work! It looks like the python startup times improved with PGO, but many other tests actually were slightly slower--is that within the range of statistical error? The only way I can think of PGO being slower would be if the training set didn't have good exemplar data.

@ThGravo Thanks! I'll give that a shot today. I was wondering if that net-libs/libtirpc might be helpful--looks like it was. Thanks for linking the bug!

pchome commented 7 years ago

It looks like the python startup times improved with PGO, but many other tests actually were slightly slower--is that within the range of statistical error?

No, actually I found the error:

Running code to generate profile data (this can take a while):
make run_profile_task
make[1]: Entering directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
: # FIXME: can't run for a cross build
./python -m test.regrtest --pgo -x test_asyncore test_gdb test_multiprocessing test_subprocess || true
./python: symbol lookup error: ./python: undefined symbol: __gcov_indirect_call_callee
make[1]: Leaving directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
make build_all_merge_profile
make[1]: Entering directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
true
make[1]: Leaving directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
Rebuilding with profile guided optimizations:

so I believe only startup was profiled.

I had no luck fixing this with different flag combinations; maybe it's a GCC profiler error.
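Because the profiling command in the log is followed by || true, make carries on after the crash and the build still completes, just without the training data. Below is a rough sketch for spotting this in a saved build log. It assumes the failure shows up as a "symbol lookup error" between the two banner lines seen above; profile_run_failed is a hypothetical name, and other failure modes are not covered.

```shell
#!/bin/sh
# profile_run_failed LOG
# Succeeds (exit 0) if the PGO training section of LOG contains a
# dynamic-loader symbol lookup error, fails otherwise.
profile_run_failed() {
    sed -n '/Running code to generate profile data/,/Rebuilding with profile guided optimizations/p' "$1" |
        grep -q 'symbol lookup error'
}
```

Usage: profile_run_failed build.log && echo "training run crashed, PGO data is incomplete".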

InBetweenNames commented 7 years ago

Linux kernel results are in! I used the sys-process/time command for this to get some more detailed stats.

Method:

1) Build GCC without PGO
2) kernel: make clean; time make -j8
3) Build GCC with PGO
4) kernel: make clean; time make -j8

Results:

noPGO:

1872.85user 58.86system 4:09.28elapsed 774%CPU (0avgtext+0avgdata 215244maxresident)k
216304inputs+1438920outputs (25major+55083489minor)pagefaults 0swaps

PGO:

1687.76user 58.38system 3:39.56elapsed 795%CPU (0avgtext+0avgdata 215872maxresident)k
6656inputs+1438920outputs (35major+55119576minor)pagefaults 0swaps

30 seconds shaved off. Not bad!
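The elapsed fields in the output above can be turned into seconds for comparison. This is a sketch that assumes the field looks like m:ss.cc or h:mm:ss, optionally with the literal "elapsed" suffix as printed above; elapsed_to_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# elapsed_to_seconds FIELD
# Convert an elapsed-time field such as "4:09.28elapsed" to seconds.
elapsed_to_seconds() {
    # Drop the trailing "elapsed" label if present, then fold the
    # colon-separated parts together in base 60.
    echo "${1%elapsed}" |
        awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

elapsed_to_seconds 4:09.28elapsed   # prints 249.28
elapsed_to_seconds 3:39.56elapsed   # prints 219.56
```

That is a difference of 29.72 seconds between the two kernel builds.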

Also, I edited the toolchain.eclass to not strip any optimization flags from GCC. So in my own branch of gentoolto, I have sys-devel/gcc as nolto. Both PGO and non-PGO were tested with the same flags.

InBetweenNames commented 7 years ago

@pchome Great catch--I'll hopefully have some time to look at making a similar patch for Python today or tomorrow. I'll submit that upstream and link the PR here as usual. Very cool findings everyone.

darkbasic commented 7 years ago

30 seconds shaved off. Not bad!

Actually it's a lot: it's 12% faster! In that thread they stated it was supposed to be 15% faster which is not too far.

InBetweenNames commented 7 years ago

I expect it'll probably vary from package to package. GCC for example does bootstrapping which will wipe out the PGO benefits early on, resulting in what we saw before. Still, probably worth it for a Gentoo-er :)

It looks like there may be a bug in the GCC buildsystem that does not respect AR, NM, and RANLIB for some packages. I can build it with LTO if I use the linker plugin, but not without. Not something you'd notice unless you removed strip-cflags and friends like I did.

InBetweenNames commented 7 years ago

Ahh yes, found the problem with GCC LTO--details have been posted to the symlink thread to keep things clean in here.

InBetweenNames commented 7 years ago

Of potential interest here:

https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01692.html

https://gcc.gnu.org/wiki/AutoFDO/Tutorial

It looks like this isn't quite ready for prime time yet, but it could potentially make PGO accessible to a wider range of programs (even if it's not as good as explicit PGO). Neat!

pchome commented 7 years ago

I had no luck fixing this with different flag combinations; maybe it's a GCC profiler error.

Meanwhile python-2.7.14 is still affected, but python-3.6.1-r1 looks ok. Sandbox should be disabled due to:

0:06:13 [ 74/405] test_compileall
 * ACCESS DENIED:  mkdir:        /usr/lib64/python3.6/site-packages/__pycache__

All tests take nearly 1 hr on my system.

InBetweenNames commented 7 years ago

Did a quick test with emerge and Python 3.6.1 PGO:

Method: my own Fish shell records the wall time it takes.

Results:

PGO: 4.5 minutes
non-PGO: 4.7 minutes

I suspect the emerge time is dominated by the SAT stuff taking place inside it. Still, shaving 12 seconds off isn't bad! I expect the best results would probably come from something like pypy. I'll make a PR upstream to add USE=pgo into Python once I find a clean way to do it.

InBetweenNames commented 7 years ago

PR created upstream for all Python versions in Portage: https://github.com/gentoo/gentoo/pull/5768

darkbasic commented 7 years ago

Wait... Does portage work with pypy?

InBetweenNames commented 7 years ago

I have read that some have had success with pypy2, but no word on pypy3. Would you be willing to try?

darkbasic commented 7 years ago

As I said in another thread, unfortunately my dual core laptop isn't fast enough to run Gentoo, so until I buy a faster desktop I'm stuck with Arch Linux.

InBetweenNames commented 7 years ago

GCC PR accepted: https://github.com/gentoo/gentoo/commit/0591d59df6a846750c267e34a28b8b8d87812101

Go ahead and set your gcc USE="pgo" and enjoy everyone :)

InBetweenNames commented 7 years ago

I encourage everyone to check out the bashrc.d thread as it is indeed related to general PGO!

darkbasic commented 7 years ago

How exactly does it support PGO?

InBetweenNames commented 7 years ago

See the README file--there's an entire section on PGO in there.

https://github.com/vaeth/portage-bashrc-mv/blob/master/bashrc.d/README

mgomersbach commented 7 years ago

Can we have a POC so we can see if it fits into the workflow easily? And does it support overlays?

InBetweenNames commented 6 years ago

As of HEAD, we now have PGO-enabled Python ebuilds. PGO is off by default, but can be enabled by adding pgo to your USE flags. The difference is very noticeable. Closing this thread for now; however, don't hesitate to open a new PGO-related thread if more discussion is needed.
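A minimal sketch of setting the flag per package rather than globally. The function name enable_use_flag, its ROOT parameter, and the file name "pgo" used when package.use is a directory are all hypothetical choices for illustration.

```shell
#!/bin/sh
# enable_use_flag ROOT ATOM FLAG
# Append "ATOM FLAG" to package.use under ROOT/etc/portage.
# package.use may be a single file or a directory of files.
enable_use_flag() {
    dir="$1/etc/portage"
    if [ -d "$dir/package.use" ]; then
        # Directory layout: use a dedicated file (name is a hypothetical choice).
        target="$dir/package.use/pgo"
    else
        mkdir -p "$dir"
        target="$dir/package.use"
    fi
    echo "$2 $3" >> "$target"
}
```

On a live system this would be run as root with an empty ROOT, for example enable_use_flag "" dev-lang/python pgo or enable_use_flag "" sys-devel/gcc pgo, followed by re-emerging the package.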

jiblime commented 5 years ago

Sorry to revive, I'd rather not open another issue for this. I've noticed that this occurs in the build log when emerging dev-lang/python-3.7.4-r2::lto-overlay.

checking for --enable-optimizations... no
checking for --with-lto... no

So I've done:

echo 'dev-lang/python python-enable-opts.conf' >> /etc/portage/package.env
mkdir -p /etc/portage/env # not needed if env exists
echo 'EXTRA_ECONF="--enable-optimizations"' >> /etc/portage/env/python-enable-opts.conf

Now build.log shows

checking for --enable-optimizations... yes
checking for --with-lto... no
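The same check can be scripted against a saved build log. This sketch assumes the log contains configure lines in exactly the "checking for OPTION... yes/no" form shown above; configure_result is a hypothetical helper name.

```shell
#!/bin/sh
# configure_result LOG OPTION
# Print the answer configure reported for OPTION (last occurrence wins).
configure_result() {
    sed -n "s/^checking for $2\.\.\. //p" "$1" | tail -n 1
}
```

For example, configure_result build.log --enable-optimizations and configure_result build.log --with-lto.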

~~As an aside, adding '--enable-lto' does nothing, but this page shows that it wouldn't matter anyway. https://stackoverflow.com/questions/41405728/what-does-enable-optimizations-do-while-compiling-python~~ https://github.com/docker-library/python/issues/160

...However, since the lto-overlay ebuild has PGO, does it even matter to pass --enable-optimizations?


I'll just be adding EXTRA_ECONF="--enable-optimizations --with-lto" to it. I am uncertain of proper syntax.
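On the syntax question: a hedged sketch of what the env file could look like with both options, written by a small helper. It assumes EXTRA_ECONF is read as a space-separated string of extra configure arguments, and that package.env is a plain file rather than a directory. The function name write_python_econf and its ROOT parameter are hypothetical.

```shell
#!/bin/sh
# write_python_econf ROOT
# Create the per-package env file and register it for dev-lang/python.
write_python_econf() {
    env_dir="$1/etc/portage/env"
    mkdir -p "$env_dir"
    # Overwrite, so repeated runs do not stack duplicate assignments.
    echo 'EXTRA_ECONF="--enable-optimizations --with-lto"' \
        > "$env_dir/python-enable-opts.conf"
    # package.env maps an atom to a file name relative to etc/portage/env.
    # Assumption: package.env is a plain file here, not a directory.
    pkg_env="$1/etc/portage/package.env"
    grep -qs '^dev-lang/python python-enable-opts.conf$' "$pkg_env" ||
        echo 'dev-lang/python python-enable-opts.conf' >> "$pkg_env"
}
```

On a live system this would be run as root with an empty ROOT: write_python_econf "". Whether --with-lto then behaves well alongside the overlay's own LTO flags is not something this sketch verifies.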