Closed ThGravo closed 6 years ago
https://bugs.gentoo.org/615412
solution: https://forums.gentoo.org/viewtopic-p-8030692.html#8030692 BTW, very interesting thread
"Speaking about python, can be build with pgo by default. Well 3.5 version take A LOT more time, but lto, graphite and pgo speed gain is ~25%."
By the way, what about profile-guided optimization? It would be interesting to enable it for as many packages as possible.
Edit:
mkdir -p /etc/portage/env/sys-devel
echo 'BOOT_CFLAGS="-mtune=native -O3 -pipe"' >> /etc/portage/env/sys-devel/gcc
echo 'GCC_MAKE_TARGET="profiledbootstrap"' >> /etc/portage/env/sys-devel/gcc
emerge -1v sys-devel/gcc
According to this PGO should also speed up gcc up to 15% when compiling other packages. That sounds awesome.
Does ./configure --enable-optimizations induce make profile-opt?
Python-2.7.13/configure.ac:
# Enable optimization flags
AC_SUBST(DEF_MAKE_ALL_RULE)
AC_SUBST(DEF_MAKE_RULE)
Py_OPT='false'
AC_MSG_CHECKING(for --enable-optimizations)
AC_ARG_ENABLE(optimizations, AS_HELP_STRING([--enable-optimizations], [Enable expensive optimizations (PGO, maybe LTO, etc). Disabled by default.]),
[
if test "$withval" != no
then
Py_OPT='true'
AC_MSG_RESULT(yes);
else
Py_OPT='false'
AC_MSG_RESULT(no);
fi],
[AC_MSG_RESULT(no)])
if test "$Py_OPT" = 'true' ; then
# Intentionally not forcing Py_LTO='true' here. Too many toolchains do not
# compile working code using it and both test_distutils and test_gdb are
# broken when you do managed to get a toolchain that works with it. People
# who want LTO need to use --with-lto themselves.
Py_LTO='true'
DEF_MAKE_ALL_RULE="profile-opt"
REQUIRE_PGO="yes"
DEF_MAKE_RULE="build_all"
else
DEF_MAKE_ALL_RULE="build_all"
REQUIRE_PGO="no"
DEF_MAKE_RULE="all"
fi
To avoid questions about --with-lto:
# Enable LTO flags
AC_SUBST(LTOFLAGS)
AC_MSG_CHECKING(for --with-lto)
AC_ARG_WITH(lto, AS_HELP_STRING([--with-lto], [Enable Link Time Optimization in PGO builds. Disabled by default.]),
[
if test "$withval" != no
then
Py_LTO='true'
AC_MSG_RESULT(yes);
else
Py_LTO='false'
AC_MSG_RESULT(no);
fi],
[AC_MSG_RESULT(no)])
if test "$Py_LTO" = 'true' ; then
case $CC in
*clang*)
# Any changes made here should be reflected in the GCC+Darwin case below
LTOFLAGS="-flto"
;;
*gcc*)
case $ac_sys_system in
Darwin*)
LTOFLAGS="-flto"
;;
*)
LTOFLAGS="-flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none"
;;
esac
;;
esac
fi
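As a quick way to read the first quoted snippet, here is a minimal shell sketch that mirrors only its rule-selection branch. The function name default_make_rule is made up for illustration, and the argument stands in for the Py_OPT value; the real logic lives in Python's generated configure script.

```shell
#!/bin/sh
# Hypothetical helper mirroring the DEF_MAKE_ALL_RULE branch quoted above.
# $1 plays the role of Py_OPT: 'true' when --enable-optimizations was given.
default_make_rule() {
    if [ "$1" = "true" ]; then
        echo "profile-opt"   # default rule becomes the PGO pipeline
    else
        echo "build_all"     # default rule is a regular build
    fi
}

default_make_rule true
default_make_rule false
```

Going by the quoted configure.ac, the answer to the question appears to be yes: with --enable-optimizations the default rule is set to profile-opt.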
A 25% speedup with Python PGO! Wow! Would certainly help my emerge dependency resolution times :)
PGO is something I have been keeping my eye on and I would like to support it if at all possible. I'm going to guess that only packages with test suites are eligible as the profiling information has to come from somewhere--either that or we would require users to use whichever PGOed package for a day to collect such info. I know Firefox also supports PGO, although I haven't had much luck in building with it.
What do you think about doing PGO on a per-package basis until we iron out the details about how best to do it? I'm also not opposed to including modified ebuilds in the repo that offer PGO as a USE flag for certain packages. I expect we may end up having to do such anyways to fix broken ebuilds before they are merged upstream.
Perhaps to start with we could look at GCC and Python, per the comments in this issue?
> A 25% speedup with Python PGO! Wow! Would certainly help my emerge dependency resolution times :)
That's exactly what I thought: portage is sooooo slow!
> What do you think about doing PGO on a per-package basis until we iron out the details about how best to do it?
> Perhaps to start with we could look at GCC and Python, per the comments in this issue?
I completely agree.
> I'm also not opposed to including modified ebuilds in the repo that offer PGO as a USE flag for certain packages. I expect we may end up having to do such anyways to fix broken ebuilds before they are merged upstream.
When possible I would prefer to live patch the ebuilds instead. I used to do it when I needed to carry additional patches, but I'm not sure how extensible this approach is.
Yeah I just switched my own Portage over to git to make it easier to upstream patches. Working on this one right now: https://github.com/gentoo/gentoo/pull/5741
I want to get a "pgo" USE in GCC as I had a successful build today with PGO per your comment above
I'll be able to take a look at Python more in depth over the next couple of days. It doesn't look super involved thankfully. Unfortunately for me, Glibc 2.26-r1 is actually preventing my rebuild of Python (any version) because it doesn't have the "rpc" stuff in it anymore (it was deprecated and isn't in the USE anymore). I may file a bug upstream about it, as Python seems to hard depend on it. So I would not be able to test any ebuild modifications I make locally.
If anyone does manage to patch the ebuild and submit it upstream, would you mind linking the PR here?
> Yeah I just switched my own Portage over to git to make it easier to upstream patches. Working on this one right now: gentoo/gentoo#5741
Great! Can you please benchmark it against a couple of packages, measuring how much time they need to be compiled with both versions of gcc (gcc has been compiled with and without PGO)?
Sure! Do you have any packages in mind? I was thinking Firefox would be a good one.
Firefox nowadays compiles a lot of Rust, so that’s perhaps not the best idea. I was thinking about the GCC itself.
True--I hadn't considered Rust. I'll do a simple time emerge sys-devel/gcc with and without PGO and see what happens. I already have the PGOed version installed, so testing will be easy.
Possibly a package which scales very well with the number of cores, meaning that it spends most of its time doing real compilation instead of linking etc. For example, libreoffice will take a long time to compile no matter how many cores you throw at it: the vast majority of the cores will be semi-idle most of the time.
gcc itself would probably be a good candidate, but be sure to compile it with the very same flags each time (I suggest building it without PGO and LTO). Also be sure to avoid the install phase, possibly by using the ebuild binary directly to bypass it.
Even better: the linux kernel. Just run time make -j8 in the source directory.
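Since a single timed run can be noisy, one option is to repeat the build a few times and keep the fastest wall time. The sketch below is only an illustration: best_of is a made-up helper name, and it assumes a date that supports %N (GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print the fastest
# wall-clock time in seconds. Assumes GNU date (for %N).
best_of() {
    runs=$1; shift
    best=""
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        # Keep the smaller of the previous best and this run's duration.
        best=$(awk -v s="$start" -v e="$end" -v b="$best" \
            'BEGIN { t = e - s; if (b == "" || t < b) b = t; printf "%.2f", b }')
        i=$((i + 1))
    done
    echo "$best"
}

# Stand-in workload so the sketch runs anywhere.
best_of 2 sleep 1
```

In a kernel tree this could be called as best_of 3 sh -c 'make clean && make -j8', keeping in mind that the time for make clean is then included in each measurement.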
Wow--barely any difference in GCC. Method:
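1) Build GCC with USE="pgo" (use my patch on the PR)
2) time emerge sys-devel/gcc without USE="pgo" (the installed PGO-built GCC does the compiling here, so this gives the PGO time to build GCC)
3) time emerge sys-devel/gcc without USE="pgo" again (the non-PGO GCC installed by step 2 does the compiling now, so this gives the non-PGO time to build GCC)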
PGO:
real 1356.00
user 7310.50
sys 194.84
22.6m
non-PGO:
real 1362.53
user 7369.27
sys 196.55
22.7m
I'll try the linux kernel next
Disappointing, this is just 0.5%, very far from the 15% stated in that thread :( Let's see if things improve with the kernel, otherwise let's hope that Python with PGO will give us greater benefits.
Indeed! I'm wondering if perhaps GCC is pathological due to the bootstrapping it does. I think the linux kernel will be a better test for sure. Anyone want to try that out?
I can't, my dual core laptop isn't powerful enough for my tastes to be able to run Gentoo so I switched to Arch. I plan to buy a Threadripper in a couple of months, it will be fun then. My goal is to be able to rebuild all the packages (with O3, graphite, LTO and PGO) during the night (so in less than 7-8 hours). Probably a Ryzen 7 would be enough for that, but I will make good use of more cores anyway. I like to rebuild @world each night to be able to spot build failures early and to easily pinpoint them to a specific package upgrade. This, in conjunction with snapper's btrfs snapshots, would help me a lot to file good bug reports with ease.
python2.7 PGO preview
Not sure if it built correctly, since it was too fast for a profiled build; I still need to check the output for any errors.
Also please note that this was a single-pass benchmark and the CPU governor conservative was active while benchmarking (yes, a quick lazy check).
$ cat py2pgo.txt
+------------------------+----------+------------------------------+
| Benchmark | py2 | py2pgo |
+========================+==========+==============================+
| 2to3 | 1.22 sec | 1.31 sec: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| chaos | 345 ms | 370 ms: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| crypto_pyaes | 218 ms | 230 ms: 1.05x slower (+5%) |
+------------------------+----------+------------------------------+
| deltablue | 24.4 ms | 26.6 ms: 1.09x slower (+9%) |
+------------------------+----------+------------------------------+
| django_template | 498 ms | 518 ms: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| dulwich_log | 230 ms | 255 ms: 1.11x slower (+11%) |
+------------------------+----------+------------------------------+
| fannkuch | 1.27 sec | 1.31 sec: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| float | 359 ms | 383 ms: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| genshi_text | 114 ms | 120 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| genshi_xml | 240 ms | 251 ms: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| go | 656 ms | 695 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| hg_startup | 186 ms | 200 ms: 1.08x slower (+8%) |
+------------------------+----------+------------------------------+
| html5lib | 382 ms | 401 ms: 1.05x slower (+5%) |
+------------------------+----------+------------------------------+
| json_dumps | 36.7 ms | 38.9 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| logging_format | 37.8 us | 38.8 us: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| logging_silent | 1.09 us | 1.00 us: 1.09x faster (-8%) |
+------------------------+----------+------------------------------+
| logging_simple | 32.4 us | 33.3 us: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| meteor_contest | 251 ms | 261 ms: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| nbody | 376 ms | 420 ms: 1.12x slower (+12%) |
+------------------------+----------+------------------------------+
| nqueens | 323 ms | 343 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| pickle_pure_python | 1.29 ms | 1.47 ms: 1.13x slower (+13%) |
+------------------------+----------+------------------------------+
| pidigits | 297 ms | 315 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| pyflate | 1.93 sec | 1.98 sec: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| python_startup | 26.1 ms | 22.6 ms: 1.16x faster (-14%) |
+------------------------+----------+------------------------------+
| python_startup_no_site | 12.8 ms | 10.3 ms: 1.24x faster (-20%) |
+------------------------+----------+------------------------------+
| raytrace | 1.71 sec | 1.74 sec: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| regex_compile | 542 ms | 560 ms: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| regex_dna | 377 ms | 400 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| regex_effbot | 8.18 ms | 8.38 ms: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| regex_v8 | 87.1 ms | 92.1 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| richards | 239 ms | 258 ms: 1.08x slower (+8%) |
+------------------------+----------+------------------------------+
| scimark_fft | 959 ms | 1.03 sec: 1.07x slower (+7%) |
+------------------------+----------+------------------------------+
| scimark_lu | 966 ms | 922 ms: 1.05x faster (-5%) |
+------------------------+----------+------------------------------+
| scimark_monte_carlo | 363 ms | 386 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| scimark_sor | 702 ms | 686 ms: 1.02x faster (-2%) |
+------------------------+----------+------------------------------+
| spambayes | 229 ms | 233 ms: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| sqlalchemy_declarative | 585 ms | 622 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| sqlalchemy_imperative | 119 ms | 126 ms: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| sqlite_synth | 8.10 us | 8.61 us: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| sympy_expand | 2.26 sec | 2.31 sec: 1.02x slower (+2%) |
+------------------------+----------+------------------------------+
| sympy_sum | 499 ms | 520 ms: 1.04x slower (+4%) |
+------------------------+----------+------------------------------+
| sympy_str | 950 ms | 999 ms: 1.05x slower (+5%) |
+------------------------+----------+------------------------------+
| telco | 907 ms | 933 ms: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| tornado_http | 710 ms | 730 ms: 1.03x slower (+3%) |
+------------------------+----------+------------------------------+
| unpickle_list | 19.3 us | 20.8 us: 1.08x slower (+8%) |
+------------------------+----------+------------------------------+
| unpickle_pure_python | 596 us | 631 us: 1.06x slower (+6%) |
+------------------------+----------+------------------------------+
| xml_etree_parse | 316 ms | 300 ms: 1.05x faster (-5%) |
+------------------------+----------+------------------------------+
| xml_etree_generate | 471 ms | 496 ms: 1.05x slower (+5%) |
+------------------------+----------+------------------------------+
| xml_etree_process | 333 ms | 349 ms: 1.05x slower (+5%) |
+------------------------+----------+------------------------------+
> A 25% speedup with Python PGO! Wow!
| python_startup_no_site | 12.8 ms | 10.3 ms: 1.24x faster (-20%) |
🐎
> I'll be able to take a look at Python more in depth over the next couple of days. It doesn't look super involved thankfully. Unfortunately for me, Glibc 2.26-r1 is actually preventing my rebuild of Python (any version) because it doesn't have the "rpc" stuff in it anymore (it was deprecated and isn't in the USE anymore). I may file a bug upstream about it, as Python seems to hard depend on it. So I would not be able to test any ebuild modifications I make locally.
> If anyone does manage to patch the ebuild and submit it upstream, would you mind linking the PR here?
Work-around is simply adding:
append-cflags "-I/usr/include/tirpc"
to the ebuild (see: https://github.com/ThGravo/THG_gentoo_overlay/blob/master/dev-lang/python/python-3.6.1-r1.ebuild)
@pchome Nice work! It looks like the python startup times improved with PGO, but many other tests actually were slightly slower--is that within the range of statistical error? The only way I can think of PGO being slower would be if the training set didn't have good exemplar data.
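One rough way to look at the noise question is to tally how many rows went each way. The sketch below is only an illustration: tally is a made-up helper, and the four rows fed to it are a small excerpt copied from the py2pgo.txt table above.

```shell
#!/bin/sh
# Hypothetical helper: count "slower" and "faster" rows in a
# pyperformance-style comparison table read from stdin.
tally() {
    awk '
        /slower/ { slower++ }
        /faster/ { faster++ }
        END { printf "slower=%d faster=%d\n", slower, faster }
    '
}

# Excerpt of the table above; use `tally < py2pgo.txt` for the full file.
tally <<'EOF'
| 2to3                   | 1.22 sec | 1.31 sec: 1.07x slower (+7%) |
| logging_silent         | 1.09 us  | 1.00 us: 1.09x faster (-8%)  |
| python_startup         | 26.1 ms  | 22.6 ms: 1.16x faster (-14%) |
| nbody                  | 376 ms   | 420 ms: 1.12x slower (+12%)  |
EOF
```

Counting the full table by hand gives 43 slower rows against 6 faster ones. A skew that one-sided looks more like a systematic effect than random noise, although a single-pass run cannot settle that.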
@ThGravo Thanks! I'll give that a shot today. I was wondering if that net-libs/libtirpc might be helpful--looks like it was. Thanks for linking the bug!
> It looks like the python startup times improved with PGO, but many other tests actually were slightly slower--is that within the range of statistical error?
No, actually I found the error:
Running code to generate profile data (this can take a while):
make run_profile_task
make[1]: Entering directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
: # FIXME: can't run for a cross build
./python -m test.regrtest --pgo -x test_asyncore test_gdb test_multiprocessing test_subprocess || true
./python: symbol lookup error: ./python: undefined symbol: __gcov_indirect_call_callee
make[1]: Leaving directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
make build_all_merge_profile
make[1]: Entering directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
true
make[1]: Leaving directory '/var/tmp/portage/dev-lang/python-2.7.13/work/x86_64-pc-linux-gnu'
Rebuilding with profile guided optimizations:
So I believe only the startup was profiled.
I had no luck fixing this with different flag combinations; maybe it's a GCC profiler error.
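Because the Makefile appends || true to the profiling step, the build carries on even when the training run crashes. A minimal sketch for spotting this in a build log follows; pgo_training_failed is a made-up helper name, and the sample log lines are copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the given build log shows that the
# profiling run died with a symbol lookup error.
pgo_training_failed() {
    grep -q 'symbol lookup error' "$1"
}

# Sample log built from the lines quoted above.
log=$(mktemp)
cat > "$log" <<'EOF'
./python -m test.regrtest --pgo -x test_asyncore test_gdb test_multiprocessing test_subprocess || true
./python: symbol lookup error: ./python: undefined symbol: __gcov_indirect_call_callee
EOF

if pgo_training_failed "$log"; then
    echo "profiling run failed; PGO data is incomplete"
fi
rm -f "$log"
```

On a real system the argument would be the package's build.log; the exact path depends on the local Portage setup.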
Linux kernel results are in! I used the sys-process/time command for this to get some more detailed stats.
Method:
1) Build GCC without PGO
2) kernel: make clean; time make -j8
3) Build GCC with PGO
4) kernel: make clean; time make -j8
Results:
noPGO:
1872.85user 58.86system 4:09.28elapsed 774%CPU (0avgtext+0avgdata 215244maxresident)k
216304inputs+1438920outputs (25major+55083489minor)pagefaults 0swaps
PGO:
1687.76user 58.38system 3:39.56elapsed 795%CPU (0avgtext+0avgdata 215872maxresident)k
6656inputs+1438920outputs (35major+55119576minor)pagefaults 0swaps
30 seconds shaved off. Not bad!
Also, I edited the toolchain.eclass to not strip any optimization flags from GCC. So in my own branch of gentoolto, I have sys-devel/gcc as nolto. Both PGO and non-PGO were tested with the same flags.
@pchome Great catch--I'll hopefully have some time to look at making a similar patch for Python today or tomorrow. I'll submit that upstream and link the PR here as usual. Very cool findings everyone.
> 30 seconds shaved off. Not bad!
Actually it's a lot: it's 12% faster! In that thread they stated it was supposed to be 15% faster which is not too far.
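The percentages quoted in this thread can be reproduced from the reported wall times. This is a small sketch; to_seconds and percent_saved are made-up helper names, and the inputs are the elapsed and real times posted above.

```shell
#!/bin/sh
# Hypothetical helper: convert GNU time's m:ss.ss elapsed format to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Hypothetical helper: percent of wall time saved going from $1 to $2.
percent_saved() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

# Kernel build: 4:09.28 without PGO vs 3:39.56 with PGO.
nopgo=$(to_seconds 4:09.28)
pgo=$(to_seconds 3:39.56)
percent_saved "$nopgo" "$pgo"      # prints 11.9

# GCC build: 1362.53 s without PGO vs 1356.00 s with PGO.
percent_saved 1362.53 1356.00      # prints 0.5
```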
I expect it'll probably vary from package to package. GCC for example does bootstrapping which will wipe out the PGO benefits early on, resulting in what we saw before. Still, probably worth it for a Gentoo-er :)
It looks like there may be a bug in the GCC buildsystem that does not respect AR, NM, and RANLIB for some packages. I can build it with LTO if I use the linker plugin, but not without. Not something you'd notice unless you removed strip-cflags and friends like I did.
Ahh yes, found the problem with GCC LTO--details have been posted to the symlink thread to keep things clean in here.
Of potential interest here:
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01692.html
https://gcc.gnu.org/wiki/AutoFDO/Tutorial
It looks like this isn't quite ready for prime time yet, but it could potentially make PGO accessible to a wider range of programs (even if it's not as good as explicit PGO). Neat!
> I have no luck to fix this w/ different flag combination, maybe it's GCC profiler error.
Meanwhile python-2.7.14 is still affected, but python-3.6.1-r1 looks ok.
Sandbox should be disabled due to:
0:06:13 [ 74/405] test_compileall
* ACCESS DENIED: mkdir: /usr/lib64/python3.6/site-packages/__pycache__
All tests take nearly 1 hour on my system.
Did a quick test with emerge and Python 3.6.1 PGO:
Method:
ebuild USE="pgo" python:3.6 merge
emerge --pretend --ask --update --newuse --deep --keep-going --with-bdeps=y --complete-graph @world
My own Fish shell records the wall time it takes. Results:
PGO: 4.5 minutes
non-PGO: 4.7 minutes
I suspect the emerge time is dominated by the SAT stuff taking place inside it. Still, shaving 12 seconds off isn't bad! I expect the best results would probably come from something like pypy. I'll make a PR upstream to add USE=pgo into Python once I find a clean way to do it.
PR created upstream for all Python versions in Portage: https://github.com/gentoo/gentoo/pull/5768
Wait... Does portage work with pypy?
I have read that some have had success with pypy2, but no word on pypy3. Would you be willing to try?
As I said in another thread unfortunately my dual core laptop isn't fast enough to run Gentoo, so until I will buy a faster desktop I'm stuck with Arch Linux.
GCC PR accepted: https://github.com/gentoo/gentoo/commit/0591d59df6a846750c267e34a28b8b8d87812101
Go ahead and set your gcc USE="pgo" and enjoy everyone :)
I encourage everyone to check out the bashrc.d thread as it is indeed related to general PGO!
How exactly does it support PGO?
See the README file--there's an entire section on PGO in there.
https://github.com/vaeth/portage-bashrc-mv/blob/master/bashrc.d/README
Can we have a POC so we can see if it fits into the workflow easily. And does it support overlays?
As of HEAD, we now have PGO-enabled Python ebuilds. PGO is off by default, but can be enabled by adding pgo to your USE flags. The difference is very noticeable. Closing this thread for now, however don't hesitate to make a new PGO-related thread if more discussion is needed.
Sorry to revive, I'd rather not open another issue for this. I've noticed that this occurs in the build log when emerging dev-lang/python-3.7.4-r2::lto-overlay.
checking for --enable-optimizations... no
checking for --with-lto... no
So I've done:
echo 'dev-lang/python python-enable-opts.conf' >> /etc/portage/package.env
mkdir -p /etc/portage/env # not needed if env exists
echo 'EXTRA_ECONF="--enable-optimizations"' >> /etc/portage/env/python-enable-opts.conf
Now build.log shows
checking for --enable-optimizations... yes
checking for --with-lto... no
~~As an aside, adding '--enable-lto' does nothing, but this page shows that it wouldn't matter anyway. https://stackoverflow.com/questions/41405728/what-does-enable-optimizations-do-while-compiling-python~~ https://github.com/docker-library/python/issues/160
...However, since the lto-overlay ebuild has PGO, does it even matter to enable --enable-optimizations?
I'll just be adding EXTRA_ECONF="--enable-optimizations --with-lto" to it. I am uncertain of proper syntax.
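On the syntax question: EXTRA_ECONF takes a single quoted string with the configure flags separated by spaces. The sketch below writes the two files into a scratch directory so it can be tried without root. CONFIG_ROOT is a made-up variable for illustration; on a real system the files would live under /etc/portage.

```shell
#!/bin/sh
# Sketch only: CONFIG_ROOT is hypothetical and defaults to a temp directory.
CONFIG_ROOT=${CONFIG_ROOT:-$(mktemp -d)}

mkdir -p "$CONFIG_ROOT/env"

# Env files named in package.env are looked up in the env/ directory.
echo 'EXTRA_ECONF="--enable-optimizations --with-lto"' \
    > "$CONFIG_ROOT/env/python-enable-opts.conf"

# Map the package to that env file.
echo 'dev-lang/python python-enable-opts.conf' >> "$CONFIG_ROOT/package.env"

cat "$CONFIG_ROOT/env/python-enable-opts.conf"
```

After rebuilding, the two "checking for ..." lines in build.log are the place to confirm that both flags were picked up.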
When you have a look at the configure output of python there is potential:
checking for --enable-optimizations... no
and further down the line:
If you want a release build with all optimizations active (LTO, PGO, etc), please run ./configure --enable-optimizations
Do you see a way to incorporate that in your overlay?