InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations
GNU General Public License v2.0

ARM64 - sorry to open this as an issue #738

Open sm-moshi opened 3 years ago

sm-moshi commented 3 years ago

Sorry for opening this as an issue, but I didn't know how else to raise it.

Does it make sense to use LTO for an ARM64 Raspberry Pi on Gentoo?

InBetweenNames commented 3 years ago

Sure, but you might consider cross compiling instead of building on the RPi itself. You can use distcc to do that fairly easily.

AnonymousRetard commented 3 years ago

Unfortunately, I don't think you get any speedup when using distcc with LTO. The linking step seems to run locally, and it probably has to: otherwise I would assume that every computer in the distcc cluster would need to have exactly the same libraries.

With LTO most of the compilation time is spent in the linking step. A pull request to distcc has recently been accepted that stops it from even trying to distribute LTO work: https://github.com/distcc/distcc/pull/413

Personally, I had been wondering for a long time why distcc isn't helping my systems anymore, and this is probably why. With an older version of distcc that lacks this PR, most of my packages actually seem to compile a bit slower and distribute very badly. Once a distcc release containing this PR lands in Gentoo, build times should at least no longer increase, but there also won't be any chance of improvement, since distcc won't even try to distribute that work.

Since I have LTO enabled system-wide, except when building the kernel, the kernel is the only thing I see big speed improvements on when using distcc.

This is unfortunately a pretty sad situation, because I think LTO matters even more on smaller and embedded systems: it can increase performance and usually results in smaller or at least equally sized binaries. So yes, it does make a lot of sense to enable it on a Raspberry Pi, except that it increases the compilation time by a lot and you cannot distribute the work. What you would have to do instead is compile the packages with LTO on a stronger machine, in a build environment for the Raspberry Pi, and then distribute the binaries to the Pi. Either that, or accept that it will take a really long time to build all the packages on the RPi itself.
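One way the "build elsewhere, ship binaries" approach could be wired up with Portage's binary package support is sketched below. It only shows the two make.conf settings involved; the binhost URL is a hypothetical placeholder, and setting up the actual ARM64 build environment on the stronger machine (a chroot or a cross toolchain) is not covered here.

```shell
# --- make.conf on the build machine (inside the Raspberry Pi build environment) ---
# Keep a binary package of everything that gets built.
FEATURES="buildpkg"

# --- make.conf on the Raspberry Pi ---
# Hypothetical URL: point this at wherever the build machine serves its packages.
PORTAGE_BINHOST="http://buildhost.example/packages"
# Fetch binary packages from the binhost instead of compiling locally.
FEATURES="getbinpkg"
```

Because the packages are no longer built on the Pi, the build machine's flags would need an explicit target instead of -march=native.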

Also, for smaller systems I suggest you go with -O2 or -Os instead of -O3, because -O3 tends to increase binary sizes by quite a lot, often without actually increasing performance. In my experience with embedded development, -Os has sometimes given both the best performance and the smallest binaries (the latter being the goal of -Os, while the former is supposed to be the goal of -O3), perhaps because the smaller code fits more easily into the small caches available on such processors. As you can see here some of the more advanced optimization flags enabled by this overlay sometimes decrease performance. Dropping all the extra advanced optimizations and running only:

-march=native -O2 -flto

should consistently give the best performance and binary sizes across the board (without any performance loss). Note that -march=native only works if you are compiling locally on the Pi.

Apart from taking less space (which you might not have a lot of on a smaller system), smaller binaries also start up faster, and startup time is often the dominant cost for programs with short runtimes. This is especially true if the backing storage is quite slow (like an SD card) and there's not much RAM available to act as a filesystem cache.
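As a minimal sketch, those three flags could be set in /etc/portage/make.conf roughly like this. It assumes the stock Gentoo make.conf layout with a COMMON_FLAGS variable and a build running on the Pi itself; it deliberately leaves out everything else this overlay would normally add.

```shell
# /etc/portage/make.conf (sketch; assumes packages are compiled on the Pi itself)
# Only the three flags suggested above, nothing else from the overlay.
COMMON_FLAGS="-march=native -O2 -flto"
CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS}"
```

Swapping -O2 for -Os in COMMON_FLAGS would be the variant to try if binary size matters most.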

shelterx commented 3 years ago

Yes, the linking is done locally, but it also depends on the source code: bigger packages certainly compile faster using distcc. llvm went down to ~40 minutes from 1 hour 30 minutes. The local machine is an i5 laptop, the distcc "server" is an i7 2600K.

EDIT: Also, O3 is probably not needed (like @AnonymousRetard wrote). It can make some stuff faster but other stuff slower, unless the program is specifically written for O3 optimization. So there's really no gain in using O3 as the default; it tends to even out in the end anyway.

AnonymousRetard commented 3 years ago

@shelterx Are you sure you actually built llvm with -flto though? Because of this issue: https://github.com/InBetweenNames/gentooLTO/issues/619 -flto is actually stripped from the llvm package in this overlay:

/etc/portage/package.cflags/lto.conf:sys-devel/llvm *FLAGS-=-flto* # Issue #619 temporarily disabled for now due to build errors

This means that, to have actually built llvm with LTO, you would have to have built it before this change was added, not be using lto.conf from this overlay, or have modified it yourself.

My output from "emerge --info llvm" also confirms that the CFLAGS & CXXFLAGS don't have -flto present.
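The same check can be scripted. The helper below is a hypothetical sketch, not part of Portage or this overlay: it takes a flags string and reports whether it contains -flto in either its plain or its -flto=... form.

```shell
# Hypothetical helper: succeeds if the given flags string enables LTO.
# Matches "-flto" and "-flto=<value>" as whole flags, but not "-fno-lto".
has_lto() {
  case " $1 " in
    *" -flto "* | *" -flto="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a hand-written flags string.
if has_lto "-march=native -O2 -flto=auto"; then
  echo "LTO enabled"
else
  echo "LTO not enabled"
fi
```

To use it for a real package, pass in the CFLAGS or CXXFLAGS value that "emerge --info llvm" reports.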

When -flto is enabled, I think all the code optimization is skipped in the compiling step and done during the linking step instead. This is why the linking step takes so much longer, and why the distcc helpers can't really help much with the actual compilation steps either. Sending the source code out over the network and the results back is likely just slowing down the whole process.

shelterx commented 3 years ago

@AnonymousRetard Oops, you are correct, and that would explain why I don't see some stuff getting passed to the distcc server. However, qtcore is compiled with -flto=auto and it's faster. I agree it doesn't help THAT much, but overall I think you gain more than you lose.

no distcc: 2021-04-13T15:19:03 >>> dev-qt/qtcore: 7′16″
distcc: 2021-05-11T15:14:36 >>> dev-qt/qtcore: 5′39″

Here's another example:

no distcc: 2021-05-03T12:11:13 >>> kde-apps/kate: 3′35″
distcc: 2021-05-14T10:37:13 >>> kde-apps/kate: 2′23″
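To put those timings in perspective, the saving can be worked out with a little shell arithmetic. This is just the calculation on the four figures quoted above; the function names are made up for the example.

```shell
# Convert a "minutes seconds" pair to seconds.
to_seconds() {
  echo $(( $1 * 60 + $2 ))
}

# Integer percentage of build time saved going from $1 seconds to $2 seconds.
percent_saved() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

qtcore_before=$(to_seconds 7 16)   # no distcc
qtcore_after=$(to_seconds 5 39)    # distcc
kate_before=$(to_seconds 3 35)     # no distcc
kate_after=$(to_seconds 2 23)      # distcc

echo "qtcore: ${qtcore_before}s -> ${qtcore_after}s, $(percent_saved "$qtcore_before" "$qtcore_after")% saved"
echo "kate: ${kate_before}s -> ${kate_after}s, $(percent_saved "$kate_before" "$kate_after")% saved"
```

That works out to roughly 22% less build time for qtcore (436 s to 339 s) and roughly 33% less for kate (215 s to 143 s), each from a single pair of runs.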

AnonymousRetard commented 3 years ago

@shelterx This is quite interesting. I might do some of my own tests on these packages later. I have a weak 4-core AMD system used as a server and a strong 16-core 5950X. I don't have specific examples, since it's been a long time since I last tried this, but I remember being very disappointed in distcc performance and actually seeing slowdowns from it on quite a few packages. Very few jobs were being distributed to the 5950X, and the majority of the build time was spent compiling things locally. These issues completely disappeared when building packages without -flto, but I decided that I would rather build stuff locally with LTO than try to speed up the jobs with distcc.

This discussion should perhaps continue somewhere else though. The issue tracker on distcc is probably a better place: https://github.com/distcc/distcc

As I mentioned in my original reply, a PR has been merged that looks like it will soon make distcc stop trying to distribute -flto jobs completely, so we'll have to raise an issue there if we want to change that behavior in a future release. Perhaps distributing helps in some cases but not others, but I'm not sure whether that can be detected automatically.