InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations
GNU General Public License v2.0

autopar/offloading etc #158

Open sjnewbury opened 5 years ago

sjnewbury commented 5 years ago

In addition to LTO/Graphite, I also build with Auto-Parallelisation where possible. I've converted my own custom flag management hacks over to gentooLTO, including reworking my auto-prelink portage hook script into a portage-bashrc-mv drop-in.

In the gentoo-gpu overlay I'm maintaining a modified toolchain.eclass with offload support, which allows building offload-gcc-* plugin ebuilds so that OpenMP/OpenACC can utilise compute hardware such as GPUs and Intel MIC devices. Recently AMD announced a plugin for ROCm. This could potentially dovetail quite nicely with autopar.

Sadly, I don't have the hardware to test offloading. It requires a relatively recent NVidia GPU (Fermi and earlier, at least, are not supported), while ROCm requires either a PCIe port with v3 Atomics or a Vega on PCIe v2, and I have a POLARIS10 on PCIe v2!

If anybody daring enough could test emerging a few packages with autopar+offloading enabled that would be pretty cool and give me an incentive to keep up the maintenance. Any takers?

InBetweenNames commented 5 years ago

I'm interested in this! I can see auto-vectorization being a clear winner in many situations (especially with the appropriate ISA), but it had never occurred to me to enable auto-parallelization system-wide as well. For the uninitiated: auto-vectorization uses the available SIMD lanes on your processor to run multiple iterations of your loops in parallel, while auto-parallelization attempts to create multiple threads that correspond to different iterations of your loops. They are not mutually exclusive; see any modern OpenCL CPU implementation, where both are used. The notion of auto-offload is interesting as well.

AFAIK, auto-vectorization is already performed by GCC at the optimization levels supported by this overlay, but auto-parallelization and auto-offload are not.
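
To make this concrete, here's a toy kernel (my own sketch, not something the overlay ships) where both transformations can apply; the relevant GCC flags are noted in the comment:

```c
/* autopar_demo.c -- toy example of a loop GCC can both auto-vectorize (at -O3)
 * and auto-parallelize (with -ftree-parallelize-loops=N).  A sketch only;
 * hypothetical compile line, not an overlay default:
 *   gcc -O3 -floop-parallelize-all -ftree-parallelize-loops=4 autopar_demo.c -o autopar_demo
 * Parallelized loops are lowered through libgomp, so the binary links against
 * it (unless -Wl,--as-needed drops it because nothing was parallelized).
 */
#include <stdio.h>

#define N (1 << 24)

static float a[N], b[N], c[N];

int main(void)
{
    /* independent iterations: no loop-carried dependence, so each SIMD lane
     * and each worker thread can take its own chunk of the index space */
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + 1.0f;

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```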

It could be interesting to add a flag to sys-config/ltoize to enable your configuration. I have a 1080 Ti in my system, so I think the OpenACC stuff should work at least. I can't say anything about ROCm, though. At the very least, I'd be interested in trying out auto-parallelization.

sjnewbury commented 5 years ago

@InBetweenNames Presumably you're using the proprietary driver for your 1080 Ti, given it isn't reclocked by Nouveau AFAIK. In that case, you should be able to test offloading quite easily. Just remember to set eclass-overrides in repos.conf to use the toolchain.eclass in gentoo-gpu. Then you should be able to just set the offload-nvptx USE flag when emerging sys-devel/gcc (7 or 8). I would test with something supporting OpenMP or OpenACC first before rebuilding @world with autopar! ;-)
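
For a first smoke test, something like this minimal OpenMP target kernel should do. It's just a sketch; the compile line assumes a GCC built with the nvptx offload backend (i.e. the offload-nvptx USE flag above), and the region falls back to the host if no usable accelerator is found:

```c
/* omp_offload_test.c -- minimal OpenMP offload smoke test (a sketch).
 * Hypothetical compile line for a GCC built with nvptx offloading:
 *   gcc -O2 -fopenmp -foffload=nvptx-none omp_offload_test.c -o omp_offload_test
 */
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

static double a[N], b[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        b[i] = 2.0 * i;
    }

    /* offload the dot product to the device if one is available */
    #pragma omp target teams distribute parallel for map(to: a[0:N], b[0:N]) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("offload devices: %d, sum = %g\n", omp_get_num_devices(), sum);
    return 0;
}
```

IIRC libgomp also honours GOMP_DEBUG=1 in the environment, which should show whether the nvptx plugin actually gets loaded at run time.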

By the way, enabling autopar system-wide doesn't have much of a negative consequence in the case where no advantage is gained: since Gentoo has -Wl,--as-needed set during linking, libgomp and libpthread get dropped if they aren't used.

This has caused fewer issues than another system-wide experiment I've had some success with, namely linking everything with jemalloc! I do need to try AutoFDO...

InBetweenNames commented 5 years ago

I'll give it a shot after I investigate the AutoFDO stuff some more, for sure. Browsing the GCC documentation, I found this:

-floop-parallelize-all
Use the Graphite data dependence analysis to identify loops that can be parallelized. Parallelize all the loops that can be analyzed to not contain loop carried dependences without checking that it is profitable to parallelize the loops.

I'm a bit concerned about the above as it doesn't check if it's profitable to parallelize the loops, it just does it if the loop has no loop carried dependencies. Does the cost function kick in somewhere else? I'm hesitant to enable any options that don't have a cost function associated with them. Part of the reason I enable Graphite auto-vectorization is that it does do cost analysis on the code generation.
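
To spell out the "loop carried dependences" part, here's a toy illustration (my own sketch, not taken from the GCC docs or sources):

```c
/* dependence_demo.c -- toy illustration of what a loop-carried dependence
 * means for autopar.  A sketch; compile with e.g.
 *   gcc -O3 -floop-parallelize-all -ftree-parallelize-loops=4 -c dependence_demo.c
 */

#define N 100000

/* Parallelizable: iteration i only touches out[i] and in[i], so the index
 * space can be split across threads -- whether or not that is profitable. */
void scale(float *restrict out, const float *restrict in)
{
    for (int i = 0; i < N; i++)
        out[i] = 2.0f * in[i];
}

/* Not parallelizable as-is: iteration i reads the value written by iteration
 * i - 1 (a loop-carried dependence), so autopar has to leave it alone. */
void prefix_sum(float *buf)
{
    for (int i = 1; i < N; i++)
        buf[i] += buf[i - 1];
}
```

-floop-parallelize-all would mark loops like the first one and skip the second; my worry is purely about the profitability of parallelizing the first kind when the trip count is small.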

sjnewbury commented 5 years ago

I thought that at first, and I believe it is part of the reason why very few have used it. As described in the link I provided above: "You can trigger it by 2 flags -floop-parallelize-all -ftree-parallelize-loops=4. Both of them are needed, the first flag will trigger Graphite pass to mark loops that can be parallel and the second flag will trigger the code generation part." So, the GCC documentation is misleading: -floop-parallelize-all doesn't parallelize loops at all, it marks loops that can be parallelized during a later pass.

Here's a full quote from the link with a few comments:

Automatic parallelization in GCC

Automatic parallelization distributes sequential code into multi-threaded code. It automatically generates parallel (multi-threaded) code for specific loop constructs using the gomp library.

This means all code that is successfully parallelized needs to be linked with libgomp; I do this by using the -fopenmp compiler flag. As I mentioned, -Wl,--as-needed drops libgomp where no parallelization takes place.

The first version of the code, allowing parallelization of inner-most loops that carry no dependences, was contributed by Zdenek Dvorak and Sebastian Pop (integrated to GCC 4.3). The feature was later enhanced with reduction dependencies and outer loops support by Razya Ladelsky (GCC 4.3).

This was a long time ago; I think the GCC documentation might date back to this, although I haven't checked.

Number of threads is currently determined by the user via the compile command (-ftree-parallelize-loops=4)

There are simple profitability conditions:

- Based on the profile information to determine how frequently the loop is executed
- Examining whether the number of iterations is large enough to create new threads

If a loop satisfies the correctness and profitability conditions, GIMPLE_OMP_PARALLEL and GIMPLE_OMP_FOR codes are added (and OMP_ATOMIC for reduction support), and later expanded by the omp expansion machinery.

I'm sure there are still plenty of cases where the overhead is greater than the gain, but the gain can be significant:

SPEC2006 speedups with autopar

After refining the cost model, http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00881.html (GCC4.8), the following speedups are obtained on a Power7 with 6 cores, 4 way SMT each, comparing the trunk with O3 + autopar (parallelizing with 6 threads) vs. the trunk with O3 minus vectorization:

462.libquantum 2.5X

410.bwaves 3.3X

436.cactusADM 4.5X

459.GemsFDTD 1.27X

481.wrf 1.25X

Note: The speedup shown for libquantum with autopar has been obtained with previous versions of autopar; gaining the performance did not need the cost model change.

Autopar integration with Graphite

With the integration of Graphite (http://gcc.gnu.org/wiki/Graphite) to GCC4.4, a strong loop nest analysis and transformation engine was introduced, and the notion of using the polyhedral model to expose loop parallelism in GCC became feasible and relevant.

The first step, teaching Graphite that parallel code needs to be produced, was accomplished (GCC4.4). Graphite recognizes simple parallel loops (using SCoP detection and data dependency analysis), and passes on that information. Graphite annotates parallel loops and passes that information all the way through CLOOG to the current autopar code generator to produce the parallel, GOMP based code.

You can trigger it by 2 flags -floop-parallelize-all -ftree-parallelize-loops=4. Both of them are needed, the first flag will trigger Graphite pass to mark loops that can be parallel and the second flag will trigger the code generation part. See also Automatic Parallelization in Graphite

Teaching Graphite to find loop transformations (such as skewing, interchange etc.) that expose coarse grain synchronization free parallelism, and furthermore, handling parallelism requiring a small amount of synchronization were part of the graphite-autopar integration plans/TODOs detailed in http://gcc.gnu.org/ml/gcc/2009-03/msg00239.html, but have not yet been followed.

I'm certain that if more use were made of autopar, it would get worked on more. It's getting ever more useful with multi-core and HSA.

InBetweenNames commented 5 years ago

I agree the benefits should be very significant when the optimization is applied in the right places, but this paper states that these kinds of heuristics have yet to be developed. Granted, that was from 2011. I'm about to start reading the GCC sources to see if any work has been done since then on this aspect, as it is critical. From the paper:

6.1 Prospective work

There are two main issues that are the focus of the prospective work on automatic parallelization:

- Heuristics/cost model for automatic parallelization: Currently, a very simple method is used to determine whether it is profitable to parallelize a certain loop. We need a good model to determine if we should parallelize a loop considering performance reasons.
- Advanced automatic parallelization: Currently we are only able to detect whether a loop is parallel or not. We would like to explicitly apply transformations to increase and expose further parallelism opportunities, and we have shown that graphite is the right setting to design such transformations.

From what you quoted, it sounds like it needs profile information to determine if it should be applied? In any case, what you quoted suggests that the cost function, if it exists, lives in -ftree-parallelize-loops, so that's a good place to start looking. I just wouldn't want to unconditionally parallelize all loops without loop-carried dependencies. For short-lived loops, that could be really detrimental to performance.

sjnewbury commented 5 years ago

I don't think the cost model is the main issue with autopar currently. I did a little experimenting with nbench: it generally decides not to parallelize the tests, but where it did, the reported time was multiplied by the number of threads. I added support for POSIX clock_gettime(), which fixed the timer calculation, but I see no gain at all, just slightly slower results as the number of threads rises, which is weird. It's like it's still only counting the iterations of one thread, which is slowed by the overhead. Maybe the iteration count becomes thread-local and doesn't get incremented correctly? I'm going to look at the assembly and maybe file a new GCC bug.
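
For context, the timing change I mean is roughly this (a simplified sketch, not the actual nbench patch): take wall-clock time with clock_gettime(CLOCK_MONOTONIC) around the whole run, instead of summing up per-run times.

```c
/* timing_sketch.c -- simplified sketch of wall-clock timing with
 * clock_gettime(CLOCK_MONOTONIC), as opposed to accumulating per-run times
 * the way nbench's stopwatch does.  Not the actual nbench patch.
 * Compile with e.g.:  gcc -O2 timing_sketch.c -o timing_sketch
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* stand-in workload (hypothetical, serial); nbench's kernels would go here */
static void run_benchmark_kernel(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 50L * 1000 * 1000; i++)
        x += i * 0.5;
}

static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    run_benchmark_kernel();   /* the measured workload */
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* wall-clock time goes down if the parallel version actually helps,
     * unlike a sum of per-thread times, which can only go up */
    printf("elapsed: %.6f s\n", elapsed_seconds(start, end));
    return 0;
}
```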

I've been reading a little of the GCC Bugzilla and it seems fixes/improvements are planned for gcc-9. Perhaps it's an idea to wait until then? :-)

InBetweenNames commented 5 years ago

Yeah, I'm in favour of leaving this issue open for more discussion/insight for sure, as with the right application this could be a big win. Please keep us updated on what you find! :) Also, feel free to post relevant GCC bugzilla threads here. We have a few power users who are very interested in this stuff.

sjnewbury commented 5 years ago

I'm missing the obvious. nbench measures the time taken cumulatively to perform the operation, not the work done. Presumably this means nbench will also give the wrong result for vectorized code. I'm going to see if I can modify it to give meaningful results with modern compilers without compromising the benchmark too much.

sjnewbury commented 5 years ago

In case I was unclear: what I mean is that autopar is working correctly, but the code explicitly measures the time taken to perform the test and accumulates that measurement for each run, whether parallel or not. So in the end it measures the throughput of each thread, which you would expect to decline as additional threads are added, if only because the CPU runs at a faster clock under less load.

sjnewbury commented 5 years ago

Update: In case anybody wants to try this with AMD offloading, the AMDGPU GCC backend isn't there yet. I was overestimating the amount of progress that had been made, and it doesn't yet support being used as an offload accelerator. Supported NVidia devices and Intel MIC devices should work, though.