RRZE-HPC / kerncraft

Loop Kernel Analysis and Performance Modeling Toolkit
GNU Affero General Public License v3.0

reporting the roof with Roofline #84

Closed psteinb closed 6 years ago

psteinb commented 6 years ago

Superb project! Keep going. I just ran kerncraft at commit 83f46fb200e5fe0669f54a2e0287d12d935c78fa and got this:

kerncraft -p Roofline -m /home/steinbac/software/kerncraft/repo/examples/machine-files/IvyBridgeEP_E5-2660v2.yml /home/steinbac/software/kerncraft/repo/examples/kernels/add.c -D N 1000  
                                   kerncraft                                    
/home/steinbac/software/kerncraft/repo/examples/kernels/add.c-m /home/steinbac/software/kerncraft/repo/examples/machine-files/IvyBridgeEP_E5-2660v2.yml
-D N 1000
----------------------------------- Roofline -----------------------------------
Cache or mem bound with 1 core(s)
2.02 GFLOP/s due to L1 transfer bottleneck (bw with from copy benchmark)
Arithmetic Intensity: 0.04 FLOP/B

I was wondering if kerncraft could report the roofline itself as well. Judging from the "bw with from copy benchmark" note, the code seems to know the upper limit to performance for this arithmetic intensity.

cod3monk commented 6 years ago

Try increasing the verbosity (using one or multiple -v flags). The resulting output, which may be very long, will include the following table:

Bottlenecks:
  level | a. intensity |   performance   |   bandwidth  | bandwidth kernel
--------+--------------+-----------------+--------------+-----------------
    CPU |              |   17.60 GFLOP/s |              |
     L1 | 0.042 FLOP/B |    2.02 GFLOP/s |   48.45 GB/s | copy
     L2 |   inf FLOP/B |     inf YFLOP/s |   21.17 GB/s | load
     L3 |   inf FLOP/B |     inf YFLOP/s |   10.97 GB/s | load
    MEM |   inf FLOP/B |     inf YFLOP/s |    9.11 GB/s | load

which is basically everything you need to reconstruct a complete roofline.

Is that what you were looking for? If so, just close the ticket or elaborate on your needs.

psteinb commented 6 years ago

Wow, I didn't know that the verbose flag had this in store. Excellent feature.

However, I am not sure I grasp how this table was made and how it helps me construct a roofline plot. I see that this kernel attains 2.02 GFLOP/s in the current configuration and that the max performance of this machine lies at 17.60 GFLOP/s (the horizontal line of the roof). How would I get the slope of the roofline from this table?

It gets a bit more confusing for me when I use a stencil kernel:

$  kerncraft -vvv -p Roofline -m /home/steinbac/software/kerncraft/repo/examples/machine-files/IvyBridgeEP_E5-2660v2.yml /home/steinbac/software/kerncraft/repo/examples/kernels/3d-27pt.c -D N 100 -D M 500
#...
Bottlenecks:
  level | a. intensity |   performance   |   bandwidth  | bandwidth kernel
--------+--------------+-----------------+--------------+-----------------
    CPU |              |   17.60 GFLOP/s |              |
     L1 |  0.28 FLOP/B |   13.63 GFLOP/s |   48.45 GB/s | copy    
     L2 |  0.68 FLOP/B |   22.59 GFLOP/s |   33.47 GB/s | copy    
     L3 |  0.68 FLOP/B |   10.42 GFLOP/s |   15.44 GB/s | copy    
    MEM |   1.1 FLOP/B |   12.40 GFLOP/s |   11.03 GB/s | copy

Here I am also a bit lost as to how each row (except the CPU one) was produced.

cod3monk commented 6 years ago

The slope in the Roofline model is the bandwidth-limited region, where I * b_s < P_{max}. Here I is the arithmetic intensity (also referred to as computational intensity), plotted on the x axis and found in the second column, and b_s is the applicable peak bandwidth (4th column). The performance in the third column is I * b_s for that level, and the prediction is the lowest of these values and P_{max} (the bottleneck).

You find multiple arithmetic intensities in the table because each cache level, as well as main memory, may see fewer accesses than the level closer to the core. This depends on the access pattern and iteration order in the code. In many cases the main memory accesses will limit the performance, but in the 3d-27pt example with N=100 and M=500 you see that this is not always the case: there the L3 cache bandwidth is the limiting bottleneck (10.42 GFLOP/s).
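
As a rough illustration of the explanation above, here is a minimal Python sketch that recomputes the per-level limits from the 3d-27pt table and picks the bottleneck. The numbers are the rounded values as printed in the table, so the results differ slightly from kerncraft's own output. The function and variable names are made up for this example and are not part of kerncraft. Unlike the table, the sketch caps each level at P_max, which is why L2 comes out as 17.60 rather than 22.59 GFLOP/s.

```python
# Values copied from the 3d-27pt table above (rounded as printed).
P_MAX = 17.60  # GFLOP/s, the "CPU" row

# level -> (arithmetic intensity in FLOP/B, peak bandwidth in GB/s)
LEVELS = {
    "L1": (0.28, 48.45),
    "L2": (0.68, 33.47),
    "L3": (0.68, 15.44),
    "MEM": (1.1, 11.03),
}


def roofline(intensity, bandwidth, p_max=P_MAX):
    """Attainable performance in GFLOP/s: min(P_max, I * b_s)."""
    return min(p_max, intensity * bandwidth)


def predict(levels=LEVELS, p_max=P_MAX):
    """Return the per-level limits and the name of the limiting level."""
    perf = {"CPU": p_max}
    for name, (intensity, bandwidth) in levels.items():
        perf[name] = roofline(intensity, bandwidth, p_max)
    # The lowest limit is the bottleneck; on a tie the first entry (CPU) wins.
    bottleneck = min(perf, key=perf.get)
    return perf, bottleneck


if __name__ == "__main__":
    perf, bottleneck = predict()
    for name, (intensity, bandwidth) in LEVELS.items():
        # Ridge point: the intensity at which the slope meets the flat roof.
        ridge = P_MAX / bandwidth
        print(f"{name:>3}: {perf[name]:6.2f} GFLOP/s, ridge at {ridge:.2f} FLOP/B")
    print("bottleneck:", bottleneck)
```

For a plot, each level contributes one sloped line P = I * b_s that is cut off by the horizontal roof at P_max; the ridge point printed above is where the two meet.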

You can find some slides on the Roofline model at https://moodle.rrze.uni-erlangen.de/pluginfile.php/10660/mod_resource/content/1/04_Roofline_Jacobi.pdf

psteinb commented 6 years ago

thanks for the reminder ;) I've seen these slides too many times.

I was under the impression that "bandwidth" referred to the memory bandwidth of my kernel under test. I wasn't aware of the fact that it corresponded to the peak bandwidth. I did indeed wonder why the 5th column was there. But that makes a lot of sense now.

Would you allow me to submit a PR to fix this?

cod3monk commented 6 years ago

PRs are always welcome :)


psteinb commented 6 years ago

very cool. thanks for your detailed help. I might look into the text output a bit more as I'd love to make it more machine readable.

cod3monk commented 6 years ago

@psteinb there is a machine-readable output available: by providing --store file.pickle, the internal representation of the results and some intermediate results will be written to file.pickle in the Python pickle file format. However, this interface is not stable and may (but usually does not) change with any minor version.
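
A minimal sketch of how such a file could be inspected from Python, using only the standard pickle module. The layout of the stored object is not documented in this thread, so the code assumes nothing about it and only prints the types it finds in nested dictionaries. The helper names load_results and summarize are hypothetical. Unpickling may also require kerncraft and its dependencies to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle


def load_results(path):
    """Load whatever object was written with --store (layout not assumed)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(obj, depth=0, max_depth=2):
    """Return indented lines describing the keys and value types of nested dicts."""
    lines = []
    pad = "  " * depth
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append(f"{pad}{key!r}: {type(value).__name__}")
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines
```

Usage would look like print("\n".join(summarize(load_results("file.pickle")))). If the top-level object is not a dictionary, summarize returns an empty list and the object has to be explored by hand.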