psteinb closed 6 years ago
Try increasing the verbosity (using one or more -v flags). The subsequent output, which may be very long, will include the following table:
Bottlenecks:
level | a. intensity | performance | bandwidth | bandwidth kernel
--------+--------------+-----------------+--------------+-----------------
CPU | | 17.60 GFLOP/s | |
L1 | 0.042 FLOP/B | 2.02 GFLOP/s | 48.45 GB/s | copy
L2 | inf FLOP/B | inf YFLOP/s | 21.17 GB/s | load
L3 | inf FLOP/B | inf YFLOP/s | 10.97 GB/s | load
MEM | inf FLOP/B | inf YFLOP/s | 9.11 GB/s | load
which is basically everything you need to reconstruct a complete roofline.
Is that what you were looking for? If so, just close the ticket or elaborate on your needs.
wow, didn't know that the verbose flag had this in stock. Excellent feature.
However, I am not sure I grasp how this table was made and how it helps me in constructing a roofline plot. I see that 2.02 GFLOP/s are consumed by this kernel at the current configuration and that the max performance of this machine lies at 17.60 GFLOP/s (the horizontal line of the roof). How would I get the slope of the roofline from this table?
It gets a bit more confusing for me when I use a stencil kernel:
$ kerncraft -vvv -p Roofline -m /home/steinbac/software/kerncraft/repo/examples/machine-files/IvyBridgeEP_E5-2660v2.yml /home/steinbac/software/kerncraft/repo/examples/kernels/3d-27pt.c -D N 100 -D M 500
#...
Bottlenecks:
level | a. intensity | performance | bandwidth | bandwidth kernel
--------+--------------+-----------------+--------------+-----------------
CPU | | 17.60 GFLOP/s | |
L1 | 0.28 FLOP/B | 13.63 GFLOP/s | 48.45 GB/s | copy
L2 | 0.68 FLOP/B | 22.59 GFLOP/s | 33.47 GB/s | copy
L3 | 0.68 FLOP/B | 10.42 GFLOP/s | 15.44 GB/s | copy
MEM | 1.1 FLOP/B | 12.40 GFLOP/s | 11.03 GB/s | copy
Here I am also a bit lost as to how each row (except the CPU one) was produced.
The slope in the Roofline model is the bandwidth-limited region, where I * b_s < P_{max}, with I the arithmetic intensity (plotted on the x axis) and b_s the applicable peak bandwidth (4th column). In that region the predicted performance is I * b_s, so b_s sets the slope. The prediction is the lowest of the per-level performances (the bottleneck), each computed from the arithmetic intensity (also referred to as computational intensity) found in the second column.
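To make the row-by-row calculation concrete, here is a minimal Python sketch that recomputes the performance column of the 3d-27pt table as I * b_s and picks the bottleneck. The numbers are copied from the table above; the variable and function names are made up for illustration and are not part of kerncraft. Because the printed intensities are rounded, the results only roughly match the table (for example about 10.50 instead of 10.42 GFLOP/s for L3).

```python
# Values copied from the 3d-27pt "Bottlenecks" table in this thread.
P_MAX = 17.60  # GFLOP/s, taken from the CPU row

# level -> (arithmetic intensity in FLOP/B, peak bandwidth in GB/s)
LEVELS = {
    "L1": (0.28, 48.45),
    "L2": (0.68, 33.47),
    "L3": (0.68, 15.44),
    "MEM": (1.1, 11.03),
}


def bandwidth_bound(intensity, bandwidth):
    """Performance on the sloped part of the roof: I * b_s, in GFLOP/s."""
    return intensity * bandwidth


# One bandwidth-bound performance per memory level (not capped at P_MAX,
# which is why the L2 value can exceed the CPU peak).
bounds = {level: bandwidth_bound(i, b) for level, (i, b) in LEVELS.items()}

# The bottleneck is the level with the lowest bound; the overall
# prediction can never exceed the in-core peak P_MAX.
bottleneck = min(bounds, key=bounds.get)
prediction = min(P_MAX, bounds[bottleneck])

for level, perf in bounds.items():
    print(f"{level:>3}: {perf:6.2f} GFLOP/s")
print(f"bottleneck: {bottleneck}, predicted {prediction:.2f} GFLOP/s")
```

With these inputs the lowest bound belongs to L3, which agrees with the table.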
You find multiple arithmetic intensities in the table because each cache level, as well as main memory, may see fewer accesses than the level above it. This depends on the access pattern and iteration order in the code. In many cases the main memory accesses will limit the performance, but in the 3d-27pt example with N=100 and M=500, you see that this is not always the case: here the L3 cache bandwidth is the limiting bottleneck.
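Since every memory level has its own peak bandwidth, every level also has its own roof. As a rough sketch of how one such roof could be reconstructed for plotting, the following Python snippet samples min(P_max, I * b_s) for the L3 row and computes the ridge point where the sloped and flat parts meet. The helper names are hypothetical and the two constants are copied from the 3d-27pt table; nothing here is kerncraft API.

```python
P_MAX = 17.60         # GFLOP/s, CPU row of the table
L3_BANDWIDTH = 15.44  # GB/s, bandwidth column of the L3 row


def ridge_intensity(p_max, bandwidth):
    """Arithmetic intensity (FLOP/B) at which I * b_s reaches P_max."""
    return p_max / bandwidth


def roofline_points(p_max, bandwidth, intensities):
    """Return (intensity, attainable performance) pairs for one roof."""
    return [(i, min(p_max, i * bandwidth)) for i in intensities]


ridge = ridge_intensity(P_MAX, L3_BANDWIDTH)
points = roofline_points(P_MAX, L3_BANDWIDTH, [0.25, 0.5, 1.0, 2.0, 4.0])

print(f"ridge point: {ridge:.2f} FLOP/B")
for intensity, perf in points:
    print(f"{intensity:5.2f} FLOP/B -> {perf:6.2f} GFLOP/s")
```

Left of the ridge point the values grow linearly with the intensity; right of it they stay at P_max. The resulting pairs can be handed to any plotting tool, usually with both axes on a logarithmic scale.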
You can find some slides on the Roofline model at https://moodle.rrze.uni-erlangen.de/pluginfile.php/10660/mod_resource/content/1/04_Roofline_Jacobi.pdf
thanks for the reminder ;) I've seen these slides too many times.
I was under the impression that "bandwidth" referred to the memory bandwidth of my kernel under test. I wasn't aware of the fact that it corresponded to the peak bandwidth. I did indeed wonder why the 5th column was there. But that makes a lot of sense now.
Would you allow me to submit a PR to fix this?
PRs are always welcome :)
very cool. thanks for your detailed help. I might look into the text output a bit more as I'd love to make it more machine readable.
@psteinb there is a machine-readable output available: by providing --store file.pickle, the internal representation of the results and some intermediate results will be written to file.pickle in the Python pickle file format. However, this interface is not stable and may (but usually does not) change with any minor version.
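Reading such a file back only needs Python's standard pickle module. The sketch below is a minimal, hedged example: the layout of the stored object is not documented in this thread, so it just loads the file and reports what it finds. The function names are made up for illustration, and if the file contains kerncraft-specific classes, kerncraft may need to be importable for the load to succeed. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import pickle


def load_results(path):
    """Load a pickled object from path (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return a one-line summary of the top-level structure of obj."""
    if isinstance(obj, dict):
        return f"dict with keys: {sorted(map(str, obj))}"
    return f"object of type {type(obj).__name__}"
```

A typical use would be print(describe(load_results("file.pickle"))), followed by exploring the object interactively.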
superb project! Keep going. I just ran with 83f46fb200e5fe0669f54a2e0287d12d935c78fa and got this:
I was wondering if kerncraft could report the roofline itself as well. It looks like from "bw with from copy benchmark" that the code knows the upper limit to performance for this arithmetic intensity.