HPCE / hpce-2017-cw6


Get call stack with perf #5

Closed: natoucs closed this issue 6 years ago

natoucs commented 6 years ago

Using perf, I can access the call stack of a given function. However, it is quite imprecise: some called functions are missing, and for others the call stack cannot be expanded at all.

Example: below, scomp_LinearComp is not expandable even though it does call other functions, and Rabbit only appears to call two functions whereas it actually calls more:

(screenshot: perf report call tree, 2017-11-28 12:36)

How can I improve this? Increase the sample rate? Change profiler? I tried both and neither seemed to help.

m8pple commented 6 years ago

I think you're seeing the effect of the inliner - you obviously want to see the performance results for optimised compiles, but annoyingly those are the most difficult to look at.

Something that you can do (if you have debug symbols turned on with -g) is to use the "annotate" feature of perf. You can either do:

perf annotate <function-name>

or more conveniently you can press a in the tree view (so for example move the selection to scomp_LinearComp in the above screen, then press a).

This lets you see which instructions are expensive, and it will map them back to source lines (with debug info), but sometimes the report can be very hard to read. The optimiser will often do strange things, so mapping an instruction back to a line can be complicated, especially for large functions.
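
Regarding the truncated call stacks themselves: I can't see your exact build and record flags, so this is an assumption about your setup, but perf usually produces much more complete stacks if you keep frame pointers and record with explicit call-graph support, along the lines of:

g++ -O3 -g -fno-omit-frame-pointer ...
perf record --call-graph dwarf ./<your-program>

where <your-program> stands in for your actual binary. DWARF-based unwinding can recover stacks even when frame pointers have been omitted, at the cost of larger perf.data files.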

Another useful approach is to decorate functions with __attribute__((noinline)), which forces the compiler not to inline that function. However, use it sparingly: if too many functions are out-of-line, it starts to distort the performance picture.
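
For example, a minimal sketch (the function name and body here are hypothetical, and the attribute is GCC/Clang syntax; MSVC spells it __declspec(noinline)):

// Hypothetical helper: noinline keeps it as a separate function,
// so it appears under its own name in the perf output.
__attribute__((noinline))
double sum_of_squares(const double *xs, int n)
{
  double acc=0;
  for(int i=0; i<n; i++){
     acc += xs[i]*xs[i];
  }
  return acc;
}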

Sometimes you just have a mega-function, so it is difficult to identify where time is spent - in those cases you may even want to introduce a new function in order to get a better sense of which parts matter. For example, you might have a function that looks like:

void something(...)
{
  // Phase 1
  for(int i=0; i<n; i++){
     ...
  }

  // Phase 2
  for(int i=0; i<2*n-1; i++){
     ...
  }
}

In those cases you might decide to hoist the phases out into functions.
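
As a sketch, using the example above (the phase-function names are made up, and I've assumed n is a parameter since the original signature is elided):

void something_phase1(int n)
{
   for(int i=0; i<n; i++){
      ...
   }
}

void something_phase2(int n)
{
   for(int i=0; i<2*n-1; i++){
      ...
   }
}

void something(int n)
{
   something_phase1(n);   // Phase 1
   something_phase2(n);   // Phase 2
}

Each phase then shows up as its own row in the perf tree, though you may need __attribute__((noinline)) on the phase functions too, otherwise the optimiser is likely to inline them straight back.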

A sneaky way of doing that in C++11 is to temporarily turn them into lambda functions, e.g.:

#include <functional>

void something(...)
{
  // Phase 1
  std::function<void()> p1 = [&]()
  {
     for(int i=0; i<n; i++){
        ...
     }
  };
  p1();

  // Phase 2
  std::function<void()> p2 = [&]()
  {
     for(int i=0; i<2*n-1; i++){
        ...
     }
  };
  p2();
}

Using std::function forces in an element of "type erasure", which means the compiler will tend to treat p1 and p2 as completely separate functions. Though be aware that:

1. The lambda functions will probably show up in the profile under very strange (mangled) names.
2. You're stopping the compiler from optimising across the lambda boundary, so be very careful about what you access from outside the lambda.

natoucs commented 6 years ago

Thank you very much for these detailed explanations.

When I use a to annotate a function, I get the following:

(screenshot: perf annotate output, 2017-11-28 17:44)

I understand that the hotspots are in red and that the numbers on the left are the percentage of total samples recorded against each instruction.

So to analyse the bottleneck further, I should navigate to the relevant source line and start my optimisations from there, is that right?

m8pple commented 6 years ago

From an optimisation point of view, yes, but I would encourage you not to think too much about line-level optimisation to start with. The low-level view is mainly there to guide you towards the loops and parts of the function that are expensive - at this stage you should try to understand the surrounding computation a bit more, rather than spending too much time on very local optimisations.

It is better to first build a high-level understanding of the tasks, dependencies, and data-flows within the bottleneck functions. Optimisations made now are likely to obscure that structure, and as you transform the code into a more efficient/parallel form you may find the original optimisations become irrelevant anyway.

So by all means play around, and see if some simple tweaks around those lines get some easy performance wins, but treat that as part of the process of gathering the information that enables more substantial changes. Later on you can come back to micro-optimisations, once everything else is fairly fixed.

(Though just my suggestion, you're not required to do that).

natoucs commented 6 years ago

Sure, thank you!