aayasin / perf-tools

A collection of performance analysis tools, recipes, handy scripts, microbenchmarks & more

Detecting wasted slots due to instruction dependency chain #7

Open ibogosavljevic opened 2 years ago

ibogosavljevic commented 2 years ago

Hi!

Is there a way to detect wasted slots due to instruction dependency chains in my code? The code suffers from many data cache misses and, to make things worse, it has loop-carried dependencies and little available instruction-level parallelism (ILP). There are two approaches to fix this: decrease data cache misses or increase the available ILP. How can I detect this?

aayasin commented 2 years ago

Hello. The cost is included in the Core Bound metric, per its description:

$ ./pmu-tools/toplev.py --describe Core_Bound
Backend_Bound.Core_Bound
        This metric represents fraction of slots where Core non-
        memory issues were of a bottleneck.  Shortage in hardware
        compute resources; or dependencies in software's
        instructions are both categorized under Core Bound. Hence it
        may indicate the machine ran out of an out-of-order
        resource; certain execution units are overloaded or
        dependencies in program's data- or instruction-flow are
        limiting the performance (e.g. FP-chained long-latency
        arithmetic operations).

Here is a demo using run.sh. By default it runs some Python code which, as you can see, has Core Bound at ~12% of slots.

$ ./do.py profile -pm 20 -v2
topdown 2-levels 3 runs ..
# 4.3-full-perf on 11th Gen Intel(R) Core(TM) i7-11700B @ 3.20GHz [tgl/icelake]
FE             Frontend_Bound                      % Slots                       4.2  < [24.5%] +-      0.3
BAD            Bad_Speculation                     % Slots                       4.9  < [24.5%] +-      6.0
BE             Backend_Bound                       % Slots                      13.5  < [18.1%] +-      0.9
RET            Retiring                            % Slots                      77.4    [52.0%] +-      5.9
Info.Core      CoreIPC                               Core_Metric                 3.99   [24.1%] +-      0.2
Info.Inst_Mix  Instructions                          Count          10,091,398,631      [24.1%] +- 474,295,735.7
Info.Inst_Mix  IpTB                                  Inst_Metric                14.45   [24.1%] +-      1.0
FE             Frontend_Bound.Fetch_Bandwidth      % Slots                       3.1  < [24.5%] +-      0.3
BAD            Bad_Speculation.Machine_Clears      % Slots                       3.7  < [24.5%] +-      6.2
BE/Core        Backend_Bound.Core_Bound            % Slots                      11.7  < [18.1%] +-      0.9
RET            Retiring.Light_Operations           % Slots                      77.3    [24.5%] +-      5.9<==
Info.Thread    IPC                                   Metric                      3.99   [24.1%] +-      0.2
Info.System    CPU_Utilization                       Metric                      0.77   [24.1%] +-      0.0
Info.System    Time                                  Seconds                     0.68   +-      0.0
Info.Core      ILP                                   Core_Metric                 9.08   [24.1%] +-      0.5
Info.Core      IpMispredict                          Inst_Metric             9,804.5    [24.1%] +-  4,077.3
Info.Core      CORE_CLKS                             Count           2,526,853,303      [24.1%] +- 103,600,985.4
MUX                                                %                            18.12

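Now, this is high-IPC code, so it is not the data-dependency case you are after. The ILP metric is still helpful, though, for attributing Core Bound to one of the two documented cases: a shortage of hardware compute resources, or dependencies in the software's instructions.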

You can tweak the demo to run your app and then share the output using do.py tar if you would like help with the analysis.
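To look at Core Bound and ILP side by side, the two values can be pulled out of the textual output. The following is only a minimal sketch: `parse_metrics` is a hypothetical helper (not part of perf-tools or pmu-tools), and it assumes the column layout of the demo output above, where the second column is the metric name and the first numeric token after it is the value.

```python
# Hypothetical helper, not part of perf-tools/pmu-tools.
# Assumes the layout shown in the demo: <area> <metric> <unit...> <value> ...

SAMPLE = """\
BE/Core        Backend_Bound.Core_Bound            % Slots                      11.7  < [18.1%] +-      0.9
Info.Thread    IPC                                   Metric                      3.99   [24.1%] +-      0.2
Info.Core      ILP                                   Core_Metric                 9.08   [24.1%] +-      0.5
"""

def parse_metrics(text):
    """Map metric name -> first numeric value found after the name."""
    metrics = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or tokens[0].startswith("#"):
            continue  # skip blank, short, and header lines
        name = tokens[1]
        for tok in tokens[2:]:
            try:
                # Counts are printed with thousands separators.
                metrics[name] = float(tok.replace(",", ""))
                break
            except ValueError:
                continue  # unit tokens such as "%" or "Slots"
    return metrics

m = parse_metrics(SAMPLE)
print("Core_Bound:", m["Backend_Bound.Core_Bound"], "ILP:", m["ILP"])
```

The sketch only extracts the numbers; how to interpret a given Core Bound and ILP pair is left to the metric descriptions.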

ibogosavljevic commented 2 years ago

In one program, an instruction X is stuck and nothing else can go through until the instruction X is unstuck, e.g. when the piece of data from the memory arrives or a computational resource becomes available. The pipeline is full of instructions in various stages of execution, but nothing can go forward because X has not completed.

But, in another program, instruction Y is also stuck for the same reason, but the CPU is still executing instructions because there is enough ILP.

Both programs will have a high value in the corresponding Core Bound metric.

Now, is it possible to read some CPU counters to figure out what percentage of cycles/slots the CPU was idle because it didn't have anything to do?
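The difference between the two programs described above can be illustrated with a toy model. This is a sketch under loud assumptions: `simulate` is a hypothetical function, the machine issues up to `width` ready operations per cycle in program order, and there is no limit on the scheduling window (real hardware has finite RS and ROB capacity). It is not how TMA counts slots; it only shows how a dependency chain leaves issue slots empty while independent work fills them.

```python
def simulate(deps, latency, width=4):
    """Toy issue model: each cycle, issue up to `width` ops whose inputs are ready.

    deps[i]    -- indices of earlier ops that op i depends on
    latency[i] -- cycles until the result of op i is available (>= 1)
    Returns (total_cycles, fraction_of_issue_slots_used).
    """
    n = len(deps)
    done_at = [None] * n          # cycle at which each op's result is ready
    issued, cycle = 0, 0
    while issued < n:
        slots = 0
        for i in range(n):        # oldest-first scan over not-yet-issued ops
            if slots == width:
                break
            if done_at[i] is None and all(
                done_at[d] is not None and done_at[d] <= cycle for d in deps[i]
            ):
                done_at[i] = cycle + latency[i]
                slots += 1
                issued += 1
        cycle += 1
    total = max(done_at)
    return total, n / (total * width)

N = 79  # ops following the long-latency instruction (arbitrary choice)

# Program X: a 20-cycle op, then a chain where each op needs the previous one.
x_cycles, x_util = simulate([[]] + [[i - 1] for i in range(1, N + 1)],
                            [20] + [1] * N)

# Program Y: the same 20-cycle op, followed by fully independent ops.
y_cycles, y_util = simulate([[] for _ in range(N + 1)], [20] + [1] * N)

print(f"X: {x_cycles} cycles, {x_util:.0%} of slots used")
print(f"Y: {y_cycles} cycles, {y_util:.0%} of slots used")
```

In this model both programs contain the same long-latency instruction, but only program X leaves most issue slots unused while waiting for it.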

aayasin commented 2 years ago

The program with instruction X might be either Memory Bound or Core Bound, depending on whether the RS (reservation station) got drained while the memory load from X was pending. This is key to the TMA split of Backend Bound.

There are multiple counters. Which pipeline stage are you interested in?