ibogosavljevic opened this issue 2 years ago
Hello,
The cost of instruction dependency chains is included within the Core Bound metric, per its description:
$ ./pmu-tools/toplev.py --describe Core_Bound
Backend_Bound.Core_Bound
This metric represents fraction of slots where Core non-
memory issues were of a bottleneck. Shortage in hardware
compute resources; or dependencies in software's
instructions are both categorized under Core Bound. Hence it
may indicate the machine ran out of an out-of-order
resource; certain execution units are overloaded or
dependencies in program's data- or instruction-flow are
limiting the performance (e.g. FP-chained long-latency
arithmetic operations).
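To make the dependency case concrete, here is an illustrative sketch of my own (names invented, not code from this issue or from pmu-tools) of the "FP-chained long-latency arithmetic operations" the description mentions:

```c
/* Illustrative sketch only: a loop whose floating-point adds form one long
 * dependency chain. Each add must wait for the previous result, so execution
 * ports sit idle and the stall is charged to Backend_Bound.Core_Bound rather
 * than Memory_Bound (the working set below is small enough to stay in cache).
 * Build without -ffast-math so the compiler cannot reassociate the adds and
 * break the chain. */
#include <stdio.h>

#define N (1 << 15)                 /* 32 K doubles = 256 KiB, cache-resident */

static double a[N];

static double chained_sum(const double *x, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += x[i];                  /* iteration i+1 depends on the add of iteration i */
    return s;
}

int main(void) {
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    double s = 0.0;
    for (int r = 0; r < 10000; r++) /* repeat so the run is long enough to profile */
        s += chained_sum(a, N);
    printf("%f\n", s);
    return 0;
}
```

Profiling something like this should show Backend_Bound.Core_Bound high and the ILP info metric low.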
Here is a demo using run.sh. By default it runs some Python code, which you can see has Core Bound at ~12% of slots.
$ ./do.py profile -pm 20 -v2
topdown 2-levels 3 runs ..
# 4.3-full-perf on 11th Gen Intel(R) Core(TM) i7-11700B @ 3.20GHz [tgl/icelake]
FE Frontend_Bound % Slots 4.2 < [24.5%] +- 0.3
BAD Bad_Speculation % Slots 4.9 < [24.5%] +- 6.0
BE Backend_Bound % Slots 13.5 < [18.1%] +- 0.9
RET Retiring % Slots 77.4 [52.0%] +- 5.9
Info.Core CoreIPC Core_Metric 3.99 [24.1%] +- 0.2
Info.Inst_Mix Instructions Count 10,091,398,631 [24.1%] +- 474,295,735.7
Info.Inst_Mix IpTB Inst_Metric 14.45 [24.1%] +- 1.0
FE Frontend_Bound.Fetch_Bandwidth % Slots 3.1 < [24.5%] +- 0.3
BAD Bad_Speculation.Machine_Clears % Slots 3.7 < [24.5%] +- 6.2
******* BE/Core Backend_Bound.Core_Bound % Slots 11.7 < [18.1%] +- 0.9
RET Retiring.Light_Operations % Slots 77.3 [24.5%] +- 5.9<==
Info.Thread IPC Metric 3.99 [24.1%] +- 0.2
Info.System CPU_Utilization Metric 0.77 [24.1%] +- 0.0
Info.System Time Seconds 0.68 +- 0.0
******* Info.Core ILP Core_Metric 9.08 [24.1%] +- 0.5
Info.Core IpMispredict Inst_Metric 9,804.5 [24.1%] +- 4,077.3
Info.Core CORE_CLKS Count 2,526,853,303 [24.1%] +- 103,600,985.4
MUX % 18.12
Now this is high-IPC code, so it is not the data-dependency case you are after. The ILP metric is helpful, though, for attributing Core Bound to one of the two documented cases.
You can tweak that to run your app and then share the output using do.py tar if you would like help with the analysis.
In one program, an instruction X is stuck and nothing else can go through until X is unstuck, e.g. when a piece of data arrives from memory or a computational resource becomes available. The pipeline is full of instructions in various stages of execution, but nothing can go forward because X has not completed.
But in another program, instruction Y is also stuck for the same reason, yet the CPU keeps executing instructions because there is enough ILP.
Both programs will have a high number in the corresponding Core Bound metric.
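To make this concrete, here is a hedged sketch of the two programs (function names invented for illustration; this is not my actual code):

```c
/* Sketch of the two cases. In walk_list() every load address depends on the
 * previous load, so when one load is stuck nothing behind it can execute.
 * In sum_array() the loads are independent, so the out-of-order core keeps
 * many of them in flight and useful work continues around a stalled one.
 * (In a real experiment the nodes would be placed in shuffled order so the
 * list loads actually miss in the cache.) */
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; long val; };

/* Program with "instruction X": a serial load chain. */
long walk_list(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->val;
        p = p->next;        /* the next load cannot start until this one finishes */
    }
    return s;
}

/* Program with "instruction Y": independent loads in four separate chains. */
long sum_array(const long *a, long n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];         /* these chains do not depend on one another */
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    enum { N = 1 << 20 };
    struct node *nodes = calloc(N, sizeof *nodes);
    long *arr = calloc(N, sizeof *arr);
    if (!nodes || !arr) return 1;
    for (long i = 0; i < N - 1; i++) nodes[i].next = &nodes[i + 1];
    for (long i = 0; i < N; i++) { nodes[i].val = i; arr[i] = i; }
    printf("%ld %ld\n", walk_list(nodes), sum_array(arr, N));
    free(nodes);
    free(arr);
    return 0;
}
```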
Now, is it possible to read some CPU counters to figure out what percentage of cycles/slots the CPU was idle because it didn't have anything to do?
The program with instruction X might be either Memory Bound or Core Bound, depending on whether the RS got drained while the memory load from X was pending. This is the key to the TMA split of Backend Bound.
There are multiple counters. Which pipeline stage are you after in your counter quest?
Hi!
Is there a way to detect wasted slots due to instruction dependency chains in my code? The code is high on data cache misses, but to make things worse, there are loop-carried dependencies and little available instruction-level parallelism. There are two approaches to fixing this: decrease data cache misses or increase the available ILP. How can I detect this?
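For concreteness, here is a hedged sketch (invented names, not my actual code) of what I mean by "increase available ILP": keep the same number of misses but overlap them by interleaving independent dependency chains.

```c
/* walk_one() has a loop-carried dependency through p->next, so every cache
 * miss is fully exposed. walk_two() traverses two independent lists in the
 * same loop; their misses can overlap in the out-of-order core, which raises
 * the available ILP/MLP without changing the miss count. */
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; long val; };

/* One loop-carried chain: misses are serialized. */
long walk_one(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->val;
        p = p->next;
    }
    return s;
}

/* Two independent chains interleaved: their misses overlap. */
long walk_two(const struct node *a, const struct node *b) {
    long s = 0;
    while (a && b) {
        s += a->val + b->val;
        a = a->next;        /* independent of the update of b below */
        b = b->next;
    }
    while (a) { s += a->val; a = a->next; }
    while (b) { s += b->val; b = b->next; }
    return s;
}

int main(void) {
    enum { N = 1 << 16 };
    struct node *x = calloc(N, sizeof *x);
    struct node *y = calloc(N, sizeof *y);
    if (!x || !y) return 1;
    for (long i = 0; i < N - 1; i++) { x[i].next = &x[i + 1]; y[i].next = &y[i + 1]; }
    for (long i = 0; i < N; i++) { x[i].val = 1; y[i].val = 2; }
    printf("%ld %ld\n", walk_one(x) + walk_one(y), walk_two(x, y));
    free(x);
    free(y);
    return 0;
}
```

My question is whether there is a counter-based way to tell which of the two fixes would pay off.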