andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0

TMA: Ports_Utilized_0 is overestimated for FE-bound tests #422

Open aayasin opened 2 years ago

aayasin commented 2 years ago

The Ports_Utilized_0 node accounts for cases where zero ports are utilized while the bottleneck is Backend_Bound.Core_Bound. For front-end starved code patterns, this metric overcounts, since there are no uops to execute in the first place. This bug was re-introduced with the Pause-loop fix in TMA 4.2. Below is a reproducer: a kernel for DSB misses, built and profiled using perf-tools.

$ ./do.py build profile -a dsb-jmp -g "jumpy-seq -a3 -n30 -i 'add %rax,%rcx' JMP" -ki 120e6 -pm 20 -m '+Core_Bound*/6' -v2
building kernel: dsb-jmp ..
/usr/bin/python ./kernels/gen-kernel.py jumpy-seq -a3 -n30 -i 'add %rax,%rcx' JMP > ./kernels/dsb-jmp.c
gcc -O2 -g -o ./kernels/dsb-jmp ./kernels/dsb-jmp.c 2>&1
topdown 2-levels ..
./pmu-tools/toplev.py --no-desc -vl2 --nodes '+CoreIPC,+Instructions,+CORE_CLKS,+CPU_Utilization,+Time,+MUX,+Core_Bound*/6' \
-- taskset 0x4 ./kernels/dsb-jmp 120000000 2>&1 | tee ... 
# 4.3-full-perf on 11th Gen Intel(R) Core(TM) i7-11700B @ 3.20GHz [tgl/icelake]
FE             Frontend_Bound                                                                                % Slots                         72.2    [13.8%]
RET            Retiring                                                                                      % Slots                         27.5  < [18.0%]
Info.Core      CoreIPC                                                                                         Core_Metric                    1.40   [13.8%]
Info.Inst_Mix  Instructions                                                                                    Count              7,408,337,763      [13.8%]
FE             Frontend_Bound.Fetch_Latency                                                                  % Slots                         45.5    [13.8%]<==
FE             Frontend_Bound.Fetch_Bandwidth                                                                % Slots                         26.6  < [13.8%]
RET            Retiring.Light_Operations                                                                     % Slots                         27.4  < [18.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization                                                    % Clocks                        94.4  < [18.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0                                   % Clocks                        33.1  < [36.1%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_1                                   % Clocks                        60.1  < [32.1%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                                   % Clocks                         4.2  < [32.1%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization               % Core_Execution                17.8  < [ 9.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_0        % Core_Clocks                   12.7  < [ 9.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_1        % Core_Clocks                   16.9  < [ 9.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_5        % Core_Clocks                   18.7  < [ 9.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_6        % Core_Clocks                   22.8  < [ 9.0%]
Info.Thread    IPC                                                                                             Metric                         1.40   [13.8%]
Info.System    CPU_Utilization                                                                                 Metric                         1.00   [13.8%]
Info.System    Time                                                                                            Seconds                        1.11  
Info.Core      CORE_CLKS                                                                                       Count              5,294,650,840      [13.8%]
MUX                                                                                                          %                                9.02  
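
To make the symptom easier to spot in saved logs, here is a minimal Python sketch that parses toplev-style output lines like the ones above and flags a run where Ports_Utilized_0 is sizable even though Frontend_Bound dominates. It is only an illustration, not part of pmu-tools or the TMA fix: the function names `parse_toplev` and `suspect_ports_utilized_0` and both thresholds are hypothetical, and the parser assumes the column layout shown in this report.

```python
# Hypothetical helper: not part of pmu-tools. Assumes toplev rows of the form
#   <area> <node> <unit words...> <value> [<] [<mux%>]
SAMPLE = """\
FE       Frontend_Bound                % Slots   72.2    [13.8%]
RET      Retiring                      % Slots   27.5  < [18.0%]
FE       Frontend_Bound.Fetch_Latency  % Slots   45.5    [13.8%]<==
BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0  % Clocks  33.1  < [36.1%]
Info.Core  CORE_CLKS                   Count     5,294,650,840      [13.8%]
"""

PU0 = "Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0"


def parse_toplev(text):
    """Map node name -> first numeric field that follows it on the row."""
    values = {}
    for line in text.splitlines():
        tokens = line.split()
        # Skip comments, short lines, and rows whose second field is not a name.
        if line.startswith("#") or len(tokens) < 3 or not tokens[1][0].isalpha():
            continue
        node = tokens[1]
        for tok in tokens[2:]:
            try:
                values[node] = float(tok.replace(",", ""))
                break
            except ValueError:
                continue  # unit words such as "%" or "Slots"
    return values


def suspect_ports_utilized_0(values, fe_threshold=50.0, pu0_threshold=10.0):
    """Heuristic with assumed thresholds: FE-bound run with large Ports_Utilized_0."""
    fe = values.get("Frontend_Bound", 0.0)
    pu0 = values.get(PU0, 0.0)
    return fe > fe_threshold and pu0 > pu0_threshold


if __name__ == "__main__":
    vals = parse_toplev(SAMPLE)
    print(vals["Frontend_Bound"], vals[PU0], suspect_ports_utilized_0(vals))
```

With the numbers from this report (Frontend_Bound at 72.2% of slots, Ports_Utilized_0 at 33.1% of clocks), the check flags the run. The thresholds would need tuning for other workloads.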

andikleen commented 2 years ago

Okay, I assume you will fix that?

aayasin commented 2 years ago

Yep. Please tag it with 'TMA', as I cannot do it myself.

aayasin commented 2 years ago

@andikleen reminder on this one

aayasin commented 1 year ago

I tested this with TMA 4.4 on TGL and the issue is still there. I'll try again once 4.5 is released.