andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0

Info.Bottlenecks should not schedule events when full TMA tree is collected #449

Closed aayasin closed 1 year ago

aayasin commented 1 year ago

In a toplev command that collects the full TMA tree, e.g.:

$ toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Memory_Bandwidth,+Memory_Latency,+Memory_Data_TLBs' -V m9b8IZ-x256-n8448-u01llv.toplev-vl6-perf.csv --frequency --metric-group +Summary

The events required by the Info.Bottleneck metrics should not be scheduled on their own, since the tree itself already collects them. Trying to fit them into the same group isn't useful either, because even the smallest such metric requires too many counters by itself. As a result, toplev schedules at least 2x the needed events, which leads to a poor multiplexing rate, as shown by this profile on ICX. Most nodes are sampled for only [1.8%] of the time!

Using multiply kernel: multiply9  binary: ./workloads/mmm/m9b8IZ-x256-n8448-u01.llv
Execution time = 6.260 seconds
# 4.5-full-perf on Genuine Intel(R) CPU $0000%@ [icx/icelake]
FE             Frontend_Bound                                                                                % Slots                           18.4    [ 1.8%]
BAD            Bad_Speculation                                                                               % Slots                            7.7  < [ 1.8%]
BE             Backend_Bound                                                                                 % Slots                           19.5  < [ 1.8%]
RET            Retiring                                                                                      % Slots                           54.2  < [ 1.8%]
Info.Thread    SLOTS                                                                                           Count            1,035,193,342,312      [ 3.8%]
Info.Inst_Mix  Instructions                                                                                    Count              442,867,252,948      [ 3.8%]
FE             Frontend_Bound.Fetch_Latency                                                                  % Slots                            5.1  < [ 1.8%]
FE             Frontend_Bound.Fetch_Bandwidth                                                                % Slots                           13.4    [ 1.8%]
Info.Bottleneck Big_Code                                                                                       Scaled_Slots                     0.22   [ 3.9%]
Info.Bottleneck Instruction_Fetch_BW                                                                           Scaled_Slots                    15.17   [ 1.8%]
BAD            Bad_Speculation.Branch_Mispredicts                                                            % Slots                            7.5  < [ 1.8%]
BAD            Bad_Speculation.Machine_Clears                                                                % Slots                            0.0  < [ 1.8%]
BE/Mem         Backend_Bound.Memory_Bound                                                                    % Slots                            5.3  < [ 1.8%]
BE/Core        Backend_Bound.Core_Bound                                                                      % Slots                           13.3  < [ 1.8%]
RET            Retiring.Light_Operations                                                                     % Slots                           47.9  < [ 1.8%]
RET            Retiring.Heavy_Operations                                                                     % Slots                            6.4  < [ 1.8%]
FE             Frontend_Bound.Fetch_Latency.ICache_Misses                                                    % Clocks                           0.0  < [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.ITLB_Misses                                                      % Clocks                           0.2  < [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers.Unknown_Branches                                 % Clocks                           0.1  < [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers                                                  % Clocks_est                       1.8  < [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.DSB_Switches                                                     % Clocks                           0.2    [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.LCP                                                              % Clocks                           0.0    [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.MS_Switches                                                      % Clocks                           9.7    [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers.Mispredicts_Resteers                             % Clocks                           2.7  < [ 2.9%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers.Clears_Resteers                                  % Clocks                           0.0  < [ 2.5%]
Info.Bottleneck Mispredictions                                                                                 Scaled_Slots                     9.19   [ 1.8%]
FE             Frontend_Bound.Fetch_Bandwidth.MITE                                                           % Slots_est                        0.4  < [ 2.9%]
FE             Frontend_Bound.Fetch_Bandwidth.MITE.Decoder0_Alone                                            % Slots_est                        0.1  < [ 2.4%]
FE             Frontend_Bound.Fetch_Bandwidth.MITE.MITE_4wide                                                % Core_Clocks                      0.2  < [ 2.5%]
Info.Botlnk.L2 DSB_Misses                                                                                      Scaled_Slots                     0.42   [ 1.8%]
FE             Frontend_Bound.Fetch_Bandwidth.DSB                                                            % Slots_est                       16.5    [ 2.9%]<==
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound                                                           % Stalls                           3.0  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound                                                           % Stalls                           0.8  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound                                                           % Stalls                           1.0  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound                                                         % Stalls                           0.7  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.PMM_Bound                                                          % Stalls                           0.0  <
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound                                                        % Stalls                           0.0  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.DTLB_Load                                                 % Clocks_est                      19.3  < [ 5.8%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.DTLB_Load.Load_STLB_Hit                                   % Clocks_est                      14.4  < [ 5.8%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.DTLB_Load.Load_STLB_Miss                                  % Clocks_calc                      5.0  < [ 5.8%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.DTLB_Store                                             % Clocks_est                       0.6  < [ 4.5%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.DTLB_Store.Store_STLB_Hit                              % Clocks_est                       0.4  < [ 4.5%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.DTLB_Store.Store_STLB_Miss                             % Clocks_calc                      0.3  < [ 4.5%]
Info.Bottleneck Memory_Data_TLBs                                                                               Scaled_Slots                     2.60   [ 1.8%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.Store_Fwd_Blk                                             % Clocks_est                       0.0  < [ 5.5%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.Lock_Latency                                              % Clocks                           0.0  < [ 5.0%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound.Contested_Accesses                                        % Clocks_est                      42.4  < [ 5.5%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound.Data_Sharing                                              % Clocks_est                       3.1  < [ 5.5%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound.SQ_Full                                                   % Clocks                           0.5  < [ 7.0%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound.MEM_Bandwidth                                           % Clocks                           3.0  < [ 8.1%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound.MEM_Latency                                             % Clocks                           8.2  < [ 8.1%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound.MEM_Latency.Remote_Cache                                % Clocks_est                      11.6  < [ 2.1%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.Store_Latency                                          % Clocks_est                       0.6  < [ 5.0%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.False_Sharing                                          % Clocks_est                       0.6  < [ 4.1%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.Streaming_Stores                                       % Clocks_est                       0.0  < [ 4.1%]
Info.Bottleneck Memory_Bandwidth                                                                               Scaled_Slots                     0.08   [ 1.8%]
Info.Bottleneck Memory_Latency                                                                                 Scaled_Slots                     2.12   [ 1.8%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.Split_Loads                                               % Clocks_calc                      0.0  < [ 5.0%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.4K_Aliasing                                               % Clocks_est                       0.9  < [ 5.5%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound.FB_Full                                                   % Clocks_calc                      0.8  < [ 6.6%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound.L3_Hit_Latency                                            % Clocks_est                     100.0  <
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound.MEM_Latency.Local_DRAM                                  % Clocks_est                      42.5  < [ 2.2%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound.MEM_Latency.Remote_DRAM                                 % Clocks_est                     100.0  <
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound.Split_Stores                                           % Core_Utilization                 0.0  < [ 4.5%]
BE/Core        Backend_Bound.Core_Bound.Divider                                                              % Clocks                           0.0  < [ 3.1%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization                                                    % Clocks                          34.7  < [ 1.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0                                   % Clocks                          17.6  < [ 2.4%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation             % Clocks                          23.8  < [ 2.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_1                                   % Clocks                           8.3  < [ 2.6%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                                   % Clocks                          15.2  < [ 2.6%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m                                  % Clocks                          61.0  < [ 2.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation.Slow_Pause  % Clocks                           7.1  < [ 3.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Mixing_Vectors                    % Clocks                         100.0
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization               % Core_Execution                  42.6  < [ 2.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_0        % Core_Clocks                     43.2  < [ 2.3%]
RET            Retiring.Light_Operations.FP_Arith.X87_Use                                                    % Uops                             0.0  < [ 1.8%]
RET            Retiring.Light_Operations.FP_Arith.FP_Scalar                                                  % Uops                             0.0  < [ 1.8%]
RET            Retiring.Light_Operations.FP_Arith.FP_Vector                                                  % Uops                            19.7  < [ 1.8%]
RET            Retiring.Light_Operations.FP_Arith.FP_Vector.FP_Vector_128b                                   % Uops                             0.0  < [ 1.8%]
RET            Retiring.Light_Operations.FP_Arith.FP_Vector.FP_Vector_256b                                   % Uops                            19.1  < [ 1.8%]
RET            Retiring.Light_Operations.FP_Arith.FP_Vector.FP_Vector_512b                                   % Uops                             0.0  < [ 1.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_1        % Core_Clocks                     45.2  < [ 2.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_5        % Core_Clocks                     29.8  < [ 2.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.ALU_Op_Utilization.Port_6        % Core_Clocks                     48.9  < [ 2.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.Load_Op_Utilization              % Core_Execution                  33.1  < [ 2.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m.Store_Op_Utilization             % Core_Execution                  17.5  < [ 2.1%]
RET            Retiring.Light_Operations.FP_Arith                                                            % Uops                            19.7  < [ 1.8%]
Info.System    CPU_Utilization                                                                                 Metric                           0.52   [ 3.8%]
RET            Retiring.Light_Operations.Memory_Operations                                                   % Slots                           22.9  < [ 1.8%]
RET            Retiring.Light_Operations.Branch_Instructions                                                 % Slots                            7.0  < [ 1.8%]
RET            Retiring.Light_Operations.Nop_Instructions                                                    % Slots                            0.1  < [ 1.8%]
RET            Retiring.Light_Operations.Other_Light_Ops                                                     % Slots                            0.0  <
Info.Thread    CLKS                                                                                            Count              207,432,001,750      [ 3.8%]
RET            Retiring.Heavy_Operations.Few_Uops_Instructions                                               % Slots                            5.5  < [ 1.8%]
RET            Retiring.Heavy_Operations.Microcode_Sequencer                                                 % Slots                            0.3  < [ 1.8%]
RET            Retiring.Heavy_Operations.Microcode_Sequencer.Assists                                         % Slots_est                        0.0  < [ 1.8%]
RET            Retiring.Heavy_Operations.Microcode_Sequencer.CISC                                            % Slots                            0.3  < [ 1.8%]
Info.Bottleneck Branching_Overhead                                                                             Scaled_Slots                     8.21   [ 1.9%]
Info.Thread    IPC                                                                                             Metric                           2.13   [ 3.8%]
Info.Botlnk.L0 Core_Bound_Likely                                                                               Metric                           0.00
Info.System    Time                                                                                            Seconds                          7.03
MUX                                                                                                          %                                  1.79
Frequency                                                                                                      CoreMetric                       0.68   [ 1.9%]

This enhancement should make toplev skip scheduling events for a metric when the full tree is collected and the metric's Key column in the TMA-metrics spreadsheet starts with "Info.Bot".
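
To make the proposed rule concrete, here is a minimal sketch of the filter. It is not toplev's actual scheduler code: the `Metric` class, the `metrics_needing_own_groups` function, the `full_tree` flag and the `EV_*` event names are all hypothetical placeholders. It assumes, as argued above, that the full tree already collects every event the Info.Bot* metrics need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    # Hypothetical stand-in for a metric entry; not a toplev class.
    key: str           # value of the "Key" column, e.g. "Info.Bottleneck"
    name: str          # metric name, e.g. "Mispredictions"
    events: frozenset  # placeholder event names, not real PMU events


def metrics_needing_own_groups(info_metrics, full_tree):
    """Return the Info metrics that still need their own event groups.

    When the full tree is collected, metrics whose key starts with
    "Info.Bot" are skipped, on the assumption that the tree nodes
    already measure all of their events.
    """
    keep = []
    for m in info_metrics:
        if full_tree and m.key.startswith("Info.Bot"):
            continue  # reuse the tree's counts instead of scheduling again
        keep.append(m)
    return keep


info = [
    Metric("Info.Bottleneck", "Mispredictions", frozenset({"EV_A", "EV_B"})),
    Metric("Info.Botlnk.L2", "DSB_Misses", frozenset({"EV_C"})),
    Metric("Info.Thread", "IPC", frozenset({"EV_D", "EV_E"})),
]

# Full tree: only IPC still needs its own group.
print([m.name for m in metrics_needing_own_groups(info, full_tree=True)])
# Partial tree: every metric keeps its own group.
print([m.name for m in metrics_needing_own_groups(info, full_tree=False)])
```

The prefix check covers all three key spellings seen in the profile above (Info.Bottleneck, Info.Botlnk.L0 and Info.Botlnk.L2), while other Info.* metrics such as Info.Thread are left untouched.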