andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0

toplev: Mixing_Vectors is incorrect for fp-mul-lat kernel #421

Closed aayasin closed 1 year ago

aayasin commented 2 years ago

Using a simple fp-mul-lat kernel, toplev seems to miscalculate the Mixing_Vectors metric value, whereas perf stat correctly reports (near-)zero counts for the underlying event. Below is a reproducer. A $-prefixed line is the command to run, with its output in the lines that follow it.

Clone and build the kernel

$ git clone --recurse-submodules https://github.com/aayasin/perf-tools
$ cd perf-tools/kernels
$ ./build.sh
$ cd ..

perf stat with the two events for the metric shows practically no counts for UOPS_ISSUED.VECTOR_WIDTH_MISMATCH (29, against about 1.3 billion for UOPS_ISSUED.ANY)

$ ./do.py profile -a './kernels/fp-mul-lat 8000000' -e r020e:UOPS_ISSUED.VECTOR_WIDTH_MISMATCH,r010e:UOPS_ISSUED.ANY -pm 42 -v3
per-app counting ..
perf stat -r3 -e "cpu-clock,context-switches,cpu-migrations,page-faults,instructions,cycles,ref-cycles,branches,branch-misses,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},cpu/event=0x0e,umask=0x02,name=UOPS_ISSUED.VECTOR_WIDTH_MISMATCH/,cpu/event=0x0e,umask=0x01,name=UOPS_ISSUED.ANY/" -- ./kernels/fp-mul-lat 8000000 2>&1 | tee fp-mul-lat-8000000.perf_stat-r3.log                                                                                                                                                                                  
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.

 Performance counter stats for './kernels/fp-mul-lat 8000000' (3 runs):

          2,244.16 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  2.57% )
                 2      context-switches          #    1.040 /sec                     ( +- 14.29% )
                 0      cpu-migrations            #    0.000 /sec                   
                53      page-faults               #   23.765 /sec                     ( +-  1.65% )
     1,307,042,884      instructions              #    0.26  insn per cycle           ( +-  0.00% )
     5,123,688,215      cycles                    #    2.283 GHz                      ( +-  0.01% )
     4,488,299,040      ref-cycles                #    2.000 G/sec                    ( +-  2.57% )
         8,536,789      branches                  #    3.804 M/sec                    ( +-  0.08% )
             9,378      branch-misses             #    0.11% of all branches          ( +-  0.34% )
    25,618,302,518      slots                     #   11.416 G/sec                    ( +-  0.01% )
     1,205,567,177      topdown-retiring          #      4.7% retiring                ( +-  0.01% )
       200,927,862      topdown-bad-spec          #      0.8% bad speculation         ( +-  0.01% )
           311,940      topdown-fe-bound          #      0.0% frontend bound          ( +-100.00% )
    24,211,807,478      topdown-be-bound          #     94.5% backend bound           ( +-  0.01% )
                29      UOPS_ISSUED.VECTOR_WIDTH_MISMATCH #   12.922 /sec                     ( +- 26.03% )
     1,300,768,002      UOPS_ISSUED.ANY           #  579.623 M/sec                    ( +-  0.01% )

            2.2461 +- 0.0577 seconds time elapsed  ( +-  2.57% )
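
As a quick sanity check on the counts above, the sketch below decodes the two raw event codes and computes the share of issued uops that were flagged as vector-width mismatches. It assumes the metric is driven by that ratio (the exact toplev formula is not shown in this issue) and that the raw codes follow the usual Intel core-PMU layout. The helper names `decode_raw_event` and `mismatch_ratio` are made up for illustration.

```python
def decode_raw_event(raw):
    """Split a perf raw code such as 'r020e' into event and umask fields.

    Assumption: bits 0-7 hold the event select and bits 8-15 hold the
    unit mask, as on Intel core PMUs.
    """
    hex_digits = raw[1:] if raw.startswith("r") else raw
    config = int(hex_digits, 16)
    return {"event": config & 0xFF, "umask": (config >> 8) & 0xFF}


def mismatch_ratio(mismatch_uops, issued_uops):
    """Share of issued uops counted as vector-width mismatches.

    Assumption: this ratio is what the metric is built on; it is not
    copied from the toplev model files.
    """
    return mismatch_uops / issued_uops if issued_uops else 0.0


# Counts taken from the perf stat output above.
ratio = mismatch_ratio(29, 1_300_768_002)

print(decode_raw_event("r020e"))  # UOPS_ISSUED.VECTOR_WIDTH_MISMATCH
print(decode_raw_event("r010e"))  # UOPS_ISSUED.ANY
print(f"mismatch share of issued uops: {ratio:.2e}")
```

With 29 mismatching uops out of roughly 1.3 billion issued, the share is on the order of 1e-8, so a correct metric value should be effectively 0%.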

toplev with auto-drilldown (--drilldown) incorrectly shows 100% Mixing_Vectors. Note that I can also reproduce it with "-vl6 --no-multiplex".

topdown auto-drilldown ..
/usr/bin/python ./pmu-tools/toplev.py --no-desc --drilldown --show-sample -l1 --nodes '+IPC,+Heavy_Operations,+Time' -V fp-mul-lat-8000000.toplev--drilldown-perf.csv --metric-group +Summary --perf -v -- ./kernels/fp-mul-lat 8000000 2>&1 | tee fp-mul-lat-8000000.toplev--drilldown.log | egrep -v "^(Run toplev|Add|Using|Sampling|perf record)"                                         
perf stat -x\; -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0x0,umask=0x3/,cpu/event=0xd,umask=0x10/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/},msr/tsc/,duration_time,{slots,topdown-be-bound,topdown-bad-spec,topdown-fe-bound,topdown-retiring}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
# 4.3-full-perf on Genuine Intel(R) CPU $0000%@ [icx/icelake]
FE             Frontend_Bound             % Slots                       0.0  <
BAD            Bad_Speculation            % Slots                       0.4  <
BE             Backend_Bound              % Slots                      94.9   <==
RET            Retiring                   % Slots                       4.7  <
Info.Inst_Mix  Instructions                 Count           1,306,199,080     
RET            Retiring.Heavy_Operations  % Slots                       0.0  <
Info.Thread    IPC                          Metric                      0.25  
Info.System    CPU_Utilization              Metric                      1.00  
Info.System    Time                         Seconds                     1.41  
MUX                                       %                           100.00  
Rerunning workload
perf stat -x\; -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/},duration_time,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
BE             Backend_Bound               % Slots                      94.5    [48.0%]
BE/Mem         Backend_Bound.Memory_Bound  % Slots                       0.0  < [48.0%]
BE/Core        Backend_Bound.Core_Bound    % Slots                      94.5    [48.0%]<==
RET            Retiring.Heavy_Operations   % Slots                       0.4  < [52.0%]
Info.Thread    IPC                           Metric                      0.26   [48.0%]
Info.System    Time                          Seconds                     2.31  
MUX                                        %                            47.95  
Rerunning workload
perf stat -x\; -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0x14,umask=0x9,cmask=1/},duration_time,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0x3c,umask=0x0/}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
BE             Backend_Bound                               % Slots                      94.5    [24.3%]
BE/Mem         Backend_Bound.Memory_Bound                  % Slots                       0.0  < [24.3%]
BE/Core        Backend_Bound.Core_Bound                    % Slots                      94.5    [24.3%]
RET            Retiring.Heavy_Operations                   % Slots                       0.3  < [52.2%]
BE/Core        Backend_Bound.Core_Bound.Divider            % Clocks                      0.0  < [75.7%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization  % Clocks                    100.0    [24.3%]<==
Info.Thread    IPC                                           Metric                      0.26   [24.3%]
Info.System    Time                                          Seconds                     1.44  
MUX                                                        %                            24.31  
Rerunning workload
perf stat -x\; -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},duration_time,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0x3c,umask=0x0/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
BE             Backend_Bound                                                 % Slots                      94.5    [21.2%]
BE/Mem         Backend_Bound.Memory_Bound                                    % Slots                       0.0  < [21.2%]
BE/Core        Backend_Bound.Core_Bound                                      % Slots                      94.5    [21.2%]
RET            Retiring.Heavy_Operations                                     % Slots                       0.3  < [48.9%]
BE/Core        Backend_Bound.Core_Bound.Divider                              % Clocks                      0.0  < [78.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization                    % Clocks                    100.0    [21.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0   % Clocks                     74.7    [21.2%]<==
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_1   % Clocks                     25.3    [78.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2   % Clocks                      0.0  < [78.8%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m  % Clocks                      0.0  < [21.2%]
Info.Thread    IPC                                                             Metric                      0.26   [21.2%]
Info.System    Time                                                            Seconds                     1.44  
MUX                                                                          %                            21.21  
Rerunning workload
perf stat -x\; -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},duration_time,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0x3c,umask=0x0/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/},{cpu/event=0xa2,umask=0x2/,cpu/event=0x3c,umask=0x0/,cpu/event=0xe,umask=0x2/,cpu/event=0xe,umask=0x1/}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.
BE             Backend_Bound                                                                      % Slots                      94.5    [17.3%]
BE/Mem         Backend_Bound.Memory_Bound                                                         % Slots                       0.0  < [17.3%]
BE/Core        Backend_Bound.Core_Bound                                                           % Slots                      94.5    [17.3%]
RET            Retiring.Heavy_Operations                                                          % Slots                       0.3  < [38.4%]
BE/Core        Backend_Bound.Core_Bound.Divider                                                   % Clocks                      0.0  < [63.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization                                         % Clocks                     99.9    [17.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0                        % Clocks                     74.6    [17.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_1                        % Clocks                     25.3    [63.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                        % Clocks                      0.0  < [63.2%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_3m                       % Clocks                      0.0  < [17.3%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation  % Clocks                      0.0  < [19.5%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Mixing_Vectors         % Clocks                    100.0   <==
Info.Thread    IPC                                                                                  Metric                      0.26   [17.3%]
Info.System    Time                                                                                 Seconds                     1.42  
MUX                                                                                               %                            17.34  

The actual perf stat command that toplev invokes in the previous step shows no counts for r020e (cpu/event=0xe,umask=0x2/)

$ perf stat -e '{cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0xb1,umask=0x1,cmask=3/},duration_time,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0x14,umask=0x9,cmask=1/,cpu/event=0x3c,umask=0x0/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/},{cpu/event=0xa2,umask=0x2/,cpu/event=0x3c,umask=0x0/,cpu/event=0xe,umask=0x2/,cpu/event=0xe,umask=0x1/}' ./kernels/fp-mul-lat 8000000
Reference: A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors. Ahmad Yasin, Jawad Haj-Yahya, Yosi Ben-Asher, Avi Mendelson. TACO 2019 and HiPEAC 2020.

 Performance counter stats for './kernels/fp-mul-lat 8000000':

     1,278,686,588      cpu/event=0xc0,umask=0x0/                                     (21.53%)
     5,005,738,491      cpu/event=0x3c,umask=0x0/                                     (21.53%)
            45,257      cpu/event=0xd,umask=0x1,cmask=1,edge=1/                                     (21.53%)
         1,036,203      cpu/event=0xa3,umask=0x14,cmask=20/                                     (21.53%)
           130,543      cpu/event=0xa6,umask=0x40,cmask=2/                                     (21.53%)
     3,737,271,923      cpu/event=0xa3,umask=0x4,cmask=4/                                     (21.53%)
     1,266,376,035      cpu/event=0xa6,umask=0x2/                                     (21.53%)
           618,631      cpu/event=0xa6,umask=0x4/                                     (21.53%)
            89,437      cpu/event=0x14,umask=0x9,cmask=1/                                     (21.53%)
         1,430,177      cpu/event=0xb1,umask=0x1,cmask=3/                                     (21.53%)
     2,196,682,992 ns   duration_time                                               
    25,301,512,188      slots                                                         (44.30%)
     1,190,659,395      topdown-retiring          #      4.7% retiring                (44.30%)
       198,443,230      topdown-bad-spec          #      0.8% bad speculation         (44.30%)
         2,568,571      topdown-fe-bound          #      0.0% frontend bound          (44.30%)
    23,912,409,558      topdown-be-bound          #     94.5% backend bound           (44.30%)
           387,465      cpu/event=0x56,umask=0x1/                                     (42.04%)
           201,009      cpu/event=0x56,umask=0x1,cmask=1/                                     (42.04%)
         2,554,181      cpu/event=0x79,umask=0x4/                                     (42.04%)
                 0      dummy                                                       
     1,296,246,089      cpu/event=0xe,umask=0x1/                                      (60.25%)
           767,027      cpu/event=0x79,umask=0x30/                                     (60.25%)
            82,659      cpu/event=0x14,umask=0x9,cmask=1/                                     (60.25%)
     5,108,836,255      cpu/event=0x3c,umask=0x0/                                     (60.25%)
     1,292,764,134      cpu/event=0xa6,umask=0x2/                                     (60.25%)
           247,762      cpu/event=0xa6,umask=0x4/                                     (60.25%)
         1,087,762      cpu/event=0xa2,umask=0x2/                                     (18.22%)
     5,310,062,901      cpu/event=0x3c,umask=0x0/                                     (18.22%)
                 0      cpu/event=0xe,umask=0x2/                                      (18.22%)
     1,346,732,454      cpu/event=0xe,umask=0x1/                                      (18.22%)

       2.196682992 seconds time elapsed

       2.195640000 seconds user
       0.000000000 seconds sys
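
The same cross-check can be scripted against the raw perf stat text. The following is a minimal sketch, assuming the human-readable (non `-x`) output layout shown above; `parse_perf_stat` and `SAMPLE` are illustrative names, and only the last event group from the output is reproduced here.

```python
import re

# Matches lines such as
# "     1,346,732,454      cpu/event=0xe,umask=0x1/      (18.22%)"
LINE_RE = re.compile(r"^\s*([\d,]+)\s+(cpu/\S+/)\s*(?:\(\s*([\d.]+)%\))?")


def parse_perf_stat(text):
    """Return (event, count, running_pct) tuples in output order."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            count = int(m.group(1).replace(",", ""))
            # No percentage means the event was counted the whole time.
            pct = float(m.group(3)) if m.group(3) else 100.0
            rows.append((m.group(2), count, pct))
    return rows


# Last event group of the perf stat output above.
SAMPLE = """\
         1,087,762      cpu/event=0xa2,umask=0x2/                                     (18.22%)
     5,310,062,901      cpu/event=0x3c,umask=0x0/                                     (18.22%)
                 0      cpu/event=0xe,umask=0x2/                                      (18.22%)
     1,346,732,454      cpu/event=0xe,umask=0x1/                                      (18.22%)
"""

rows = parse_perf_stat(SAMPLE)
# If an event appears in several groups, the dict keeps the last one.
counts = {event: count for event, count, _ in rows}
mismatch = counts["cpu/event=0xe,umask=0x2/"]
issued = counts["cpu/event=0xe,umask=0x1/"]
print(mismatch, issued, mismatch / issued)
```

Both events sit in the same group and were scheduled for the same 18.22% of the run, so the zero count is not an artifact of comparing events multiplexed at different times.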

andikleen commented 2 years ago

It's because of self.val = min(self.val, 1)

I guess that's a leftover from when the models were generated with that. I probably forgot some models when I fixed it up. Will regenerate them all.

andikleen commented 1 year ago

This should have been fixed with the TMA 4.5 update.