andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0

toplev -r3 does not always show drift in output #436

Open aayasin opened 2 years ago

aayasin commented 2 years ago

This one may have been lurking for a while. With -rN where N>1, toplev does not always print the +- drift. A reproducer on ICX follows, along with the relevant setup details.

Good run, where the drift is printed (for example, +- 0.5 next to Fetch_Latency):

$ time ./pmu-tools/toplev.py -r3 --no-desc --perf  -- taskset 0x4 ./CLTRAMP3D
perf stat -x\; -e '{cpu/slots/,cpu/topdown-be-bound/,cpu/topdown-bad-spec/,cpu/topdown-fe-bound/,cpu/topdown-retiring/},{cpu/event=0xd,umask=0x10/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/slots/,cpu/event=0x9c,umask=0x1,cmask=5/,cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/,cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},{cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/}' --percore-show-thread -r3 -- taskset 0x4 ./CLTRAMP3D
# 4.4-full-perf on Genuine Intel(R) CPU $0000%@ [icx/icelake]
FE             Frontend_Bound                      % Slots                      43.7   [23.9%] +-      0.8
BAD            Bad_Speculation                     % Slots                      16.6   [23.9%] +-      0.9
FE             Frontend_Bound.Fetch_Latency        % Slots                      28.5   [23.9%] +-      0.5<==
FE             Frontend_Bound.Fetch_Bandwidth      % Slots                      15.2   [23.9%] +-      0.9
BAD            Bad_Speculation.Branch_Mispredicts  % Slots                      16.1   [23.9%] +-      1.0
MUX                                                %                            23.91
Run toplev --describe Fetch_Latency^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Fetch_Latency*/3,+Frontend_Bound,+MUX' for breakdown.

real    0m14.000s
user    0m12.449s
sys     0m1.432s

Bad run, where the drift is omitted. The only difference is adding --frequency to the toplev command line:

admin1@icx-srv03:~/ayasin/perf-tools$ time ./pmu-tools/toplev.py -r3 --no-desc --perf --frequency -- taskset 0x4 ./CLTRAMP3D
Will measure complete system.
perf stat -x\; -e '{cycles,cpu/event=0x0,umask=0x3/,cpu/event=0xd,umask=0x10/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/slots/,cpu/event=0x9c,umask=0x1,cmask=5/,cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x40,cmask=2/,cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/},{cpu/slots/,cpu/topdown-be-bound/,cpu/topdown-bad-spec/,cpu/topdown-fe-bound/,cpu/topdown-retiring/},dummy,dummy,{cpu/event=0xd,umask=0x10/,cpu/event=0xc0,umask=0x0/,cpu/event=0x3c,umask=0x0/,cpu/event=0xd,umask=0x1,cmask=1,edge=1/,cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/,cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/},dummy,{cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/},{cpu/event=0x56,umask=0x1/,cpu/event=0x56,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x4/}' -A -a --percore-show-thread -r3 -- taskset 0x4 ./CLTRAMP3D
# 4.4-full-perf on Genuine Intel(R) CPU $0000%@ [icx/icelake]
S0-C2     Frequency                        CoreMetric                  1.73  [20.2%]
S0-C2-T0  FE             Frontend_Bound                  % Slots                      45.4   [20.7%]<==
S0-C2-T0  FE             Frontend_Bound.Fetch_Latency    % Slots                      24.0   [20.7%]
S0-C2-T0  FE             Frontend_Bound.Fetch_Bandwidth  % Slots                      21.4   [20.7%]
S0-C2-T0  MUX                                            %                            19.80
Run toplev --describe Frontend_Bound^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Idle CPUs 0-1,3-159 may have been hidden. Override with --idle-threshold 100

real    0m14.367s
user    0m12.757s
sys     0m1.678s
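
To spot the regression automatically, the two outputs above can be scanned for the trailing +- column. The following is a minimal sketch that assumes the plain-text toplev layout shown in this report; the helper names `drift_of` and `rows_missing_drift` are hypothetical and not part of pmu-tools.

```python
import re

# Matches a trailing "+-  <number>" column, optionally followed by
# toplev's "<==" bottleneck marker (layout assumed from the output above).
DRIFT_RE = re.compile(r"\+-\s*([0-9]+(?:\.[0-9]+)?)\s*(?:<==)?\s*$")


def drift_of(line):
    """Return the +- value on a toplev output line, or None if absent."""
    match = DRIFT_RE.search(line)
    return float(match.group(1)) if match else None


def rows_missing_drift(output):
    """Return the node names of '% Slots' rows that lack a +- column."""
    missing = []
    for line in output.splitlines():
        if "% Slots" not in line:
            continue  # skip headers, MUX rows, hints, timing lines
        if drift_of(line) is None:
            # The node name is the last token before the "% Slots" unit.
            missing.append(line.split("% Slots")[0].split()[-1])
    return missing
```

Feeding the bad run's output to `rows_missing_drift` would list Frontend_Bound and its two children, while the good run's output would yield an empty list.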

Setup details

$ cat setup-system.log | egrep -i 'version|PMU|Linux|PMU'
Linux icx-srv03 5.11.0 #1 SMP Wed Mar 3 17:33:12 IST 2021 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.04.4 LTS (Bionic Beaver)"
VERSION_ID="18.04"
[Tue Aug  2 17:44:16 2022] Performance Events: PEBS fmt4+-baseline,  AnyThread deprecated, Icelake events, 32-deep LBR, full-width counters, Intel PMU driver.
PMU: icelake
perf version 5.15.g8bb7eca972ad
python version: 2.7.17

andikleen commented 2 years ago

This seems to be a bug in perf: it doesn't output the variability in this case.

perf stat -e '{cpu/event=0x9c,umask=0x1/,cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xc2,umask=0x2/,cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x1,any=1/}' -A -a -r3 -- taskset 0x4 ./workloads/CLTRAMP3D

CPU0 30,569,205 cpu/event=0x9c,umask=0x1/
CPU1 9,104,879 cpu/event=0x9c,umask=0x1/
CPU2 33,943,285,980 cpu/event=0x9c,umask=0x1/
CPU3 1,801,127 cpu/event=0x9c,umask=0x1/
...

It works if I remove -a -A.
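
Until perf reports the variability in this mode, one possible workaround is to collect one value per run yourself and compute the spread by hand. The sketch below uses the standard error of the mean as a percentage of the mean, which is one common definition of run-to-run variability. It is not taken from perf or toplev, whose exact formula may differ, and the function name `relative_variability` is hypothetical.

```python
import math
import statistics


def relative_variability(samples):
    """Standard error of the mean, as a percentage of the mean.

    `samples` holds one measurement per run (e.g. a counter value or a
    metric percentage). This is an assumed definition, not perf's own.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.mean(samples)
    if mean == 0:
        return 0.0
    std_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return 100.0 * std_error / mean
```

For three runs, pass the three per-run values of a single node; identical values give 0.0, and a wider spread gives a larger percentage.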