andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0
1.97k stars 330 forks source link

toplev --run-sample: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp' #500

Open thetheodor opened 8 months ago

thetheodor commented 8 months ago

Hi, I'm trying to use toplev on a Raptor Lake system, if I use it with --run-sample it tries to use :pp events, e.g.,:

perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd0,umask=0x82,name=Store_Bound_MEM_INST_RETIRED_ALL_STORES,period=1000003/pp,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 ./run.sh

WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.
sudo /bin/dmesg | grep -i perf
[    0.167301] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline,  AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters, Intel PMU driver.
[    0.175202] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[12747.269426] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500

the problem seems to be:

perf stat -e cycles:pp sleep 1                                                                         

 Performance counter stats for 'sleep 1':

   <not supported>      cpu_core/cycles:pp/
   <not supported>      cpu_atom/cycles:pp/

       1.010279365 seconds time elapsed

       0.000000000 seconds user
       0.001284000 seconds sys
perf stat -e cycles sleep 1                                                                            

 Performance counter stats for 'sleep 1':

     <not counted>      cpu_core/cycles/                                                        (0.00%)
         1,486,577      cpu_atom/cycles/

       1.001472195 seconds time elapsed

       0.001138000 seconds user
       0.000000000 seconds sys

Kernel: 6.6.2 perf -v: 6.5

andikleen commented 8 months ago

Can you give the full toplev command line?

The PEBS events don't necessarily count in perf stat.

The problem seems to be this perf error message: WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'

cycles:pp should really work so that's some kind of upstream perf bug. Does a plain perf record -e cycles:pp ./run.sh

export HYPERVISOR=1 should work around it (will disable PEBS, but also some other features)

thetheodor commented 7 months ago

Thanks for the reply.

Can you give the full toplev command line?

~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 ./run.sh

Playing a bit more with it:

~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 sleep 1
# 4.7-full on Intel(R) Core(TM) i9-14900K [adl]
core FE               Frontend_Bound                                  % Slots                       43.4   [11.0%]
core BE               Backend_Bound                                   % Slots                       31.8   [22.0%]
core FE               Frontend_Bound.Fetch_Latency                    % Slots                       34.7   [22.0%]
core BAD              Bad_Speculation.Machine_Clears                  % Slots                        0.8   [11.0%]
core BE/Core          Backend_Bound.Core_Bound                        % Slots                       21.2   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.ICache_Misses      % Clocks                      17.1   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.ITLB_Misses        % Clocks                       6.5   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.Branch_Resteers    % Clocks                      32.4   [11.0%]<==
core FE               Frontend_Bound.Fetch_Latency.MS_Switches        % Clocks_est                  11.5   [11.0%]
core BE/Mem           Backend_Bound.Memory_Bound.L1_Bound             % Stalls                       5.9   [11.0%]
core BE/Core          Backend_Bound.Core_Bound.Serializing_Operation  % Clocks                      35.1   [11.0%]
core BE/Core          Backend_Bound.Core_Bound.Ports_Utilization      % Clocks                      45.2   [11.0%]
core RET              Retiring.Heavy_Operations.Microcode_Sequencer   % Slots                        2.7   [11.0%]
core MUX                                                              %                             11.00
Run toplev --describe Branch_Resteers^ to get more information on bottleneck for core
Add --nodes '!+Branch_Resteers*/4,+Frontend_Bound.Fetch_Latency,+Frontend_Bound,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/,cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa2,umask=0x2,name=Serializing_Operation_RESOURCE_STALLS_SCOREBOARD,period=100003/,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 sleep 1
WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.

Sampling failed

but if I remove the --core S0-C0 part it works:

~/pmu-tools/toplev.py  -l3 --run-sample --no-desc taskset -c 0 sleep 1
70 events not counted
# 4.7-full, 3.51 on Intel(R) Core(TM) i9-14900K [adl]
core FE               Frontend_Bound                              % Slots                       42.0
core BE               Backend_Bound                               % Slots                       26.3   [28.0%]
core FE               Frontend_Bound.Fetch_Latency                % Slots                       28.5   [28.0%]
core BE/Core          Backend_Bound.Core_Bound                    % Slots                       15.3   [28.0%]
core FE               Frontend_Bound.Fetch_Latency.ICache_Misses  % Clocks                      12.6   [75.0%]<==
core FE               Frontend_Bound.Fetch_Latency.ITLB_Misses    % Clocks                       6.0   [75.0%]
warning: 16 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider L1_Bound L2_Bound L3_Bound LSD MITE MS_Switches Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
atom FE               Frontend_Bound                              % Slots                       34.0   [28.0%]<==
atom FE               Frontend_Bound.Fetch_Latency                % Slots                       16.5   [28.0%]
atom FE               Frontend_Bound.Fetch_Bandwidth              % Slots                       17.5   [28.0%]
atom BAD              Bad_Speculation                             % Slots                       18.0   [28.0%]
atom BAD              Bad_Speculation.Branch_Mispredicts          % Slots                       17.4   [28.0%]
warning: 22 nodes had zero counts: Base Branch_Detect Branch_Resteer Cisc DRAM_Bound Decode FPDIV_uops Fast_Nuke ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MS_uops Machine_Clears Mem_Scheduler Memory_Bound Nuke Other_FB Other_Ret Predecode Store_Bound
Run toplev --describe ICache_Misses^ to get more information on bottleneck for core
Run toplev --describe Frontend_Bound^ to get more information on bottleneck for atom
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (7 samples) ]
Run `perf report' to show the sampling results
Sampling:
perf record -g -e cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (15 samples) ]
Run `perf report' to show the sampling results

(~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc sleep 1 also fails)

Does a plain perf record -e cycles:pp ./run.sh

Yes, it does. E.g.:

perf stat -e cycles:pp sleep 1

 Performance counter stats for 'sleep 1':

   <not supported>      cpu_core/cycles:pp/
   <not supported>      cpu_atom/cycles:pp/

       1.003134906 seconds time elapsed

       0.002847000 seconds user
       0.000000000 seconds sys

 perf record -e cycles:pp sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (7 samples) ]

perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7  of event 'cpu_atom/cycles:pp/'
# Event count (approx.): 5357041
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  .................................
#
    89.11%  sleep    [kernel.kallsyms]  [k] __get_user_8
    10.59%  sleep    [kernel.kallsyms]  [k] tlb_gather_mmu
     0.29%  perf-ex  [kernel.kallsyms]  [k] nmi_restore
     0.01%  perf-ex  [kernel.kallsyms]  [k] __intel_pmu_enable_all.isra.0
     0.00%  perf-ex  [kernel.kallsyms]  [k] native_write_msr

#
# (Tip: To add Node.js USDT(User-Level Statically Defined Tracing): perf buildid-cache --add `which node`)
thetheodor commented 7 months ago

but if I remove the --core S0-C0 part it works:

my guess is that the difference boils down to passing a -C 0 to perf. Without it everything seems to work fine.