andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0
2.03k stars 337 forks source link

toplev --run-sample: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp' #500

Open thetheodor opened 11 months ago

thetheodor commented 11 months ago

Hi, I'm trying to use toplev on a Raptor Lake system, if I use it with --run-sample it tries to use :pp events, e.g.,:

perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd0,umask=0x82,name=Store_Bound_MEM_INST_RETIRED_ALL_STORES,period=1000003/pp,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 ./run.sh

WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.
sudo /bin/dmesg | grep -i perf
[    0.167301] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline,  AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters, Intel PMU driver.
[    0.175202] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[12747.269426] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500

the problem seems to be:

perf stat -e cycles:pp sleep 1                                                                         

 Performance counter stats for 'sleep 1':

   <not supported>      cpu_core/cycles:pp/
   <not supported>      cpu_atom/cycles:pp/

       1.010279365 seconds time elapsed

       0.000000000 seconds user
       0.001284000 seconds sys
perf stat -e cycles sleep 1                                                                            

 Performance counter stats for 'sleep 1':

     <not counted>      cpu_core/cycles/                                                        (0.00%)
         1,486,577      cpu_atom/cycles/

       1.001472195 seconds time elapsed

       0.001138000 seconds user
       0.000000000 seconds sys

Kernel: 6.6.2 perf -v: 6.5

andikleen commented 11 months ago

Can you give the full toplev command line?

The PEBS events don't necessarily count in perf stat.

The problem seems to be this perf error message: WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'

cycles:pp should really work so that's some kind of upstream perf bug. Does a plain perf record -e cycles:pp ./run.sh

export HYPERVISOR=1 should work around it (will disable PEBS, but also some other features)

thetheodor commented 11 months ago

Thanks for the reply.

Can you give the full toplev command line?

~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 ./run.sh

Playing a bit more with it:

~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 sleep 1
# 4.7-full on Intel(R) Core(TM) i9-14900K [adl]
core FE               Frontend_Bound                                  % Slots                       43.4   [11.0%]
core BE               Backend_Bound                                   % Slots                       31.8   [22.0%]
core FE               Frontend_Bound.Fetch_Latency                    % Slots                       34.7   [22.0%]
core BAD              Bad_Speculation.Machine_Clears                  % Slots                        0.8   [11.0%]
core BE/Core          Backend_Bound.Core_Bound                        % Slots                       21.2   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.ICache_Misses      % Clocks                      17.1   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.ITLB_Misses        % Clocks                       6.5   [22.0%]
core FE               Frontend_Bound.Fetch_Latency.Branch_Resteers    % Clocks                      32.4   [11.0%]<==
core FE               Frontend_Bound.Fetch_Latency.MS_Switches        % Clocks_est                  11.5   [11.0%]
core BE/Mem           Backend_Bound.Memory_Bound.L1_Bound             % Stalls                       5.9   [11.0%]
core BE/Core          Backend_Bound.Core_Bound.Serializing_Operation  % Clocks                      35.1   [11.0%]
core BE/Core          Backend_Bound.Core_Bound.Ports_Utilization      % Clocks                      45.2   [11.0%]
core RET              Retiring.Heavy_Operations.Microcode_Sequencer   % Slots                        2.7   [11.0%]
core MUX                                                              %                             11.00
Run toplev --describe Branch_Resteers^ to get more information on bottleneck for core
Add --nodes '!+Branch_Resteers*/4,+Frontend_Bound.Fetch_Latency,+Frontend_Bound,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/,cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa2,umask=0x2,name=Serializing_Operation_RESOURCE_STALLS_SCOREBOARD,period=100003/,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 sleep 1
WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.

Sampling failed

but if I remove the --core S0-C0 part it works:

~/pmu-tools/toplev.py  -l3 --run-sample --no-desc taskset -c 0 sleep 1
70 events not counted
# 4.7-full, 3.51 on Intel(R) Core(TM) i9-14900K [adl]
core FE               Frontend_Bound                              % Slots                       42.0
core BE               Backend_Bound                               % Slots                       26.3   [28.0%]
core FE               Frontend_Bound.Fetch_Latency                % Slots                       28.5   [28.0%]
core BE/Core          Backend_Bound.Core_Bound                    % Slots                       15.3   [28.0%]
core FE               Frontend_Bound.Fetch_Latency.ICache_Misses  % Clocks                      12.6   [75.0%]<==
core FE               Frontend_Bound.Fetch_Latency.ITLB_Misses    % Clocks                       6.0   [75.0%]
warning: 16 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider L1_Bound L2_Bound L3_Bound LSD MITE MS_Switches Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
atom FE               Frontend_Bound                              % Slots                       34.0   [28.0%]<==
atom FE               Frontend_Bound.Fetch_Latency                % Slots                       16.5   [28.0%]
atom FE               Frontend_Bound.Fetch_Bandwidth              % Slots                       17.5   [28.0%]
atom BAD              Bad_Speculation                             % Slots                       18.0   [28.0%]
atom BAD              Bad_Speculation.Branch_Mispredicts          % Slots                       17.4   [28.0%]
warning: 22 nodes had zero counts: Base Branch_Detect Branch_Resteer Cisc DRAM_Bound Decode FPDIV_uops Fast_Nuke ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MS_uops Machine_Clears Mem_Scheduler Memory_Bound Nuke Other_FB Other_Ret Predecode Store_Bound
Run toplev --describe ICache_Misses^ to get more information on bottleneck for core
Run toplev --describe Frontend_Bound^ to get more information on bottleneck for atom
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (7 samples) ]
Run `perf report' to show the sampling results
Sampling:
perf record -g -e cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (15 samples) ]
Run `perf report' to show the sampling results

(~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc sleep 1 also fails)

Does a plain perf record -e cycles:pp ./run.sh

Yes, it does. E.g.:

perf stat -e cycles:pp sleep 1

 Performance counter stats for 'sleep 1':

   <not supported>      cpu_core/cycles:pp/
   <not supported>      cpu_atom/cycles:pp/

       1.003134906 seconds time elapsed

       0.002847000 seconds user
       0.000000000 seconds sys

 perf record -e cycles:pp sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (7 samples) ]

perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7  of event 'cpu_atom/cycles:pp/'
# Event count (approx.): 5357041
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  .................................
#
    89.11%  sleep    [kernel.kallsyms]  [k] __get_user_8
    10.59%  sleep    [kernel.kallsyms]  [k] tlb_gather_mmu
     0.29%  perf-ex  [kernel.kallsyms]  [k] nmi_restore
     0.01%  perf-ex  [kernel.kallsyms]  [k] __intel_pmu_enable_all.isra.0
     0.00%  perf-ex  [kernel.kallsyms]  [k] native_write_msr

#
# (Tip: To add Node.js USDT(User-Level Statically Defined Tracing): perf buildid-cache --add `which node`)
thetheodor commented 11 months ago

but if I remove the --core S0-C0 part it works:

my guess is that the difference boils down to passing a -C 0 to perf. Without it everything seems to work fine.