Open thetheodor opened 11 months ago
Can you give the full toplev command line?
The PEBS events don't necessarily count in perf stat.
The problem seems to be this perf error message: WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
cycles:pp should really work so that's some kind of upstream perf bug. Does a plain perf record -e cycles:pp ./run.sh
export HYPERVISOR=1 should work around it (will disable PEBS, but also some other features)
Thanks for the reply.
Can you give the full toplev command line?
~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 ./run.sh
Playing a bit more with it:
~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 sleep 1
# 4.7-full on Intel(R) Core(TM) i9-14900K [adl]
core FE Frontend_Bound % Slots 43.4 [11.0%]
core BE Backend_Bound % Slots 31.8 [22.0%]
core FE Frontend_Bound.Fetch_Latency % Slots 34.7 [22.0%]
core BAD Bad_Speculation.Machine_Clears % Slots 0.8 [11.0%]
core BE/Core Backend_Bound.Core_Bound % Slots 21.2 [22.0%]
core FE Frontend_Bound.Fetch_Latency.ICache_Misses % Clocks 17.1 [22.0%]
core FE Frontend_Bound.Fetch_Latency.ITLB_Misses % Clocks 6.5 [22.0%]
core FE Frontend_Bound.Fetch_Latency.Branch_Resteers % Clocks 32.4 [11.0%]<==
core FE Frontend_Bound.Fetch_Latency.MS_Switches % Clocks_est 11.5 [11.0%]
core BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 5.9 [11.0%]
core BE/Core Backend_Bound.Core_Bound.Serializing_Operation % Clocks 35.1 [11.0%]
core BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 45.2 [11.0%]
core RET Retiring.Heavy_Operations.Microcode_Sequencer % Slots 2.7 [11.0%]
core MUX % 11.00
Run toplev --describe Branch_Resteers^ to get more information on bottleneck for core
Add --nodes '!+Branch_Resteers*/4,+Frontend_Bound.Fetch_Latency,+Frontend_Bound,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/,cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa2,umask=0x2,name=Serializing_Operation_RESOURCE_STALLS_SCOREBOARD,period=100003/,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 sleep 1
WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.
Sampling failed
but if I remove the --core S0-C0
part it works:
~/pmu-tools/toplev.py -l3 --run-sample --no-desc taskset -c 0 sleep 1
70 events not counted
# 4.7-full, 3.51 on Intel(R) Core(TM) i9-14900K [adl]
core FE Frontend_Bound % Slots 42.0
core BE Backend_Bound % Slots 26.3 [28.0%]
core FE Frontend_Bound.Fetch_Latency % Slots 28.5 [28.0%]
core BE/Core Backend_Bound.Core_Bound % Slots 15.3 [28.0%]
core FE Frontend_Bound.Fetch_Latency.ICache_Misses % Clocks 12.6 [75.0%]<==
core FE Frontend_Bound.Fetch_Latency.ITLB_Misses % Clocks 6.0 [75.0%]
warning: 16 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider L1_Bound L2_Bound L3_Bound LSD MITE MS_Switches Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
atom FE Frontend_Bound % Slots 34.0 [28.0%]<==
atom FE Frontend_Bound.Fetch_Latency % Slots 16.5 [28.0%]
atom FE Frontend_Bound.Fetch_Bandwidth % Slots 17.5 [28.0%]
atom BAD Bad_Speculation % Slots 18.0 [28.0%]
atom BAD Bad_Speculation.Branch_Mispredicts % Slots 17.4 [28.0%]
warning: 22 nodes had zero counts: Base Branch_Detect Branch_Resteer Cisc DRAM_Bound Decode FPDIV_uops Fast_Nuke ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MS_uops Machine_Clears Mem_Scheduler Memory_Bound Nuke Other_FB Other_Ret Predecode Store_Bound
Run toplev --describe ICache_Misses^ to get more information on bottleneck for core
Run toplev --describe Frontend_Bound^ to get more information on bottleneck for atom
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (7 samples) ]
Run `perf report' to show the sampling results
Sampling:
perf record -g -e cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (15 samples) ]
Run `perf report' to show the sampling results
(~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc sleep 1
also fails)
Does a plain perf record -e cycles:pp ./run.sh
Yes, it does. E.g.:
perf stat -e cycles:pp sleep 1
Performance counter stats for 'sleep 1':
<not supported> cpu_core/cycles:pp/
<not supported> cpu_atom/cycles:pp/
1.003134906 seconds time elapsed
0.002847000 seconds user
0.000000000 seconds sys
perf record -e cycles:pp sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (7 samples) ]
perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7 of event 'cpu_atom/cycles:pp/'
# Event count (approx.): 5357041
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .................................
#
89.11% sleep [kernel.kallsyms] [k] __get_user_8
10.59% sleep [kernel.kallsyms] [k] tlb_gather_mmu
0.29% perf-ex [kernel.kallsyms] [k] nmi_restore
0.01% perf-ex [kernel.kallsyms] [k] __intel_pmu_enable_all.isra.0
0.00% perf-ex [kernel.kallsyms] [k] native_write_msr
#
# (Tip: To add Node.js USDT(User-Level Statically Defined Tracing): perf buildid-cache --add `which node`)
but if I remove the --core S0-C0 part it works:
my guess is that the difference boils down to passing a -C 0
to perf
. Without it everything seems to work fine.
Hi, I'm trying to use toplev on a Raptor Lake system, if I use it with
--run-sample
it tries to use:pp
events, e.g.,:the problem seems to be:
Kernel: 6.6.2 perf -v: 6.5