Open thetheodor opened 1 year ago
Can you give the full toplev command line?
The PEBS events don't necessarily count in perf stat.
The problem seems to be this perf error message: WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
cycles:pp should really work, so that's some kind of upstream perf bug. Does a plain perf record -e cycles:pp ./run.sh work?
export HYPERVISOR=1 should work around it (will disable PEBS, but also some other features)
Thanks for the reply.
Can you give the full toplev command line?
~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 ./run.sh
Playing a bit more with it:
~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc taskset -c 0 sleep 1
# 4.7-full on Intel(R) Core(TM) i9-14900K [adl]
core FE Frontend_Bound % Slots 43.4 [11.0%]
core BE Backend_Bound % Slots 31.8 [22.0%]
core FE Frontend_Bound.Fetch_Latency % Slots 34.7 [22.0%]
core BAD Bad_Speculation.Machine_Clears % Slots 0.8 [11.0%]
core BE/Core Backend_Bound.Core_Bound % Slots 21.2 [22.0%]
core FE Frontend_Bound.Fetch_Latency.ICache_Misses % Clocks 17.1 [22.0%]
core FE Frontend_Bound.Fetch_Latency.ITLB_Misses % Clocks 6.5 [22.0%]
core FE Frontend_Bound.Fetch_Latency.Branch_Resteers % Clocks 32.4 [11.0%]<==
core FE Frontend_Bound.Fetch_Latency.MS_Switches % Clocks_est 11.5 [11.0%]
core BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 5.9 [11.0%]
core BE/Core Backend_Bound.Core_Bound.Serializing_Operation % Clocks 35.1 [11.0%]
core BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 45.2 [11.0%]
core RET Retiring.Heavy_Operations.Microcode_Sequencer % Slots 2.7 [11.0%]
core MUX % 11.00
Run toplev --describe Branch_Resteers^ to get more information on bottleneck for core
Add --nodes '!+Branch_Resteers*/4,+Frontend_Bound.Fetch_Latency,+Frontend_Bound,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/,cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x8,name=MS_Switches_FRONTEND_RETIRED_MS_FLOWS,period=100007/,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xc3,umask=0x1,edge=1,cmask=1,name=Machine_Clears_MACHINE_CLEARS_COUNT,period=100003/,cpu_core/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_RETIRED_FB_HIT,period=100007/pp,cpu_core/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_RETIRED_L1_HIT,period=1000003/pp,cpu_core/event=0xa2,umask=0x2,name=Serializing_Operation_RESOURCE_STALLS_SCOREBOARD,period=100003/,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cpu_core/event=0xc2,umask=0x4,frontend=0x8,name=Microcode_Sequencer_UOPS_RETIRED_MS,period=2000003/,cycles:pp -o perf.data -C 0 taskset -c 0 sleep 1
WARNING: A requested CPU in '0' is not supported by PMU 'cpu_atom' (CPUs 8-23) for event 'cycles:pp'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:pp/).
/bin/dmesg | grep -i perf may provide additional information.
Sampling failed
But if I remove the --core S0-C0 part, it works:
~/pmu-tools/toplev.py -l3 --run-sample --no-desc taskset -c 0 sleep 1
70 events not counted
# 4.7-full, 3.51 on Intel(R) Core(TM) i9-14900K [adl]
core FE Frontend_Bound % Slots 42.0
core BE Backend_Bound % Slots 26.3 [28.0%]
core FE Frontend_Bound.Fetch_Latency % Slots 28.5 [28.0%]
core BE/Core Backend_Bound.Core_Bound % Slots 15.3 [28.0%]
core FE Frontend_Bound.Fetch_Latency.ICache_Misses % Clocks 12.6 [75.0%]<==
core FE Frontend_Bound.Fetch_Latency.ITLB_Misses % Clocks 6.0 [75.0%]
warning: 16 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider L1_Bound L2_Bound L3_Bound LSD MITE MS_Switches Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
atom FE Frontend_Bound % Slots 34.0 [28.0%]<==
atom FE Frontend_Bound.Fetch_Latency % Slots 16.5 [28.0%]
atom FE Frontend_Bound.Fetch_Bandwidth % Slots 17.5 [28.0%]
atom BAD Bad_Speculation % Slots 18.0 [28.0%]
atom BAD Bad_Speculation.Branch_Mispredicts % Slots 17.4 [28.0%]
warning: 22 nodes had zero counts: Base Branch_Detect Branch_Resteer Cisc DRAM_Bound Decode FPDIV_uops Fast_Nuke ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MS_uops Machine_Clears Mem_Scheduler Memory_Bound Nuke Other_FB Other_Ret Predecode Store_Bound
Run toplev --describe ICache_Misses^ to get more information on bottleneck for core
Run toplev --describe Frontend_Bound^ to get more information on bottleneck for atom
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Sampling:
perf record -g -e cpu_core/event=0xc6,umask=0x1,frontend=0x14,name=ITLB_Misses_FRONTEND_RETIRED_ITLB_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x12,name=ICache_Misses_FRONTEND_RETIRED_L1I_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x13,name=ICache_Misses_FRONTEND_RETIRED_L2_MISS,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x601006,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_16,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600406,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_4,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x600806,name=Fetch_Latency_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp,cpu_core/event=0xc6,umask=0x1,frontend=0x15,name=ITLB_Misses_FRONTEND_RETIRED_STLB_MISS,period=100007/pp,cpu_core/event=0xa4,umask=0x2,name=Backend_Bound_TOPDOWN_BACKEND_BOUND_SLOTS,period=10000003/,cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (7 samples) ]
Run `perf report' to show the sampling results
Sampling:
perf record -g -e cycles:pp -o perf.data taskset -c 0 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (15 samples) ]
Run `perf report' to show the sampling results
(~/pmu-tools/toplev.py --core S0-C0 -l3 --run-sample --no-desc sleep 1 also fails.)
Does a plain perf record -e cycles:pp ./run.sh
Yes, it does. E.g.:
perf stat -e cycles:pp sleep 1
Performance counter stats for 'sleep 1':
<not supported> cpu_core/cycles:pp/
<not supported> cpu_atom/cycles:pp/
1.003134906 seconds time elapsed
0.002847000 seconds user
0.000000000 seconds sys
perf record -e cycles:pp sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (7 samples) ]
perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 7 of event 'cpu_atom/cycles:pp/'
# Event count (approx.): 5357041
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .................................
#
89.11% sleep [kernel.kallsyms] [k] __get_user_8
10.59% sleep [kernel.kallsyms] [k] tlb_gather_mmu
0.29% perf-ex [kernel.kallsyms] [k] nmi_restore
0.01% perf-ex [kernel.kallsyms] [k] __intel_pmu_enable_all.isra.0
0.00% perf-ex [kernel.kallsyms] [k] native_write_msr
#
# (Tip: To add Node.js USDT(User-Level Statically Defined Tracing): perf buildid-cache --add `which node`)
but if I remove the --core S0-C0 part it works:
My guess is that the difference boils down to passing -C 0 to perf. Without it, everything seems to work fine.
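To make the -C 0 guess concrete, here is a minimal Python sketch of the check that the warning message describes: for each hybrid PMU, which of the requested CPUs fall outside the CPUs that PMU covers. This is not perf's or toplev's actual code. The helper names are hypothetical, the cpu_atom range (8-23) is copied from the warning, and the cpu_core range (0-7) is an assumption inferred from it.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Parse a kernel-style CPU list such as "0-7,16" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def uncovered_cpus(requested: str, pmu_cpus: dict[str, str]) -> dict[str, list[int]]:
    """Map each PMU to the requested CPUs it does not cover (PMUs with none are omitted)."""
    wanted = parse_cpu_list(requested)
    result: dict[str, list[int]] = {}
    for name, spec in pmu_cpus.items():
        missing = wanted - parse_cpu_list(spec)
        if missing:
            result[name] = sorted(missing)
    return result


# cpu_atom range comes from the warning text; the cpu_core range is an assumption.
PMU_CPUS = {"cpu_core": "0-7", "cpu_atom": "8-23"}

# Equivalent of passing -C 0: CPU 0 is outside cpu_atom, so the atom copy of
# cycles:pp has no valid CPU to open on.
print(uncovered_cpus("0", PMU_CPUS))
```

With these ranges, requesting CPU 0 reports only cpu_atom as not covering it, which matches the PMU named in the warning. On a live system the ranges could presumably be read from the PMU's cpus file in sysfs (for example under /sys/bus/event_source/devices/cpu_atom/) instead of being hard-coded; that path is also an assumption here.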
Hi, I'm trying to use toplev on a Raptor Lake system. If I use it with --run-sample, it tries to use :pp events, e.g.:
The problem seems to be:
Kernel: 6.6.2 perf -v: 6.5