andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0
1.99k stars 334 forks source link

division by zero errors - missing clock count etc. #81

Open milianw opened 7 years ago

milianw commented 7 years ago

On my machine, the total clock count is not measured. At the bottom of the toplev output, I also see that I get 11 division by zero errors: IPC CoreIPC CPI ..., which seems to indicate that the instruction / cycle count fails on my machine? This is odd, considering that cycles/instructions

┌milian@milian-kdab2:~
└$ cat /proc/cpuinfo | grep --max-count=1 CPU 
model name      : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
┌milian@milian-kdab2:~
└$ dmesg | grep -i pmu
[    0.124426] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
[    0.124811] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    1.264710] core: PMU erratum BJ122, BV98, HSD29 worked around, HT is on
[   10.193121] RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 655360 ms ovfl timer
[   10.193122] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[   10.193123] RAPL PMU: hw unit of domain package 2^-14 Joules
[   10.193123] RAPL PMU: hw unit of domain dram 2^-14 Joules
[   10.193124] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
┌milian@milian-kdab2:~
└$ dmesg | grep -i pebs
[    0.124426] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.

I then disabled the NMI watchdog and try to launch toplev:

┌milian@milian-kdab2:~
└$ toplev -l2 -S --metrics ~/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5 
Using level 2.
Nodes Backend_Bound Frontend_Latency Branch_Misprediction_Cost
Memory_Bound Frontend_Bandwidth Core_Bound ILP IPL IPS Frontend_Bound
Load_Miss_Real_Latency have errata HSD135 HSD30, HSM31 HSM30 HSD29,
HSM30
perf stat -x\; -e '{cpu/event=0xd0,umask=0x81/,cpu/event=0xc4,umask=0x20/,cpu/event=0xc2,umask=0x2/,ref-cycles,cpu/event=0xd0,umask=0x82/,cycles,instructions},{cpu/event=0x51,umask=0x1/,cpu/event=0xc4,umask=0x0/,cpu/event=0xf1,umask=0x7/,cpu/event=0xc4,umask=0x2/,instructions},uncore_arb/event=0x80,umask=0x1,cmask=1/,{msr/tsc/,cpu/event=0x0,umask=0x3/k,cycles,cpu/event=0x3c,umask=0x0,in_tx=1/,ref-cycles},{cpu/event=0xb1,umask=0x2,cmask=1/,cpu/event=0xb1,umask=0x2/,cpu/event=0xc4,umask=0x20/,cpu/event=0xc4,umask=0x0/},{cpu/event=0x3c,umask=0x0,in_tx_cp=1/,cpu/event=0x3c,umask=0x0,in_tx=1/,cycles},{cpu/event=0xd1,umask=0x40/,cpu/event=0xd1,umask=0x8/,cpu/event=0x48,umask=0x1/},{cpu/event=0x8,umask=0x10/,cpu/event=0x49,umask=0x10/,cycles,cpu/event=0x85,umask=0x10/},{cpu/event=0x80,umask=0x1/,cpu/event=0x80,umask=0x2/,instructions,cpu/event=0x79,umask=0x4/,cpu/event=0xc2,umask=0x2/},{cpu/event=0x79,umask=0x8/,cpu/event=0x79,umask=0x30/,cpu/event=0x79,umask=0x4/,cpu/event=0xa8,umask=0x1/},{cpu/event=0x80,umask=0x4/,cpu/event=0x85,umask=0x60/,cpu/event=0x5e,umask=0x1/,cpu/event=0x85,umask=0x10/},cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1/,uncore_arb/event=0x81,umask=0x1/,uncore_arb/event=0x84,umask=0x1/,uncore_arb/event=0x80,umask=0x1/,cpu/event=0x48,umask=0x1,cmask=1/,{cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cycles,cpu/event=0xc2,umask=0x2/},{cpu/event=0xc5,umask=0x0/,cycles,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/,cpu/event=0x9c,umask=0x1,cmask=4/},{cpu/event=0xc5,umask=0x0/,cpu/event=0xe6,umask=0x1f/,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},{cpu/event=0xab,umask=0x2/,cpu/event=0x87,umask=0x1/,cycles,cpu/event=0x79,umask=0x30,edge=1,cmask=1/},{cpu/event=0xc2,umask=0x2/,cpu/event=0xe,umask=0x1/,cpu/event=0xd,umask=0x3,cmask=1/,cpu/event=0x9c,umask=0x1/,cycles},{instructions,cycles,cpu/event=0x9c,umask=0x1/,cpu/event=0x9c,umask=0x1,cmask=4/},{cpu/event=0xe,umask=0x1/,cycles,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa3,umask=0x4,cmask=4/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0xb1,umask=0x2,cmask=1/},{cpu/event=0xb1,umask=0x2,cmask=2/,cpu/event=0xb1,umask=0x2,cmask=3/,cpu/event=0x9c,umask=0x1,cmask=4/,cpu/event=0x5e,umask=0x1/,instructions}' /home/milian/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5
5 events not supported, 10 events not counted
BE             Backend_Bound:                      42.12 % Slots     [  4.85%] BN
BE/Mem         Backend_Bound.Memory_Bound:         20.64 % Slots     [  4.37%]
        This metric represents slots fraction the Memory subsystem
        within the Backend was a bottleneck...
BE/Core        Backend_Bound.Core_Bound:           21.49 % Slots     [  4.37%]
        This metric represents slots fraction where Core non-memory
        issues were of a bottleneck...
Info.Thread    BPTB:                                1.49 Metric      [  5.13%]
        Branch instructions per taken branch...
Info.Inst_Mix  IPB:                                 5.94 Metric      [  5.14%]
        Instructions per Branch
Info.Inst_Mix  IPCall:                             33.72 Metric      [  5.14%]
        Instruction per (near) call
Info.Thread    IFetch_Line_Utilization:             0.48 Metric      [  5.06%]
        Rough Estimation of fraction of fetched lines bytes that
        were likely consumed by program instructions
Info.Thread    DSB_Coverage:                        0.98 Metric      [  5.03%]
        Fraction of Uops delivered by the DSB (aka Decoded Icache;
        or Uop Cache)...
Info.Core      ILP:                                 2.72 CoreMetric  [  5.13%]
        Instruction-Level-Parallelism (average number of uops
        executed when there is at least 1 uop executed)
Info.Core      Branch_Misprediction_Cost:         275.00 CoreMetric  [  4.88%]
        Branch Misprediction Cost: Fraction of Issue Slots wasted
        per branch misprediction
Info.Core      BAClear_Cost:                       29.42 Metric      [  4.99%]
        Average Branch Address Clear Cost (fraction of cycles)
Info.Memory    Page_Walks_Utilization:              0.00 CoreMetric  [  5.08%]
        Utilization of the core's Page Walker(s) serving STLB misses
        triggered by instruction/Load/Store accesses
Info.Memory    L1D_Cache_Fill_BW:                   0.70 Metric      [  5.14%]
        Average data fill bandwidth to the L1 data cache [GB / sec]
Info.Memory    L2_Cache_Fill_BW:                    0.33 Metric      [  5.14%]
        Average data fill bandwidth to the L2 cache [GB / sec]
Info.Memory    TSX_Transactional_Cycles:            1.00 Clocks      [  5.17%]
        Fraction of cycles where the CPU is running in Transactional
        Memory mode (HLE or RTM)
Info.System    Time:                                2.38 Count      
        Run duration time in seconds
warning: 2 results not referenced: 47 52
warning: 11 division by zero errors: IPC CoreIPC CPI UPI IPL Load_Miss_Real_Latency IPS Kernel_Utilization Turbo_Utilization IPTB MEM_Request_Latency

If I just try to run parts of the perf command, I also see the broken output:

$ perf stat -e '{cpu/event=0xd0,umask=0x81/,cpu/event=0xc4,umask=0x20/,cpu/event=0xc2,umask=0x2/,ref-cycles,cpu/event=0xd0,umask=0x82/,cycles,instructions}' ~/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5 

 Performance counter stats for '/home/milian/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5':

     <not counted>      cpu/event=0xd0,umask=0x81/                                     (0.00%)
     <not counted>      cpu/event=0xc4,umask=0x20/                                     (0.00%)
     <not counted>      cpu/event=0xc2,umask=0x2/                                     (0.00%)
     <not counted>      ref-cycles                                                    (0.00%)
     <not counted>      cpu/event=0xd0,umask=0x82/                                     (0.00%)
     <not counted>      cycles                                                        (0.00%)
     <not counted>      instructions                                                  (0.00%)

       2.339836718 seconds time elapsed

If I remove any two of the events, the command starts to produce proper output:

$ perf stat -e '{cpu/event=0xc4,umask=0x20/,ref-cycles,cpu/event=0xd0,umask=0x82/,cycles,instructions}' ~/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5 
 Performance counter stats for '/home/milian/projects/kdab/training-material/addon/profiling/build/lab_mandelbrot/src/lab_mandelbrot -b 5':

     1,933,766,252      cpu/event=0xc4,umask=0x20/                                   
     7,729,695,286      ref-cycles                                                  
     1,928,066,897      cpu/event=0xd0,umask=0x82/                                   
     8,537,031,497      cycles                                                      
    17,157,571,475      instructions              #    2.01  insn per cycle         

       2.325396519 seconds time elapsed

Is this maybe due to hyper threading?

andikleen commented 7 years ago

Yes could be due to the HT exclusion workaround. Also try something that runs longer.

You can disable HT with

cputop.py 'thread == 1' offline | sh

(or thread == 0, if cpu 0 is included)

milianw commented 7 years ago

What do I need to get e.g. /sys/devices/system/cpu/cpu0/online? I don't have the online pseudo-file.

andikleen commented 7 years ago

It's a CONFIG option of the kernel.

But if thread == 1 includes cpu 0, just off line thread == 0 instead. It just has to be either all 0 or all 1 to disable hyper threading. Alternatively you can also disable it in the BIOS.

thuytien140894 commented 6 years ago

Hello! I encounter the same problem. I still have the same "division by zero" errors even though I disabled hyper-threading using cputop.py. Are there any other potential problems that could cause this?

I can see all the main metrics when using "perf stat -d my_workload".

Thanks