andikleen / pmu-tools

Intel PMU profiling tools
GNU General Public License v2.0

Per-function tracing #273

Closed sesse closed 4 years ago

sesse commented 4 years ago

Hi,

In most real software, I have lots of functions with hugely different performance characteristics; some will be frontend-bound, some will be cache-heavy, and so on. So a top-level view, along the lines of toplev.py or perf stat -M TopDownL1, isn't all that useful, since it averages together all of the different functions and it's hard to drill down.

VTune users keep posting screenshots that tell me this is possible to get, e.g. (function names column hidden): https://pbs.twimg.com/media/EdaQcMzU4AEjuZa?format=png&name=medium

Is this possible to get with toplev.py? I've seen --run-sample, but it only really picks up one metric, which rarely has any samples at all, and doesn't give me any useful information in general.

andikleen commented 4 years ago

--run-sample should work. It should be far more accurate than what VTune does, since it directly pinpoints occurrences rather than the end regions of sampling periods. Not sure what you mean by "no useful information".
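
For reference, a minimal invocation is roughly of this shape (a sketch; ./your-workload is a stand-in for whatever you want to measure):

./toplev.py -l1 -v --run-sample ./your-workload
perf report -i perf.data

toplev prints the exact perf record command it ends up running, so you can see which sampling events it picked for the bottleneck it found.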

sesse commented 4 years ago

Let me give you an example of how I'm trying to use it, where it doesn't help me at all. I have a program (https://github.com/quackle/quackle) where a regular perf cycles profile looks roughly like this:

    33.55%  test     test                  [.] Quackle::Generator::gordongen
    18.94%  test     test                  [.] Quackle::Generator::gordongoon
    12.86%  test     test                  [.] std::__detail::_Map_base<Quackle::FixedLengthString, std::pair<Quackle::FixedLengthString c
     6.96%  test     test                  [.] Quackle::Board::score
     6.47%  test     test                  [.] Quackle::Rack::unload
     3.08%  test     libc-2.28.so          [.] __memmove_avx_unaligned_erms

Of this, I would expect gordongen and gordongoon to most likely be bound by bad speculation, and the std::__detail::_Map_base function to be ripe with cache misses. Rack::unload might be port bound (I'm not sure). So I'd like an idea, for each function, where it's bound (and that's what VTune would give me). toplev.py on the program gives:

# 4.0-full on Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
S0-C0     FE             Frontend_Bound   % Slots                  27.74  <==
    This category represents fraction of slots where the
    processor's Frontend undersupplies its Backend...
S0-C0     BAD            Bad_Speculation  % Slots                  13.25 
    This category represents fraction of slots wasted due to
    incorrect speculations...
S0-C0     BE             Backend_Bound    % Slots                  25.29 
    This category represents fraction of slots where no uops are
    being delivered due to a lack of required resources for
    accepting new uops in the Backend...
S0-C1     FE             Frontend_Bound   % Slots                  24.03 
S0-C1     BAD            Bad_Speculation  % Slots                  18.25 
S0-C1     BE             Backend_Bound    % Slots                  24.84  <==
S0-C2     FE             Frontend_Bound   % Slots                  23.77  <==
S0-C2     BAD            Bad_Speculation  % Slots                  22.84 
S0-C2     BE             Backend_Bound    % Slots                  14.72 
S0-C3     FE             Frontend_Bound   % Slots                  33.15  <==
S0-C3     BAD            Bad_Speculation  % Slots                  15.42 
S0-C3     BE             Backend_Bound    % Slots                  22.51 
S0-C4     FE             Frontend_Bound   % Slots                  30.48 
S0-C4     BE             Backend_Bound    % Slots                  37.37  <==
S0-C8     FE             Frontend_Bound   % Slots                  31.69 
S0-C8     BE             Backend_Bound    % Slots                  32.76  <==
S0-C9     FE             Frontend_Bound   % Slots                  24.22  <==
S0-C9     BAD            Bad_Speculation  % Slots                  18.77 
S0-C9     BE             Backend_Bound    % Slots                  21.93 
S0-C10    FE             Frontend_Bound   % Slots                  23.21  <==
S0-C10    BAD            Bad_Speculation  % Slots                  22.61 
S0-C10    BE             Backend_Bound    % Slots                  16.94 
S0-C11    FE             Frontend_Bound   % Slots                  25.23  <==
S0-C11    BAD            Bad_Speculation  % Slots                  15.70 
S0-C11    BE             Backend_Bound    % Slots                  23.80 
S0-C12    FE             Frontend_Bound   % Slots                  28.72  <==
S0-C12    BAD            Bad_Speculation  % Slots                  18.46 
S0-C12    BE             Backend_Bound    % Slots                  18.57 
S1-C0     FE             Frontend_Bound   % Slots                  45.84  <==
S1-C1     FE             Frontend_Bound   % Slots                  36.56  <==
S1-C1     BE             Backend_Bound    % Slots                  36.16 
S1-C2     BE             Backend_Bound    % Slots                  52.20  <==
S1-C3     FE             Frontend_Bound   % Slots                  23.29  <==
S1-C3     BAD            Bad_Speculation  % Slots                  20.50 
S1-C3     BE             Backend_Bound    % Slots                  21.84 
S1-C4     FE             Frontend_Bound   % Slots                  32.61  <==
S1-C4     BAD            Bad_Speculation  % Slots                  17.06 
S1-C4     BE             Backend_Bound    % Slots                  19.75 
S1-C8     FE             Frontend_Bound   % Slots                  22.97  <==
S1-C8     BAD            Bad_Speculation  % Slots                  20.87 
S1-C8     BE             Backend_Bound    % Slots                  21.40 
S1-C9     FE             Frontend_Bound   % Slots                  31.94 
S1-C9     BE             Backend_Bound    % Slots                  38.10  <==
S1-C10    FE             Frontend_Bound   % Slots                  34.99  <==
S1-C10    BE             Backend_Bound    % Slots                  32.09 
S1-C11    FE             Frontend_Bound   % Slots                  28.60 
S1-C11    BE             Backend_Bound    % Slots                  37.28  <==
S1-C12    FE             Frontend_Bound   % Slots                  35.21 
S1-C12    BE             Backend_Bound    % Slots                  40.56  <==

OK, so there are too many cores to say anything useful (the program fires up a few threads, so I can't use --single-thread, but most of the time, only one of them is active). Let me try on my laptop:

C0    FE             Frontend_Bound   % Slots                  27.80  <==
    This category represents fraction of slots where the
    processor's Frontend undersupplies its Backend...
    Sampling events:  frontend_retired.latency_ge_8:pp
C0    BAD            Bad_Speculation  % Slots                  21.75 
    This category represents fraction of slots wasted due to
    incorrect speculations...
C0    BE             Backend_Bound    % Slots                  16.39 
    This category represents fraction of slots where no uops are
    being delivered due to a lack of required resources for
    accepting new uops in the Backend...
C0-T0 MUX                             %                        14.27 
    PerfMon Event Multiplexing accuracy indicator
C1    FE             Frontend_Bound   % Slots                  23.27 
C1    BAD            Bad_Speculation  % Slots                  25.17  <==
C1    BE             Backend_Bound    % Slots                  12.83 
C1-T0 MUX                             %                        33.32 
C2    FE             Frontend_Bound   % Slots                  23.84  <==
C2    BAD            Bad_Speculation  % Slots                  19.57 
C2    BE             Backend_Bound    % Slots                  16.10 
C2-T0 MUX                             %                        33.32 
C3    FE             Frontend_Bound   % Slots                  24.86  <==
C3    BAD            Bad_Speculation  % Slots                  20.55 
C3    BE             Backend_Bound    % Slots                  14.83 
C3-T0 MUX                             %                        33.33 
C0-T1 MUX                             %                        33.32 
C1-T1 MUX                             %                        33.32 
C2-T1 MUX                             %                        33.32 
C3-T1 MUX                             %                        33.32 

So overall, it's a bit more front-end bound than back-end bound, but it's probably different between the functions. It recommends frontend_retired.latency_ge_8:pp, so let me run again with --run-sample:

[…]
perf record -g -e cycles:pp,cpu/event=0xc6,umask=0x1,frontend=0x400806,name=Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_8,period=100007/pp -o perf.data -a ./test --mode=selfplay --seed=17234 --repetitions=1
[…]
[ perf record: Woken up 96 times to write data ]
[ perf record: Captured and wrote 25,556 MB perf.data (253887 samples) ]
Run `perf report -i perf.data' to show the sampling results

I run the command, so I can choose between cycles and Frontend_Bound_FRONTEND_RETIRED_LATENCY_GE_8. The former is just regular perf, so I'm trying the latter:

+   16,44%    15,61%  test             test                           [.] Quackle::Generator::gordongen
+   12,05%     0,00%  swapper          [kernel.kallsyms]              [k] secondary_startup_64
+   12,05%     0,00%  swapper          [kernel.kallsyms]              [k] cpu_startup_entry
+   12,05%     0,61%  swapper          [kernel.kallsyms]              [k] do_idle
+   10,53%     0,00%  swapper          [kernel.kallsyms]              [k] start_secondary
+    9,55%     8,86%  test             test                           [.] Quackle::Generator::gordongoon
+    7,65%     0,00%  test             [kernel.kallsyms]              [k] entry_SYSCALL_64_after_hwframe
+    7,65%     0,23%  test             [kernel.kallsyms]              [k] do_syscall_64
+    6,97%     0,76%  test             [unknown]                      [.] 0000000000000000
+    6,89%     0,38%  test             [kernel.kallsyms]              [k] __x64_sys_futex
+    6,52%     0,30%  test             [kernel.kallsyms]              [k] do_futex
+    5,76%     0,00%  test             [unknown]                      [.] 0x0000000000000015
+    5,53%     0,00%  test             [unknown]                      [k] 0x616c502070696873
+    4,32%     0,15%  swapper          [kernel.kallsyms]              [k] cpuidle_enter
+    4,17%     0,23%  swapper          [kernel.kallsyms]              [k] cpuidle_enter_state
+    4,17%     0,15%  test             libpthread-2.31.so             [.] pthread_cond_wait@@GLIBC_2.3.2
+    3,48%     0,08%  test             libpthread-2.31.so             [.] pthread_cond_signal@@GLIBC_2.3.2

Most of the hits here are in the kernel, which doesn't make much sense, as the program hardly ever does any syscalls. gordongen() and gordongoon() are in the profile, but looking at the annotation does not make me much wiser; a few test instructions are red-hot; does that mean this function is more FE-bound than BE-bound, or just that it's more FE-bound than 0%? score() and unload() are nowhere to be seen; does that mean they are bound by something else? How would I find out?

This is what I mean by “not useful”; --run-sample seems to solve a different problem from what I'm trying to get answered. It's possible that it's useful once you know exactly which events to look for and in which function, but isn't the entire point of this method to get an overview and then use that to drill down towards the interesting information?

I've also tried -l3 --run-sample; it samples a bunch more metrics, but I can only look at one of them at a time with perf report; there's no way to compare and get an overview. E.g., it samples L1_Bound_MEM_LOAD_RETIRED_L1_HIT, which I would assume would be useful as an intermediate for calculating some metric, but in itself, I'm not sure what to do with this information (after all, nearly all well-behaved functions should hit in L1 a lot).

andikleen commented 4 years ago

On Tue, Aug 11, 2020 at 12:36:36AM -0700, Steinar H. Gunderson wrote:

Let me give you an example of how I'm trying to use it, where it doesn't help me at all. I have a program (https://github.com/quackle/quackle) where a regular perf cycles profile looks roughly like this:

33.55%  test     test                  [.] Quackle::Generator::gordongen
18.94%  test     test                  [.] Quackle::Generator::gordongoon
12.86%  test     test                  [.] std::__detail::_Map_base<Quackle::FixedLengthString, std::pair<Quackle::FixedLengthString c
 6.96%  test     test                  [.] Quackle::Board::score
 6.47%  test     test                  [.] Quackle::Rack::unload
 3.08%  test     libc-2.28.so          [.] __memmove_avx_unaligned_erms

Of this, I would expect gordongen and gordongoon to most likely be bound by bad speculation, and the std::__detail::_Map_base function to be ripe with cache misses. Rack::unload might be port bound (I'm not sure). So I'd like an idea, for each function, where it's bound (and that's what VTune would give me).

Actually it doesn't. It gives you ranges, but that's not the same.

A hit on a function doesn't mean the problem happened in that function, only that it happened in a sampling interval that ended in that function. Now maybe it works with enough samples, but I believe the topdown method of using separate sampling events is far more accurate.

Most of the hits here are in the kernel, which doesn't make much sense, as the program hardly ever does any syscalls.

They are on idle CPUs.

gordongen() and gordongoon() are in the profile, but looking at the annotation does not make me much wiser; a few test instructions are red-hot; does that mean this function is more FE-bound than BE-bound, or just that it's more FE-bound than 0%? score() and unload() are

At the sample level TopDown doesn't really exist anymore. Instead it samples for the problems directly. What it reports is that the frontend was stalled for more than 8 cycles. Since you also have bad speculation, that is likely due to a branch misprediction on a branch leading to the test.

You could also sample for mispredicts to confirm that. Unfortunately toplev cannot do that automatically currently.

Another possible reason could be thrashing caches, but in that case it is usually widespread, and that's unlikely if your program is relatively small.

nowhere to be seen; does that mean they are bound by something else? How would I find out?

An alternative is also to look at timed LBR data. The toplev sampling already records the LBR, so you can use perf report --branch-history or perf script -F +brstackinsn --xed (needs xed installed).

Again this won't tell you exactly why, but it tells you that there are stalls and in which basic blocks they happen.
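
Concretely, with the perf.data that --run-sample wrote, something along these lines (a sketch; the second command needs xed installed, as noted above):

perf report -i perf.data --branch-history
perf script -i perf.data -F +brstackinsn --xed | less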

This is what I mean by “not useful”; --run-sample seems to solve a different problem from what I'm trying to get answered. It's possible that it's useful

They're solving the same problem, but with slightly different concepts.

toplev is not magic. It cannot do all of the analysis for you.

once you know exactly which events to look for and in which function, but isn't the entire point of this method to get an overview and then use that to drill down towards the interesting information?

Yes, but there are limits when you go from larger time scales to finer.

I've also tried -l3 --run-sample; it samples a bunch more metrics, but I can only look at one of them at a time with perf report; there's no way to compare and get an overview. E.g., it samples L1_Bound_MEM_LOAD_RETIRED_L1_HIT, which I would assume would be useful as an intermediate for calculating some metric, but in itself, I'm not sure what to do with this information (after all, nearly all well-behaved functions should hit in L1 a lot).

Yes, the sample data representation is not great. It would benefit from some tools and also some processing of the samples (like VTune does). There were attempts to write a web frontend (with mixed results), but no one has even tried it yet for sample representation.
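
In the meantime, one crude way to get a per-function, per-event overview out of the recorded data is to fold the perf script output by hand (a sketch; this counts raw samples per event/symbol pair, not period-weighted cost):

perf script -i perf.data -F event,sym | sort | uniq -c | sort -rn | head -30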

-Andi

aayasin commented 4 years ago

Can one log the toplev -vl3 output for this app?

sesse commented 4 years ago

Sure:

~/nmu/pmu-tools/toplev.py -vl3 ./test --mode=selfplay --seed=17234  --repetitions=1                                                   
Consider disabling nmi watchdog to minimize multiplexing
(echo 0 > /proc/sys/kernel/nmi_watchdog as root)
Will measure complete system.
Using level 3.
perf stat -x\; -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xc2,umask=0x2/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xc2,umask=0x2/,cpu/event=0xd,umask=0x1,any=1/,cpu/event=0xe,umask=0x1/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xc2,umask=0x2/,cpu/event=0xe,umask=0x1/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/,cpu/event=0xe,umask=0x1/},{cpu/event=0xc3,umask=0x1,edge=1,cmask=1/,cpu/event=0xc5,umask=0x0/},{cpu/event=0x9c,umask=0x1,cmask=4/,cpu/event=0xc6,umask=0x1,frontend=0x400106/},{cpu/event=0xc2,umask=0x2/,cpu/event=0xc6,umask=0x1,frontend=0x400206/,cpu/event=0xe,umask=0x1/},{cpu/event=0x3c,umask=0x0,any=1/,instructions},cycles,{cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa6,umask=0x1/,cpu/event=0xa6,umask=0x40/},{cpu/event=0xa6,umask=0x2/,cpu/event=0xa6,umask=0x4/,cpu/event=0xc2,umask=0x2/},{cpu/event=0x83,umask=0x4/,cpu/event=0x87,umask=0x1/,cpu/event=0xab,umask=0x2/,cycles},{cpu/event=0x14,umask=0x1,cmask=1/,cpu/event=0x79,umask=0x30,edge=1,cmask=1/,cpu/event=0xa6,umask=0x40/,cycles},{cpu/event=0x80,umask=0x4,cmask=1,edge=1/,cpu/event=0x80,umask=0x4/,cycles},{cpu/event=0xd,umask=0x80/,cpu/event=0xe6,umask=0x1/,cycles},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0x79,umask=0x24,cmask=1/,cpu/event=0x79,umask=0x24,cmask=4/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0x79,umask=0x18,cmask=1/,cpu/event=0x79,umask=0x18,cmask=4/},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xa8,umask=0x1,cmask=1/,cpu/event=0xa8,umask=0x1,cmask=4/},{cpu/event=0xa3,umask=0x14,cmask=20/,cpu/event=0xa3,umask=0xc,cmask=12/,cycles},{cpu/event=0xa3,umask=0x5,cmask=5/,cpu/event=0xa3,umask=0x6,cmask=6/,cycles},{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xc1,umask=0x3f/,cpu/event=0xca,umask=0x1e,cmask=1/},{cpu/event=0x14,umask=0x1,cmask=1/,cpu/event=0xa6,umask=0x1/,cpu/event=0xa6,umask=0x2/},{cpu/event=0xa6,umask=0x4/,cpu/event=0xc2,umask=0x2/,cycles},{cpu/event=0xd1,umask=0x2/,cpu/event=0xd1,umask=0x40/,cpu/event=0xd1,umask=0x8/},{cpu/event=0x48,umask=0x2,cmask=1/,cpu/event=0xa3,umask=0x5,cmask=5/,cpu/event=0xa3,umask=0xc,cmask=12/},{cpu/event=0xa3,umask=0x5,cmask=5/,cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa3,umask=0xc,cmask=12/,cycles},{cpu/event=0xb1,umask=0x1/,cpu/event=0xb1,umask=0x10/,cpu/event=0xc7,umask=0x2/},{cpu/event=0xc2,umask=0x2/,cpu/event=0xc7,umask=0x1/,cpu/event=0xc7,umask=0x4/},{cpu/event=0xc7,umask=0x10/,cpu/event=0xc7,umask=0x20/,cpu/event=0xc7,umask=0x8/}' -A -a ./test --mode=selfplay --seed=17234 --repetitions=1
[…regular program output skipped…]
# 4.0-full on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
C0    FE             Frontend_Bound                      % Slots                  23.86  
C0    BAD            Bad_Speculation                     % Slots                  25.52  
C0    BE             Backend_Bound                       % Slots                  13.28  
    This category represents fraction of slots where no uops are
    being delivered due to a lack of required resources for
    accepting new uops in the Backend...
C0    RET            Retiring                            % Slots                  37.59 ?
    This category represents fraction of slots utilized by
    useful work i...
    Sampling events:  uops_retired.retire_slots
C0    FE             Frontend_Bound.Frontend_Latency     % Slots                   8.74 ?
    This metric represents fraction of slots the CPU was stalled
    due to Frontend latency issues...
    Sampling events:  frontend_retired.latency_ge_16:pp frontend_retired.latency_ge_32:pp
C0    FE             Frontend_Bound.Frontend_Bandwidth   % Slots                  14.57  
    This metric represents fraction of slots the CPU was stalled
    due to Frontend bandwidth issues...
    Sampling events:  frontend_retired.latency_ge_2_bubbles_ge_1:pp
C0    BAD            Bad_Speculation.Branch_Mispredicts  % Slots                  26.17   <==
    This metric represents fraction of slots the CPU has wasted
    due to Branch Misprediction...
    Sampling events:  br_misp_retired.all_branches
C0    BAD            Bad_Speculation.Machine_Clears      % Slots                   0.08 ?
    This metric represents fraction of slots the CPU has wasted
    due to Machine Clears...
    Sampling events:  machine_clears.count
C0    BE/Mem         Backend_Bound.Memory_Bound          % Slots                   5.84 ?
    This metric represents fraction of slots the Memory
    subsystem within the Backend was a bottleneck...
C0    BE/Core        Backend_Bound.Core_Bound            % Slots                   7.45 ?
    This metric represents fraction of slots where Core non-
    memory issues were of a bottleneck...
C0    RET            Retiring.Base                       % Slots                  36.88 ?
    This metric represents fraction of slots where the CPU was
    retiring regular uops (ones not originated from the
    microcode-sequencer)...
    Sampling events:  inst_retired.prec_dist:pp
C0    RET            Retiring.Microcode_Sequencer        % Slots                   0.70 ?
    This metric represents fraction of slots the CPU was
    retiring uops fetched by the Microcode Sequencer (MS) unit...
    Sampling events:  idq.ms_uops
C0-T0 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  2.04 ?
    This metric represents fraction of cycles the CPU was
    stalled due to instruction cache misses...
    Sampling events:  frontend_retired.l2_miss:pp frontend_retired.l1i_miss:pp
C0-T0 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  0.88 ?
    This metric represents fraction of cycles the CPU was
    stalled due to Instruction TLB (ITLB) misses...
    Sampling events:  frontend_retired.stlb_miss:pp frontend_retired.itlb_miss:pp
C0-T0 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est              9.54 ?
    This metric represents fraction of cycles the CPU was
    stalled due to Branch Resteers...
    Sampling events:  br_misp_retired.all_branches
C0-T0 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.52 ?
    This metric represents fraction of cycles the CPU was
    stalled due to switches from DSB to MITE pipelines...
    Sampling events:  frontend_retired.dsb_miss:pp
C0-T0 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.01 ?
    This metric represents fraction of cycles CPU was stalled
    due to Length Changing Prefixes (LCPs)...
C0-T0 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  0.97 ?
    This metric estimates the fraction of cycles when the CPU
    was stalled due to switches of uop delivery to the Microcode
    Sequencer (MS)...
    Sampling events:  idq.ms_switches
C0-T0 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            14.90 ?
    This metric represents Core fraction of cycles in which CPU
    was likely limited due to the MITE pipeline (Legacy Decode
    Pipeline)...
    Sampling events:  frontend_retired.dsb_miss:pp
C0-T0 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks             8.07 ?
    This metric represents Core fraction of cycles in which CPU
    was likely limited due to DSB (decoded uop cache) fetch
    pipeline...
C0-T0 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
    This metric represents Core fraction of cycles in which CPU
    was likely limited due to LSD (Loop Stream Detector) unit...
C0-T0 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.32 ?
    This metric estimates how often the CPU was stalled without
    loads missing the L1 data cache...
    Sampling events:  mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
C0-T0 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.36 ?
    This metric estimates how often the CPU was stalled due to
    L2 cache accesses by loads...
    Sampling events:  mem_load_retired.l2_hit:pp
C0-T0 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  2.98 ?
    This metric estimates how often the CPU was stalled due to
    loads accesses to L3 cache or contended with a sibling Core...
    Sampling events:  mem_load_retired.l3_hit:pp
C0-T0 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  5.58 ?
    This metric estimates how often the CPU was stalled on
    accesses to external memory (DRAM) by loads...
    Sampling events:  mem_load_retired.l3_miss:pp
C0-T0 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.12 ?
    This metric estimates how often CPU was stalled  due to
    store memory accesses...
    Sampling events:  mem_inst_retired.all_stores:pp
C0-T0 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.05 ?
    This metric represents fraction of cycles where the Divider
    unit was active...
    Sampling events:  arith.divider_active
C0-T0 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 20.32 ?
    This metric estimates fraction of cycles the CPU performance
    was potentially limited due to Core computation issues (non
    divider-related)...
C0-T0 RET            Retiring.Base.FP_Arith                           % Uops                    0.32 ?
    This metric represents overall arithmetic floating-point
    (FP) uops fraction the CPU has executed (retired)
C0-T0 RET            Retiring.Base.Other                              % Uops                   99.68 ?
    This metric represents non-floating-point (FP) uop fraction
    the CPU has executed...
C0-T0 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
    This metric estimates fraction of cycles the CPU retired
    uops delivered by the Microcode_Sequencer as a result of
    Assists...
    Sampling events:  other_assists.any
C0-T0 MUX                                                             %                         3.03  
    PerfMon Event Multiplexing accuracy indicator
C1    FE             Frontend_Bound                                   % Slots                  25.99  
C1    BAD            Bad_Speculation                                  % Slots                  25.39  
C1    BE             Backend_Bound                                    % Slots                  12.46  
C1    RET            Retiring                                         % Slots                  35.60 ?
C1    FE             Frontend_Bound.Frontend_Latency                  % Slots                  10.88 ?
C1    FE             Frontend_Bound.Frontend_Bandwidth                % Slots                  14.32  
C1    BAD            Bad_Speculation.Branch_Mispredicts               % Slots                  25.86  
C1    BAD            Bad_Speculation.Machine_Clears                   % Slots                   0.09 ?
C1    BE/Mem         Backend_Bound.Memory_Bound                       % Slots                   5.73 ?
C1    BE/Core        Backend_Bound.Core_Bound                         % Slots                   6.72 ?
C1    RET            Retiring.Base                                    % Slots                  34.85 ?
C1    RET            Retiring.Microcode_Sequencer                     % Slots                   0.74 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  3.02 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  1.25 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est             10.58 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.61 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.02 ?
C1-T0 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.57 ?
C1-T0 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            17.50   <==
C1-T0 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks             9.13 ?
C1-T0 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C1-T0 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.18 ?
C1-T0 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.34 ?
C1-T0 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  3.39 ?
C1-T0 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  4.81 ?
C1-T0 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.21 ?
C1-T0 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.17 ?
C1-T0 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 19.83 ?
C1-T0 RET            Retiring.Base.FP_Arith                           % Uops                    0.26 ?
C1-T0 RET            Retiring.Base.Other                              % Uops                   99.74 ?
C1-T0 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C1-T0 MUX                                                             %                         3.45  
C2    FE             Frontend_Bound                                   % Slots                  24.42  
C2    BAD            Bad_Speculation                                  % Slots                  24.82  
C2    BE             Backend_Bound                                    % Slots                  14.05  
C2    RET            Retiring                                         % Slots                  36.32 ?
C2    FE             Frontend_Bound.Frontend_Latency                  % Slots                  11.37 ?
C2    FE             Frontend_Bound.Frontend_Bandwidth                % Slots                  13.44  
C2    BAD            Bad_Speculation.Branch_Mispredicts               % Slots                  25.51   <==
C2    BAD            Bad_Speculation.Machine_Clears                   % Slots                   0.09 ?
C2    BE/Mem         Backend_Bound.Memory_Bound                       % Slots                   6.65 ?
C2    BE/Core        Backend_Bound.Core_Bound                         % Slots                   7.40 ?
C2    RET            Retiring.Base                                    % Slots                  35.49 ?
C2    RET            Retiring.Microcode_Sequencer                     % Slots                   0.83 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  1.71 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  0.85 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est             10.02 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.64 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.01 ?
C2-T0 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.34 ?
C2-T0 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            29.41  
C2-T0 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks            16.68  
C2-T0 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C2-T0 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.35 ?
C2-T0 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.31 ?
C2-T0 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  2.91 ?
C2-T0 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  3.68 ?
C2-T0 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.12 ?
C2-T0 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.10 ?
C2-T0 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 19.82 ?
C2-T0 RET            Retiring.Base.FP_Arith                           % Uops                    0.31 ?
C2-T0 RET            Retiring.Base.Other                              % Uops                   99.69 ?
C2-T0 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C2-T0 MUX                                                             %                         3.44  
C3    FE             Frontend_Bound                                   % Slots                  24.17  
C3    BAD            Bad_Speculation                                  % Slots                  26.57  
C3    BE             Backend_Bound                                    % Slots                  13.03  
C3    RET            Retiring                                         % Slots                  37.57 ?
C3    FE             Frontend_Bound.Frontend_Latency                  % Slots                  10.04 ?
C3    FE             Frontend_Bound.Frontend_Bandwidth                % Slots                  15.19  
C3    BAD            Bad_Speculation.Branch_Mispredicts               % Slots                  25.33   <==
C3    BAD            Bad_Speculation.Machine_Clears                   % Slots                   0.11 ?
C3    BE/Mem         Backend_Bound.Memory_Bound                       % Slots                   6.04 ?
C3    BE/Core        Backend_Bound.Core_Bound                         % Slots                   6.99 ?
C3    RET            Retiring.Base                                    % Slots                  36.64 ?
C3    RET            Retiring.Microcode_Sequencer                     % Slots                   0.92 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  1.91 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  0.80 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est              9.38 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.52 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.01 ?
C3-T0 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.08 ?
C3-T0 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            26.77  
C3-T0 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks            14.86 ?
C3-T0 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C3-T0 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  8.66 ?
C3-T0 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.40 ?
C3-T0 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  2.76 ?
C3-T0 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  4.90 ?
C3-T0 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.16 ?
C3-T0 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.05 ?
C3-T0 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 19.58 ?
C3-T0 RET            Retiring.Base.FP_Arith                           % Uops                    0.33 ?
C3-T0 RET            Retiring.Base.Other                              % Uops                   99.67 ?
C3-T0 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C3-T0 MUX                                                             %                         3.45  
C0-T1 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  1.42 ?
C0-T1 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  0.51 ?
C0-T1 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est              8.89 ?
C0-T1 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.69 ?
C0-T1 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.01 ?
C0-T1 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.32 ?
C0-T1 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            26.52 ?
C0-T1 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks            14.60 ?
C0-T1 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C0-T1 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  8.74 ?
C0-T1 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.35 ?
C0-T1 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  2.80 ?
C0-T1 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  4.10 ?
C0-T1 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.08 ?
C0-T1 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.06 ?
C0-T1 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 21.08 ?
C0-T1 RET            Retiring.Base.FP_Arith                           % Uops                    0.30 ?
C0-T1 RET            Retiring.Base.Other                              % Uops                   99.70 ?
C0-T1 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C0-T1 MUX                                                             %                         3.45  
C1-T1 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  1.65 ?
C1-T1 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  0.63 ?
C1-T1 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est              9.85 ?
C1-T1 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.63 ?
C1-T1 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.00 ?
C1-T1 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.01 ?
C1-T1 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            22.00 ?
C1-T1 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks            12.50 ?
C1-T1 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C1-T1 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.35 ?
C1-T1 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.35 ?
C1-T1 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  2.19 ?
C1-T1 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  3.49 ?
C1-T1 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.11 ?
C1-T1 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.05 ?
C1-T1 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 18.94 ?
C1-T1 RET            Retiring.Base.FP_Arith                           % Uops                    0.35 ?
C1-T1 RET            Retiring.Base.Other                              % Uops                   99.65 ?
C1-T1 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C1-T1 MUX                                                             %                         3.44  
C2-T1 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  3.06 ?
C2-T1 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  1.58 ?
C2-T1 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est             10.80 ?
C2-T1 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.28 ?
C2-T1 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.02 ?
C2-T1 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  1.44 ?
C2-T1 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            11.73 ?
C2-T1 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks             6.09 ?
C2-T1 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C2-T1 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.27 ?
C2-T1 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.57 ?
C2-T1 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  3.27 ?
C2-T1 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  7.19 ?
C2-T1 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.20 ?
C2-T1 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.21 ?
C2-T1 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 19.38 ?
C2-T1 RET            Retiring.Base.FP_Arith                           % Uops                    0.36 ?
C2-T1 RET            Retiring.Base.Other                              % Uops                   99.64 ?
C2-T1 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C2-T1 MUX                                                             %                         3.44  
C3-T1 FE             Frontend_Bound.Frontend_Latency.ICache_Misses    % Clocks                  3.55 ?
C3-T1 FE             Frontend_Bound.Frontend_Latency.ITLB_Misses      % Clocks                  1.58 ?
C3-T1 FE             Frontend_Bound.Frontend_Latency.Branch_Resteers  % Clocks_est             10.78 ?
C3-T1 FE             Frontend_Bound.Frontend_Latency.DSB_Switches     % Clocks                  3.34 ?
C3-T1 FE             Frontend_Bound.Frontend_Latency.LCP              % Clocks                  0.00 ?
C3-T1 FE             Frontend_Bound.Frontend_Latency.MS_Switches      % Clocks                  2.06 ?
C3-T1 FE             Frontend_Bound.Frontend_Bandwidth.MITE           % Core_Clocks            13.07 ?
C3-T1 FE             Frontend_Bound.Frontend_Bandwidth.DSB            % Core_Clocks             7.09 ?
C3-T1 FE             Frontend_Bound.Frontend_Bandwidth.LSD            % Core_Clocks             0.00 ?
C3-T1 BE/Mem         Backend_Bound.Memory_Bound.L1_Bound              % Stalls                  9.03 ?
C3-T1 BE/Mem         Backend_Bound.Memory_Bound.L2_Bound              % Stalls                  0.40 ?
C3-T1 BE/Mem         Backend_Bound.Memory_Bound.L3_Bound              % Stalls                  4.09 ?
C3-T1 BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound            % Stalls                  5.89 ?
C3-T1 BE/Mem         Backend_Bound.Memory_Bound.Store_Bound           % Stalls                  0.22 ?
C3-T1 BE/Core        Backend_Bound.Core_Bound.Divider                 % Clocks                  0.15 ?
C3-T1 BE/Core        Backend_Bound.Core_Bound.Ports_Utilization       % Clocks                 19.24 ?
C3-T1 RET            Retiring.Base.FP_Arith                           % Uops                    0.32 ?
C3-T1 RET            Retiring.Base.Other                              % Uops                   99.68 ?
C3-T1 RET            Retiring.Microcode_Sequencer.Assists             % Slots_est               0.00 ?
C3-T1 MUX                                                             %                         3.44  
warning: 5 results not referenced:7 9 10 75 77

andikleen commented 4 years ago

On the problem with the idle CPUs:

toplev has to measure all CPUs because you have SMT enabled and it needs to get data from both siblings.

With SMT disabled (e.g. cputop.py 'thread == 1' offline | sh) it would only measure the process.
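
For example (a sketch; I'm assuming a matching 'online' action exists to undo it, check cputop.py --help):

./cputop.py 'thread == 1' offline | sh
./toplev.py -l1 ./test --mode=selfplay --seed=17234 --repetitions=1
./cputop.py 'thread == 1' online | sh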

There's a tool "utilized.py" that can filter out idle CPUs from the CSV output files. Probably toplev should integrate this in the standard output.
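
For example (a sketch; I'm assuming utilized.py takes the CSV file name as an argument, check its --help):

./toplev.py -l1 -I 1000 -x, -o results.csv ./test --mode=selfplay --seed=17234 --repetitions=1
./utilized.py results.csv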

sesse commented 4 years ago

I found that VTune actually exists for Linux and is free these days, so for reference, here's the same program:

[screenshots attached: vtune1, vtune2]

Given that VTune solves my problem, feel free to close the bug.

aayasin commented 4 years ago

I think there are a few improvements that toplev can consider given this issue:

aayasin commented 4 years ago

I found that VTune actually exists for Linux and is free these days, so for reference, here's the same program:

[screenshots attached: vtune1, vtune2]

Given that VTune solves my problem, feel free to close the bug.

Please clarify how VTune helped: was it because close to half of the time is spent in gordongen(), the top hotspot? Or because you see 36% Bad Speculation for that hotspot? Or did you use the BR_MISP_RETIRED.ALL_BRANCHES event to locate the offending branch in the Src/Asm view? A screenshot of the latter would help.

Did you manage to speed up your code in the end, and if so, how?

High Bad_Speculation and Frontend_Bound are typically signs that it is worth attempting PGO (profile-guided optimization). Also, it is a long-tail profile if we leave gordongen() and gordongoon() aside.

sesse commented 4 years ago

I already knew (from perf) that gordongen() was the top function. But it was useful to learn that bad speculation was the top reason for most of the hot functions; I had not expected that. (I am also curious as to why gordongoon() is front-end bound; it is very unexpected to me, so I'll need to look a bit more closely.)

Given that I only had VTune running yesterday evening and am working full-time on something completely different, I can't claim to see any concrete optimizations coming out of this, no. But it already saved me from going down a few wild goose chases, which is useful in itself. I haven't looked a lot at VTune's assembler output yet; it's quite possible that perf is already better there for my uses.

sesse commented 4 years ago

Did you manage to speed up your code in the end, and if so, how?

Update: I played a bit with the code and profiles tonight; I noticed that it said parts of gordongen() were L1 bound, which I didn't even know was possible. I used that to reduce some of the memory traffic, leading to a ~4% overall speedup (10%-ish for the function). Fairly happy about it.

I still have no idea about the MITE frontend stalls, and I'll probably have to go educate myself a bit on that part before I can understand what's going on. :-) The per-line samples in VTune are a tad too broad to be immediately useful here (multiplexing spread too thin).

aayasin commented 4 years ago

Awesome. Glad to hear you had a win 👍

Quantifying the cost of branch mispredictions is tricky in modern out-of-order cores. In the TMA method, the cost is distributed between Bad Speculation and Branch Resteers, which is placed under Frontend Bound (e.g. it may overlap with fetch misses). Section 3 of my ISPASS'14 paper explains this and other aspects of the microarchitecture.
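
In other words, to a first approximation: total misprediction cost ≈ Bad_Speculation + the (mostly Branch_Resteers) share of Frontend_Bound.Frontend_Latency, both in % of slots. That is how the per-function estimate later in this thread is derived.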

MITE is an Intel acronym for the traditional instruction fetch pipeline, which handles i-cache/TLB misses, identifies instructions in the cacheline bytes, and decodes them into micro-operations (uops) that are consumable by the back-end of the pipeline. If hot, these uops may be further cached in the uop cache (DSB), which enables more efficient future fetches. The fetch latency of the corrected path after a branch misprediction could very well be related to your profile, as it adds to the cost of a single misprediction (but I would first attempt to reduce the mispredictions themselves if possible).
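
To see how the uop supply actually splits between the legacy pipeline, the uop cache and the microcode sequencer, a plain counting run over the IDQ events gives a quick sanity check (a sketch; event names as spelled on recent big cores, and whether plain perf resolves them depends on your perf version, otherwise ocperf.py from pmu-tools can translate them):

perf stat -e idq.mite_uops,idq.dsb_uops,idq.ms_uops ./test --mode=selfplay --seed=17234 --repetitions=1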

I am happy to analyze this, but we (Andi) could not reproduce it on our end. VTune has an option to dump a tar for remote analysis; try /opt/intel/vtune_profiler_2020/bin64/amplxe-cl --archive -r /path/to/r004ue. Or, expand the Frontend Latency, Frontend Bandwidth and Memory Bound columns in the grid (bottom-up) snapshot of the r004ue profile above.

sesse commented 4 years ago

Mm, OK, so high numbers in MITE essentially say “decode took time for some reason, and your code wasn't in the DSB”? (Possibly because there wasn't room.) These things are confusing for those of us who are not microarchitecture experts :-) This is maybe the primary reason why I'm interested in the top-down methodology; it takes some of those more obscure counters and packages them up into more visible totals.

I'll be happy to make an archive (I stashed my optimizations and re-ran the profile, still showing the MITE issues), but where is the output of the --archive? I.e., what do I send to you?

aayasin commented 4 years ago

I'll be happy to make an archive (I stashed my optimizations and re-ran the profile, still showing the MITE issues), but where is the output of the --archive? I.e., what do I send to you?

I believe VTune places a tar/zip file next to the results folder saved under your home directory.

sesse commented 4 years ago

It doesn't seem to do so; I even straced it to look. I think it maybe just copies all the relevant binaries into the project directory, so you can tar it up yourself. I've tried that; see if https://storage.sesse.net/r005ue.tar.gz works for you.

aayasin commented 4 years ago

It doesn't seem to do so; I even straced it to look. I think it maybe just copies all the relevant binaries into the project directory, so you can tar it up yourself. I've tried that; see if https://storage.sesse.net/r005ue.tar.gz works for you.

I analyzed the profile today. 37% of Clocks (the CPU_CLK_UNHALTED.THREAD event) are spent in Quackle::Generator::gordongen(), where 46% of the TMA slots in that function are wasted due to branch misprediction. The 46% is the estimated sum of Bad Speculation and Frontend_Bound.Frontend_Latency, which is mostly due to Branch_Resteers.

When double-clicking that line in the Bottom-up (grid) view, VTune goes directly to the Asm view (source info is unavailable - see the 2nd bullet below) and lands on Block 20. The JCC (conditional branch) at the end of that block (jle 0x3c506 at address 0x3c452) accounts for 21% of overall branch mispredictions (using the BR_MISP_RETIRED.ALL_BRANCHES precise event). That whole block, plus one more instruction in either direction (where the misprediction cost would pile up), accounts for 24% of total Clocks! That is very compelling to attack.

Through a manual calculation for this JCC, I estimate it is taken (i.e. jumps directly to Block 26) in 79% of cases. Thus flipping its direction might help. I'd love to see a focused experiment on that, and to see what the source code of the loop from Block 20 to Block 27 looks like (I guess it walks a list and does some calculation when branching to Block 23). E.g. in the source code, you may try to flip the condition this JCC checks for.

In general, the above data proves that PGO is likely to speed up your workload.
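
By the way, a roughly equivalent drill-down is possible with plain perf, by sampling the same precise event and annotating the hot function (a sketch; the symbol argument must match the demangled name perf report displays):

perf record -e br_misp_retired.all_branches:pp -o brmisp.data -- ./test --mode=selfplay --seed=17234 --repetitions=1
perf report -i brmisp.data --stdio --sort=symbol
perf annotate -i brmisp.data --stdio Quackle::Generator::gordongen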

If you profile & share results once more, please

I hope this helps.

sesse commented 4 years ago

The code is open, so it's readily available: https://github.com/sesse/quackle/blob/optimize/generator.cpp#L607

I am not as optimistic about PGO as you are; it's traversing a large (~10 MB, optimized down from 80 MB by some earlier patches) trie-like structure, searching for different words all the time. The branch patterns are heavily data-driven, so I doubt doing anything about the basic block layout is going to change anything. But sure, let me try it. Here's a base run of the current code (fixed seed):

   174 780 011 087      cycles                                                      
    48 359 047 227      branches                                                    
     2 586 254 222      branch-misses             #    5,35% of all branches        

      55,683634286 seconds time elapsed

I added -fprofile-generate, ran it once with the same seed, recompiled with -fprofile-use, and measured again with the same seed:

   176 921 490 295      cycles                                                      
    47 003 684 435      branches                                                    
     2 881 304 530      branch-misses             #    6,13% of all branches        

      57,385946495 seconds time elapsed

So branch misses actually go up with PGO enabled, and the binary becomes slower. (I ran a few times to check, and while there is some noise, it's pretty clear PGO isn't helping here.) LTO+PGO fares no better:

   177 836 671 305      cycles                                                      
    45 456 292 754      branches                                                    
     2 958 194 525      branch-misses             #    6,51% of all branches        

      57,577303444 seconds time elapsed

I must add that this is a pretty common pattern I see; everybody's success stories about LTO and PGO sound very nice, but they cause a bunch of nonreproducibility and build-system issues, and I don't often see them do much good in practice :-)
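
(For reference, the PGO sequence follows the usual GCC three-step shape, roughly as sketched below; the source file list is abbreviated, and the real build goes through the project's generated makefiles, so the exact plumbing differed.)

g++ -O2 -g -fprofile-generate -o test <sources...>
./test --mode=selfplay --seed=17234 --repetitions=1      # training run; writes .gcda profile files
g++ -O2 -g -fprofile-use -o test <sources...>
perf stat -e cycles,branches,branch-misses ./test --mode=selfplay --seed=17234 --repetitions=1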

My example binary was indeed compiled with -g, but without source, of course, it's not that useful.

aayasin commented 4 years ago

I did find the offending loop and want to do the experiment myself.

My example binary was indeed compiled with -g, but without source, of course, it's not that useful.

I spent some time trying to install things per README (which was not straightforward) to reproduce it on my end. I ran out of clues for this error:

quackle/quacker$ make CXXFLAGS+="-g -std=c++11" CFLAGS+=-g
g++ -c -g -std=c++11 -I/usr/share/qt4/mkspecs/linux-g++-64 -I. -I/usr/include/qt4/QtCore -I/usr/include/qt4/QtGui -I/usr/include/qt4 -I. -I.. -Imoc -o obj/release/bagdisplay.o bagdisplay.cpp
bagdisplay.cpp:21:21: fatal error: QtWidgets: No such file or directory
compilation terminated.
Makefile:553: recipe for target 'obj/release/bagdisplay.o' failed
make: *** [obj/release/bagdisplay.o] Error 1

despite multiple Google attempts at figuring out how to make #include <QtWidgets> work! Please help.

I suspect I do not need the GUI to reproduce it. Please tell me how to generate the test/test binary, with whatever is required to run your quackle/test/test --mode=selfplay --computer='Championship Player' --seed=17234 --repetitions=1

These are the key steps I did, with my system versions:

git clone https://github.com/sesse/quackle
cd quackle/
sudo apt-get install qt4-qmake
qmake quackle.pro
make CXXFLAGS+="-g -std=c++11" CFLAGS+=-g
cd quackleio && qmake
sudo apt-get install libqt4-dev
make CXXFLAGS+="-g -std=c++11" CFLAGS+=-g
cd ..
cd quacker/
qmake
make CXXFLAGS+="-g -std=c++11" CFLAGS+=-g

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"

$ uname -a
Linux arch-ubt02 5.3.0 #1 SMP Thu Sep 26 21:17:29 IDT 2019 x86_64 x86_64 x86_64 GNU/Linux

$ qmake -v
QMake version 2.01a
Using Qt version 4.8.7 in /usr/lib/x86_64-linux-gnu

Note the default makefiles do emit -g by default.

sesse commented 4 years ago

It's a mess, indeed. I didn't make the build system, or really the software at all; I must confess that I generally solve build system issues with a bunch of make clean + make -j4 + ccache. I can send you my working directory, which at least has some Makefiles pregenerated, if it helps? (And lots of crap in git stash :-) )

FWIW, you are using Qt4, and I believe Quackle wants Qt5.

aayasin commented 4 years ago

Aha... Andi was right too; now we are all saying the reproduction is a mess :)

Please tell me the steps I need to follow to create the test/test binary myself. I hope I do not need the Qt GUI thing at all. I'm afraid I cannot spend more time on this stuff.

I just want to be able to build the smallest reproducer myself, so I can add -g and tweak the code in Generator::gordongen() a bit for a designated experiment.

sesse commented 4 years ago

Don't put too much energy into it; it's not an important project, it's mainly me trying to expand my profiling skill set (in particular, around toplev :-) ).

You can have a version completely without any build system. It does not require the Qt5 GUI, but it does require Qt5 Core:

    cd test
    g++ -g -O2 -DNDEBUG -o test testmain.cpp testharness.cpp trademarkedboards.cpp -I.. \
        $( pkg-config --libs --cflags Qt5Core ) -fPIC \
        ../quackleio/util.cpp ../quackleio/gcgio.cpp ../datamanager.cpp ../lexiconparameters.cpp \
        ../strategyparameters.cpp ../alphabetparameters.cpp ../catchall.cpp ../gameparameters.cpp \
        ../evaluator.cpp ../rack.cpp ../move.cpp ../playerlist.cpp ../quackleio/gaddagfactory.cpp \
        ../quackleio/froggetopt.cpp ../quackleio/flexiblealphabet.cpp ../boardparameters.cpp \
        ../quackleio/dictfactory.cpp ../quackleio/dictimplementation.cpp ../generator.cpp ../bag.cpp \
        ../board.cpp ../game.cpp ../player.cpp ../enumerator.cpp ../computerplayer.cpp \
        ../quackleio/dict.cpp ../sim.cpp ../computerplayercollection.cpp ../reporter.cpp \
        ../resolvent.cpp ../endgameplayer.cpp ../bogowinplayer.cpp ../preendgame.cpp \
        ../endgame.cpp ../clock.cpp -pthread

You also want to download https://storage.sesse.net/twl06.gaddag and put it into data/lexica/, or it will use a different data structure, which is much slower.
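Concretely, fetching the lexicon and running the reproducer should be something like this (assuming you run from the top of the quackle checkout and built the binary in test/):

    # Fetch the GADDAG lexicon and run the selfplay reproducer.
    # Paths assume the quackle/ checkout root, with the binary built in test/.
    mkdir -p data/lexica
    wget -O data/lexica/twl06.gaddag https://storage.sesse.net/twl06.gaddag
    ./test/test --mode=selfplay --computer='Championship Player' --seed=17234 --repetitions=1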

andikleen commented 4 years ago

On Tue, Aug 18, 2020 at 01:22:19PM +0000, aayasin wrote:

> Aha... Andi was right too; now we are all saying the reproduction is a mess :)
>
> Please tell me the steps I need to create the test/test binary myself. I hope I do not need the Qt GUI thing at all. I'm afraid I cannot spend more time on this stuff.
>
> I just want to be able to build the smallest reproducer myself, so I can add -g and tweak the code in Generator::gordongen() a bit for a designated experiment.

It should build on trent. I installed the needed packages here.

Otherwise you would need the right Qt5 devel packages installed.

andikleen commented 4 years ago

The idle CPU filter is implemented now, along with improvements to the sample event selection. For the branch misprediction bottleneck it now samples branch mispredicts automatically.
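For this workload that would look roughly like the following (a sketch; the level and extra flags are just one reasonable choice, and the workload path is the one quoted earlier in the thread):

    # Sketch: let toplev choose the sampling events for the dominant bottleneck.
    toplev.py -l3 --run-sample --no-desc -- \
        ./test/test --mode=selfplay --computer='Championship Player' --seed=17234 --repetitions=1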

aayasin commented 4 years ago

Quantifying the cost of branch mispredictions is tricky in modern out-of-order cores. In the TMA method, the cost is distributed between Bad Speculation and Branch Resteers, the latter placed under Frontend Bound (e.g. it may overlap with fetch misses). Section 3 of my ISPASS'14 paper explains this and other aspects of the microarchitecture.

@sesse @andikleen I am happy to share that with the brand new toplev.py --drilldown mode, the tool is able to automatically zoom in and conclude that branch misprediction is the key issue in this workload. In the first run, Bad Speculation dominates. In the second run it sees that this was due to Branch Mispredicts, which lets it do a third run to collect the Branch Resteers portion I mentioned earlier.

Also, with TMA metrics 4.11, I included a new informative metric called Mispredictions that aggregates the penalty across the tree. It is at 32.7% of all slots for this workload. That is, a third of the slots are utilized by useful work, another third is thrown away due to mispredictions, and the remaining third is (mostly other fetch-related) stalls. Wow!
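If you only want that aggregate number, it should be possible to request the metric directly on a single run, along these lines (a sketch; whether the metric can be pulled in by name with --nodes this way is an assumption on my part):

    # Sketch: request the aggregated Mispredictions bottleneck metric by name,
    # in the same way other Info metrics are added via --nodes.
    toplev.py -l1 --no-desc --nodes '+Mispredictions' -- ./run.sh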

A demo of the new mode and its output is logged below.

$ toplev.py --no-desc --nodes '+CoreIPC,+UPI,+CPU_Utilization,+Time,+MUX' --drilldown -V run-toplev--drilldown.csv -- ./run.sh  2>&1

Using level 1.                                                                                                                                                                         
Game 0 played in 37 seconds with 24 moves                                                                                                                                              
# 4.11-full-perf on Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz [skl]                                                                                                                     
FE             Frontend_Bound   % Slots                   24.6                                                                                                                         
BAD            Bad_Speculation  % Slots                   29.4  <==                                                                                                                    
Info.Core      CoreIPC            CoreMetric               1.5                                                                                                                         
Info.Thread    UPI                Metric                   1.0                                                                                                                         
Info.System    CPU_Utilization    Metric                   1.0                                                                                                                         
Info.System    Time               Seconds                 37.5           

Adding --nodes '!+Bad_Speculation*/2' for breakdown on the bottleneck.
Rerunning workload                                                                                                                                                                     
Game 0 played in 37 seconds with 24 moves                                                                                                                                              
BAD            Bad_Speculation                     % Slots                   29.5                                                                                                      
BAD            Bad_Speculation.Branch_Mispredicts  % Slots                   29.4  <==                                                                                                 
Info.Core      CoreIPC                               CoreMetric               1.5                                                                                                      
Info.Thread    UPI                                   Metric                   1.0                                                                                                      
Info.System    CPU_Utilization                       Metric                   1.0                                                                                                      
Info.System    Time                                  Seconds                 37.3
MUX                                                %                         49.0

Adding --nodes '!+Bad_Speculation.Branch_Mispredicts^' for breakdown on the bottleneck.
Rerunning workload
Game 0 played in 37 seconds with 24 moves
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers.Mispredicts_Resteers  % Clocks                   6.9
BAD            Bad_Speculation.Branch_Mispredicts                                 % Slots                   28.6  <==
Info.Bottleneck Mispredictions                                                      Slots                   32.7
Info.Core      Branch_Misprediction_Cost                                            CoreMetric              80.4
Info.Core      IpMispredict                                                         Metric                  91.3
BAD            Bad_Speculation                                                    % Slots                   28.7
Info.Core      CoreIPC                                                              CoreMetric               1.5
Info.Thread    UPI                                                                  Metric                   0.9
Info.System    CPU_Utilization                                                      Metric                   1.0
Info.System    Time                                                                 Seconds                 37.8
MUX                                                                               %                          6.0

Sampling:
/home/lab_arch/perf record -g -e cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,cpu/event=0xd,umask=0x80,name=Mispredicts_Resteers_INT_MISC_CLEAR_RESTEER_CYCLES,period=2000003/,cycles:pp -o perf.data.2 ./run.sh

$ cat ./run.sh
taskset 0xC ~/quackle/test/test --mode=selfplay --computer='Championship Player' --seed=17234  --repetitions=1 > quackle.log 2>&1
tail -1 quackle.log
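To tie this back to the original per-function question, the recorded samples can then be attributed to individual functions with a standard perf report, e.g. (the -i file matches the perf record output above):

    # Attribute the sampled mispredict/resteer events to functions.
    perf report -i perf.data.2 --sort comm,dso,symbol --no-children
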
andikleen commented 4 years ago

@aayasin Thanks Ahmad for suggesting this useful feature.