andreas-abel / uiCA

uops.info Code Analyzer

Averaged port use across iterations? #7

Closed chriselrod closed 3 years ago

chriselrod commented 3 years ago

Here is an example with 27 fma instructions, where uiCA assigns 13 of them to port 0 and 14 to port 5. This leads to an estimated reciprocal throughput of 14 cycles (the maximum of the two).

LLVM-MCA and OSACA predict ports 0 and 5 will be used evenly. Naively, I'd expect that to be the case when averaging across many iterations of the loop. However, I'm unfamiliar with how these pipeline models work, so at the very least different predictions are interesting.

Here is some very similar assembly (the difference being that the address calculations are fused with the loads), where now all three tools predict the work to be split evenly across ports 0 and 5.

Something else interesting is that LLVM-MCA claims the broadcasts require more ports than uiCA claims, leading it to produce lower throughput estimates.

However, LLVM-MCA is also definitely wrong about port use in some cases where uiCA is right (AVX512 floating point on Ice Lake client / Tiger Lake / Rocket Lake).

andreas-abel commented 3 years ago

You can find some details on how the ports are assigned in Section 2.12 of our paper: https://arxiv.org/pdf/2107.14210.pdf
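As a rough illustration of why the split can come out uneven, here is a deliberately simplified toy model in Julia. This is not the algorithm uiCA implements (the paper describes the actual rename-time behavior, which also depends on a uop's position within its rename group); it only shows that per-uop greedy assignment of an odd number of uops over two ports necessarily produces a 13/14 split rather than a fractional 13.5/13.5 average:

# Toy greedy port assignment: each uop goes to whichever of its
# allowed ports has received the fewest uops so far. With an odd
# number of uops that can all go to ports 0 or 5, one port must
# end up with one more uop than the other.
function assign_ports(uops)
    counts = Dict{Int,Int}()
    for allowed in uops              # `allowed` is a tuple of port numbers
        best = first(allowed)
        for p in allowed
            if get(counts, p, 0) < get(counts, best, 0)
                best = p
            end
        end
        counts[best] = get(counts, best, 0) + 1
    end
    return counts
end

assign_ports(fill((0, 5), 27))       # => one port gets 14 uops, the other 13

If the same uneven assignment repeats every iteration, the busier port bounds the throughput, which is consistent with the estimate of 14 described in the issue.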

> Something else interesting is that LLVM-MCA claims the broadcasts require more ports than uiCA claims, leading it to produce lower throughput estimates.

You can find the microbenchmarks that were used to determine the port usage of individual instructions (along with corresponding measurement results on the actual hardware) if you click on the links in uiCA's output. For the broadcast instructions, for example, you can see at https://www.uops.info/html-tp/CLX/VBROADCASTSD_ZMM_M64-Measurements.html that they just use 1 uop.

chriselrod commented 3 years ago

Great work and thanks for the paper.

I've slowly been working on rewriting a loop-optimization library (LoopVectorization.jl). Among the transforms I'd like it to consider is balancing mixtures of function variants: e.g., vdivpd vs. approximations such as vrcp14pd plus a couple of Newton-Raphson iterations, or unrolling and mixing special-function implementations where the different options use different mixes of expensive instructions, such as vgatherpd for lookup tables versus vdivpd or the reciprocal approximations (and also featuring variable polynomial sizes in the kernels). The relative costs of vdivpd, vgatherpd, and fma instructions vary a lot from one architecture to another. It would probably be possible to tune some simple heuristics to do fairly well in general (it is striking how well the baseline model did relative to many of the much more sophisticated alternatives), but the kind of accurate modeling uiCA does sounds like an ideal solution.
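For example, the division-vs.-approximation trade-off looks roughly like this (a minimal scalar sketch in Julia; a vectorized version would use vrcp14pd for the initial ~14-bit estimate, and each refinement step lowers to two FMAs):

# Newton-Raphson refinement of a reciprocal estimate x ≈ 1/a:
# x' = x + x*(1 - a*x) roughly doubles the number of correct bits,
# so a ~14-bit vrcp14pd estimate needs two steps to reach full
# Float64 precision (14 -> 28 -> 56 > 53 bits).
function refine_recip(a, x)
    e = fma(-a, x, 1.0)   # residual: 1 - a*x
    return fma(x, e, x)   # x + x*e
end

a  = 3.0
x0 = 0.3333                    # stand-in for a low-precision vrcp14pd estimate
x1 = refine_recip(a, x0)
x2 = refine_recip(a, x1)       # ≈ 1/3 to full Float64 precision

Whether this beats a plain vdivpd depends on the relative latencies and port pressure of the surrounding code on the target microarchitecture, which is exactly the kind of question an accurate model could answer.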

Another example where uiCA currently does better than LoopVectorization.jl and the other tools is a simple summation. LoopVectorization's extremely simple model (basically: dep-chain instruction latency / dep-chain instruction throughput = 4 / 0.5 = unroll by 8) favors unrolling the summation by 8. This agrees with llvm-mca and OSACA, but not with uiCA: Unrolling by 4: https://bit.ly/3A2SbPu Unrolling by 8: https://bit.ly/2VgKLcA
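For concreteness, the unroll-by-4 kernel has roughly the following shape (a hypothetical scalar reconstruction of sum4; the real code operates on 8-wide double vectors, so each accumulator is a zmm register and each add a vaddpd):

# Four independent accumulators, so four add dependence chains
# can be in flight at once instead of one.
function sum4(x::AbstractVector{Float64})
    a1 = a2 = a3 = a4 = 0.0
    i, n = 1, length(x)
    @inbounds while i + 3 <= n
        a1 += x[i]
        a2 += x[i+1]
        a3 += x[i+2]
        a4 += x[i+3]
        i += 4
    end
    @inbounds while i <= n   # remainder loop
        a1 += x[i]
        i += 1
    end
    return (a1 + a2) + (a3 + a4)
end

sum8 would be analogous, with eight accumulators and a step of 8.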

llvm-mca and OSACA predict roughly the same throughput for the loops unrolled by 4 and by 8. The loop unrolled by 8 requires half as many iterations, making it theoretically up to twice as fast. uiCA instead predicts r-throughputs of 2.5 and 4.67 cycles, respectively (2.5/4 ≈ 0.63 vs. 4.67/8 ≈ 0.58 cycles per element), meaning unrolling by 8 is predicted to buy only about a 7% improvement.

2.5 looks suspiciously low, in the same way a value below 0.5 would for a loop body whose instructions all have memory operands (there are only two load ports). Given the 4-cycle latency of vaddpd, and that each of the four accumulator chains advances once per iteration, how is a throughput below 4 cycles possible when each iteration depends on the previous one?

Are such loop carried dependencies modeled by uiCA?

If I benchmark sums over 4096 elements (32 KiB, the L1d cache size of a Cascade Lake CPU):

julia> @benchmark sum4($x4096) # unrolled by 4
BenchmarkTools.Trial: 10000 samples with 864 evaluations.
 Range (min … max):  137.405 ns … 187.853 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     137.625 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   137.806 ns ±   0.878 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▆ ▇█  ▆
  ▂▂▂▃█▃██▅▄█▂▃▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▂▂▂▂▂▃▃▃▃▃▃▃▂▂ ▃
  137 ns           Histogram: frequency by time          139 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum8($x4096) # unrolled by 8
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):  88.784 ns … 150.542 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.266 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.187 ns ±   1.108 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▇█    ▃▃▂         ▆▄       ▂▃
  ▁▁▁▁▁▂▂▂▄██▅▂▁▃███▇▂▂▂▁▁▁▂▅██▇▂▂▅▄▃▄███▄▄▄▂▂▂▂▃▂▂▂▂▂▁▁▂▁▁▁▁▁ ▃
  88.8 ns         Histogram: frequency by time           92 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I see roughly a 50% advantage for unrolling by 8, falling in between uiCA's prediction and the other tools'.

The benchmark draws a large number of samples, but it probably reaches a different steady state: each individual loop runs for only 128 or 64 iterations (when unrolled by 4 or 8, respectively; 4096 elements / (8 doubles per vector × 4) = 128) before switching basic blocks (i.e., to the blocks that reduce the vector accumulators to a scalar and that initialize them).

That's a far cry from the 10_000 iterations you used to achieve steady state.

Either way, uiCA still favors unrolling by 8 here and would result in making the same decision.

But this also raises some questions about assumptions. In this example, the assumptions of steady state and of memory reads hitting the L1d are in conflict: increasing the number of iterations means the data no longer fits in the L1d. I'm not sure what a compiler should do here. For some reason, unrolling by 4 is faster when the data doesn't fit in the L1 cache. Summing over all lengths from 1 to 2048 in a random order takes about 92 vs. 77 microseconds when unrolled by 4 vs. 8.
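The harness for that last number was roughly the following (a hypothetical reconstruction; sumf stands for either the unroll-by-4 or unroll-by-8 kernel):

using Random

# Sum prefixes of every length from 1 to 2048 in a random order,
# so each call runs the hot loop for only a handful of iterations
# and the steady-state assumption rarely applies.
function sum_all_lengths(sumf, x::Vector{Float64})
    s = 0.0
    for n in shuffle(1:2048)
        s += sumf(view(x, 1:n))
    end
    return s
end

# e.g. sum_all_lengths(sum4, x4096) vs. sum_all_lengths(sum8, x4096)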

Do you have any ideas or suggestions on what sort of modeling should be done here for decisions like how much to unroll a loop?

andreas-abel commented 3 years ago

> 2.5 looks suspiciously low.

Indeed. You discovered a bug that should now be fixed.