camel-cdr / rvv-bench

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code
MIT License

Add benches for strided load/store with different strides #12

Open wangpc-pp opened 7 months ago

wangpc-pp commented 7 months ago

I just found an issue on the K230 while doing some auto-vectorization tests with https://github.com/UoB-HPC/TSVC_2. The vectorized s1115 kernel looks like:

.LBB9_7:                                # %vector.ph
    andi    a6, s6, 256
    vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                                # %vector.body
    vl2re32.v   v8, (a4)
    vlse32.v    v10, (a5), s11          # s11 = 1024
    vl2re32.v   v12, (a2)
    vfmacc.vv   v12, v8, v10
    vs2r.v  v12, (a4)
    add a4, a4, s0
    add a2, a2, s0
    sub a3, a3, s9
    add a5, a5, s2
    bnez    a3, .LBB9_8
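For context, the scalar kernel being vectorized is roughly the following (paraphrased from TSVC_2's s1115; the exact source may differ slightly). With LEN_2D = 256 and 4-byte floats, the column access cc[j][i] advances by 256 * 4 = 1024 bytes per iteration of j, which is the 1024-byte stride (s11) in the vectorized code above:

```c
#define LEN_2D 256
static float aa[LEN_2D][LEN_2D], bb[LEN_2D][LEN_2D], cc[LEN_2D][LEN_2D];

// Paraphrase of TSVC_2's s1115: the cc[j][i] access walks a column,
// so each load is strided by LEN_2D * sizeof(float) = 1024 bytes.
void s1115(void) {
    for (int i = 0; i < LEN_2D; i++)
        for (int j = 0; j < LEN_2D; j++)
            aa[i][j] = aa[i][j] * cc[j][i] + bb[i][j];
}
```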

It seems that strided loads/stores with strides in [1024, 4096] perform worse. A simple probe program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DEFINE_VLSE(LMUL)                                                      \
  __attribute__((always_inline)) static inline void vlse_##LMUL(int *base,    \
                                                                int stride) { \
    /* vsetvli writes t0, so it must be declared as clobbered; v0 is also */  \
    /* overwritten, but vector-register clobbers are not yet supported by */  \
    /* all compilers, so only a "memory" clobber keeps the asm ordered.   */  \
    __asm__ volatile("vsetvli    t0, zero, e8, " #LMUL ", ta, ma\n"           \
                     "vlse8.v    v0, (%0), %1" ::"r"(base), "r"(stride)       \
                     : "t0", "memory");                                       \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3)
    return 1;
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);

  // __attribute__((aligned(64)))
  int data[64 * stride];

#define BENCH_VLSE(LMUL)                                                       \
  {                                                                            \
    clock_t start = clock();                                                   \
    for (int i = 0; i < times; i++)                                            \
      vlse_##LMUL(data, stride);                                               \
    clock_t end = clock();                                                     \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,               \
           (long)(end - start)); /* clock_t need not be long, so cast */      \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
}
The result is like this (the abnormal results were highlighted in the original):

stride   MF8      MF4      MF2       M1        M2        M4         M8
4        38479    51332    76931     128148    230645    435399     844990
8        38521    51333    76922     128128    230579    435395     844891
16       38530    51323    76962     128129    230566    435341     845195
32       38511    51373    76932     128150    230656    435388     845083
64       38529    51322    76947     128205    230624    435417     23954097
128      38517    51338    76926     128128    230608    12351222   31148420
256      38487    51288    76945     128152    5824701   15177587   34006290
512      38526    51292    76943     2855170   7439032   16828930   35689412
1024     38511    51324    1152269   3424329   7957662   17053724   35144136
2048     38520    224200   709725    1396708   4226251   8330476    16689498
4096     38507    317053   640199    1507778   3093916   6358825    12725241
8192     38499    51349    76956     128285    1255252   2483829    4943195
16384    38525    51329    76975     128337    1255245   2484334    4975494

It's weird that performance recovers once the stride exceeds 4096, so this issue may not be caused by crossing cache lines or pages. It may be an issue with the hardware prefetcher.

So my request is: can we add some benchmarks for this kind of scenario?

camel-cdr commented 7 months ago

I'll look into it; this could be a new load/store benchmark under the instructions folder. I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.

The behavior is indeed quite weird, but how could it be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? I mean, it's repeatedly accessing the same few addresses.

IIRC, you could adjust the prefetch mode in the C920, so the C908 might support that as well.

wangpc-pp commented 7 months ago

The behavior is indeed quite weird, but how could that be a problem with the cachelines or prefetcher? Shouldn't the CPU be easily able to hold 16 cache lines, for the m1 case. I mean, it's repeatedly accessing the same few addresses.

Currently this is just a guess (the L1 D-cache misses increase a lot), and I have sent feedback to T-Head.