camel-cdr / rvv-bench

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code
MIT License

Add benches for strided load/store with different strides #12

Open wangpc-pp opened 7 months ago

wangpc-pp commented 7 months ago

I just found an issue on the K230 while doing some auto-vectorization tests with https://github.com/UoB-HPC/TSVC_2. The vectorized s1115 kernel looks like:

.LBB9_7:                                # %vector.ph
    andi    a6, s6, 256
    vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                                # %vector.body
    vl2re32.v   v8, (a4)
    vlse32.v    v10, (a5), s11          # s11 = 1024
    vl2re32.v   v12, (a2)
    vfmacc.vv   v12, v8, v10
    vs2r.v  v12, (a4)
    add a4, a4, s0
    add a2, a2, s0
    sub a3, a3, s9
    add a5, a5, s2
    bnez    a3, .LBB9_8
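For context, the scalar kernel being vectorized is roughly the following (paraphrased from TSVC_2's s1115; the exact source may differ slightly). With LEN_2D = 256 and 4-byte floats, the column access cc[j][i] advances by 256 * 4 = 1024 bytes per iteration of j, which is the 1024-byte stride (s11) in the vectorized code above:

```c
#define LEN_2D 256
static float aa[LEN_2D][LEN_2D], bb[LEN_2D][LEN_2D], cc[LEN_2D][LEN_2D];

// Paraphrase of TSVC_2's s1115: the cc[j][i] access walks a column,
// so each load is strided by LEN_2D * sizeof(float) = 1024 bytes.
void s1115(void) {
    for (int i = 0; i < LEN_2D; i++)
        for (int j = 0; j < LEN_2D; j++)
            aa[i][j] = aa[i][j] * cc[j][i] + bb[i][j];
}
```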

It seems that strided loads/stores with strides in [1024, 4096] perform worse. A simple probe program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DEFINE_VLSE(LMUL)                                                      \
  __attribute__((always_inline)) static inline void vlse_##LMUL(int *base,    \
                                                                int stride) { \
    /* vsetvli writes t0, so it must be declared as clobbered; v0 is also */  \
    /* overwritten, but vector-register clobbers are not yet supported by */  \
    /* all compilers, so only a "memory" clobber keeps the asm ordered.   */  \
    __asm__ volatile("vsetvli    t0, zero, e8, " #LMUL ", ta, ma\n"           \
                     "vlse8.v    v0, (%0), %1" ::"r"(base), "r"(stride)       \
                     : "t0", "memory");                                       \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3)
    return 1;
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);

  // __attribute__((aligned(64)))
  int data[64 * stride];

#define BENCH_VLSE(LMUL)                                                       \
  {                                                                            \
    clock_t start = clock();                                                   \
    for (int i = 0; i < times; i++)                                            \
      vlse_##LMUL(data, stride);                                               \
    clock_t end = clock();                                                     \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,               \
           (long)(end - start)); /* clock_t need not be long, so cast */      \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
}
The result is like this (the abnormal results were highlighted in the original):

stride   MF8      MF4      MF2       M1        M2        M4         M8
4        38479    51332    76931     128148    230645    435399     844990
8        38521    51333    76922     128128    230579    435395     844891
16       38530    51323    76962     128129    230566    435341     845195
32       38511    51373    76932     128150    230656    435388     845083
64       38529    51322    76947     128205    230624    435417     23954097
128      38517    51338    76926     128128    230608    12351222   31148420
256      38487    51288    76945     128152    5824701   15177587   34006290
512      38526    51292    76943     2855170   7439032   16828930   35689412
1024     38511    51324    1152269   3424329   7957662   17053724   35144136
2048     38520    224200   709725    1396708   4226251   8330476    16689498
4096     38507    317053   640199    1507778   3093916   6358825    12725241
8192     38499    51349    76956     128285    1255252   2483829    4943195
16384    38525    51329    76975     128337    1255245   2484334    4975494

It's weird that performance recovers once the stride exceeds 4096, so this issue may not be caused by crossing cache lines or pages. It may be an issue with the hardware prefetcher.

So my request is: can we add some benchmarks for this kind of scenario?

camel-cdr commented 7 months ago

I'll look into it; this could be a new load/store benchmark under the instructions folder. I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.

The behavior is indeed quite weird, but how could it be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? I mean, it's repeatedly accessing the same few addresses.

IIRC, you could adjust the prefetch mode in the C920, so the C908 might support that as well.

wangpc-pp commented 7 months ago

The behavior is indeed quite weird, but how could that be a problem with the cachelines or prefetcher? Shouldn't the CPU be easily able to hold 16 cache lines, for the m1 case. I mean, it's repeatedly accessing the same few addresses.

Currently this is just a guess (the L1 D-cache misses increase a lot), and I have sent feedback to T-Head.