OpenXiangShan / GEM5

BSD 3-Clause "New" or "Revised" License
67 stars 25 forks source link

GEMM kernel complied by AM execute incorrectly in XS-GEM5 #183

Open DCliuzhe opened 1 month ago

DCliuzhe commented 1 month ago

I have implemented a GEMM kerenl using RVV and complie it into a bare metal using AM. Before simulation, I deleted the function calls that were not aligned with the RTL and depended on the vector destination register fake data in the issue_queue.cc file. However, the output matrix elements in GEM5 simulation results are all 0, which are expected to 128. I also simulated it on the original GEM5, and the result was correct. My source code are as followed:

#include <riscv_vector.h>
#include <stdio.h>
#include <stdlib.h>
#include <klib.h>

void matmul(float *a, float *b, float *c, int M, int N, int K) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
        int k = 0;
        c[i * N + j] = 0;
        for(size_t vl; k < K; ){
            vl = __riscv_vsetvl_e32m1(K - k);  //动态获取向量长度
            vfloat32m1_t vec_a = __riscv_vle32_v_f32m1(&a[i * K + k], vl);  
            vfloat32m1_t vec_b = __riscv_vle32_v_f32m1(&b[j * K + k], vl);  //加载a向量和b向量
            vfloat32m1_t vec_s = __riscv_vfmul_vv_f32m1(vec_a, vec_b, vl); //做向量点乘
            vfloat32m1_t vsum = __riscv_vfredusum_vs_f32m1_f32m1(vec_s, __riscv_vfmv_s_f_f32m1(0.0f, vl), vl); //进行向量规约和
            float sum = __riscv_vfmv_f_s_f32m1_f32(vsum); //获得部分和结果
            printf("sum : %f \n", sum);
            c[i * N + j] += sum;
            k += vl;
        }
    }
  }
}

int main() {
    int M = 16;
    int N = 16;
    int K = 64;

    float *a = (float*)malloc(M * K * sizeof(float));
    float *b = (float*)malloc(N * K * sizeof(float));
    float *c = (float*)malloc(M * N * sizeof(float));

    for(int i = 0; i < M * K; i ++)
        a[i] = 1.0f;

    for(int i = 0; i < N * K; i ++)
        b[i] = 2.0f;

    matmul(a, b, c, M, N, K);

    for(int i = 0; i < M * N; i ++)
        printf("%f ", c[i]);

    return 0;
}

Through staged debugging, I determined that the problem occurred in the step of vector reduction. How can I fix it?

DCliuzhe commented 1 month ago

I opened difftest and tested it, and found that GEM5 made an error when executing the vfredusum instruction. GEM5 split it into two micro_ops for execution. The result was completely determined by the second micro_op, and the wrong result was obtained. I would like to ask why it was split into two micro_ops and the execution was wrong.

heap start = 8000a000
build/RISCV/cpu/base.cc:970: warn: Inst [sn:18228] pc: 0x800001b8, msg: [sn:18228 pc:0x800001b8] vfredusum_vs_micro v24, v25,vtmp0, res: 0000000000000000_0000000000000000
build/RISCV/cpu/base.cc:972: warn: May be diff at v24
 Ref  value: ffffffffffffffff_ffffffff41000000
 GEM5 value: 0000000000000000_0000000000000000
build/RISCV/cpu/base.hh:699: warn: In CPU0: NEMU PC: 0x800001b8, GEM5 PC: 0x800001b8, inst: vfredusum_vs_micro v24, v25,vtmp0
  $0: 0x0000000000000000   ra: 0x0000000080000268   sp: 0x0000000080009fa0   gp: 0x0000000000000000 
  tp: 0x0000000000000000   t0: 0x0000000000000000   t1: 0x000000008000a000   t2: 0x0000000000000000 
  s0: 0x0000000000000000   s1: 0x0000000000000010   a0: 0x000000008000c000   a1: 0x0000000000000040 
  a2: 0x000000008000a000   a3: 0x000000008000b000   a4: 0x0000000000000000   a5: 0x0000000000000004 
  a6: 0x0000000000000000   a7: 0x0000000000000000   s2: 0x000000008000c000   s3: 0x0000000000000010 
  s4: 0x0000000000000010   s5: 0x0000000000000000   s6: 0x0000000000000000   s7: 0x0000000000000000 
  s8: 0x0000000000000000   s9: 0x0000000000000000  s10: 0x0000000000000000  s11: 0x0000000000000000 
  t3: 0x000000008000b000   t4: 0x0000000000000000   t5: 0x000000008000c040   t6: 0x0000000000000040 
 ft0: 0xffffffff00000000  ft1: 0xffffffff00000000  ft2: 0xffffffff00000000  ft3: 0xffffffff00000000 
 ft4: 0xffffffff00000000  ft5: 0xffffffff00000000  ft6: 0xffffffff00000000  ft7: 0xffffffff00000000 
 fs0: 0xffffffff00000000  fs1: 0xffffffff00000000  fa0: 0xffffffff00000000  fa1: 0xffffffff00000000 
 fa2: 0xffffffff00000000  fa3: 0xffffffff00000000  fa4: 0xffffffff00000000  fa5: 0xffffffff00000000 
 fa6: 0xffffffff00000000  fa7: 0xffffffff00000000  fs2: 0xffffffff00000000  fs3: 0xffffffff00000000 
 fs4: 0xffffffff00000000  fs5: 0xffffffff00000000  fs6: 0xffffffff00000000  fs7: 0xffffffff00000000 
 fs8: 0xffffffff00000000  fs9: 0xffffffff00000000 fs10: 0xffffffff00000000 fs11: 0xffffffff00000000 
 ft8: 0xffffffff00000000  ft9: 0xffffffff00000000 ft10: 0xffffffff00000000 ft11: 0xffffffff00000000 
pc: 0x00000000800001bc mstatus: 0x8000000a00006600 mcause: 0x0000000000000000 mepc: 0x0000000000000000
                       sstatus: 0x8000000200006600 scause: 0x0000000000000000 sepc: 0x0000000000000000
satp: 0x0000000000000000
mip: 0x0000000000000000 mie: 0x0000000000000000 mscratch: 0x0000000000000000 sscratch: 0x0000000000000000
mideleg: 0x0000000000000000 medeleg: 0x0000000000000000
mtval: 0x0000000000000000 stval: 0x0000000000000000 mtvec: 0x0000000000000000 stvec: 0x0000000000000000
fcsr: 0x0000000000000000
privilege mode:3
pmp: 16 entries active, details:
 0: cfg:0x00 addr:0x0000000000000000| 1: cfg:0x00 addr:0x0000000000000000
 2: cfg:0x00 addr:0x0000000000000000| 3: cfg:0x00 addr:0x0000000000000000
 4: cfg:0x00 addr:0x0000000000000000| 5: cfg:0x00 addr:0x0000000000000000
 6: cfg:0x00 addr:0x0000000000000000| 7: cfg:0x00 addr:0x0000000000000000
 8: cfg:0x00 addr:0x0000000000000000| 9: cfg:0x00 addr:0x0000000000000000
10: cfg:0x00 addr:0x0000000000000000|11: cfg:0x00 addr:0x0000000000000000
12: cfg:0x00 addr:0x0000000000000000|13: cfg:0x00 addr:0x0000000000000000
14: cfg:0x00 addr:0x0000000000000000|15: cfg:0x00 addr:0x0000000000000000
v0 : 0x0000000000000000_0000000000000000  v1 : 0x0000000000000000_0000000000000000  
v2 : 0x0000000000000000_0000000000000000  v3 : 0x0000000000000000_0000000000000000  
v4 : 0x0000000000000000_0000000000000000  v5 : 0x0000000000000000_0000000000000000  
v6 : 0x0000000000000000_0000000000000000  v7 : 0x0000000000000000_0000000000000000  
v8 : 0x0000000000000000_0000000000000000  v9 : 0x0000000000000000_0000000000000000  
v10: 0x0000000000000000_0000000000000000  v11: 0x0000000000000000_0000000000000000  
v12: 0x0000000000000000_0000000000000000  v13: 0x0000000000000000_0000000000000000  
v14: 0x0000000000000000_0000000000000000  v15: 0x0000000000000000_0000000000000000  
v16: 0x0000000000000000_0000000000000000  v17: 0x0000000000000000_0000000000000000  
v18: 0x0000000000000000_0000000000000000  v19: 0x0000000000000000_0000000000000000  
v20: 0x0000000000000000_0000000000000000  v21: 0x0000000000000000_0000000000000000  
v22: 0x0000000000000000_0000000000000000  v23: 0x0000000000000000_0000000000000000  
v24: 0xffffffffffffffff_ffffffff41000000  v25: 0x0000000000000000_0000000000000000  
v26: 0x4000000040000000_4000000040000000  v27: 0x0000000000000000_0000000000000000  
v28: 0x0000000000000000_0000000000000000  v29: 0x0000000000000000_0000000000000000  
v30: 0x0000000000000000_0000000000000000  v31: 0x0000000000000000_0000000000000000  
vtype: 0x00000000000000d0 vstart: 0x0000000000000000 vxsat: 0x0000000000000000
vxrm: 0x0000000000000000 vl: 0x0000000000000004 vcsr: 0x0000000000000000
build/RISCV/cpu/base.cc:1334: warn: gem5-rRegsDisplay : 
  $0 :                0   ra :         80000268   sp :         80009fa0   gp :                0 
  tp :                0   t0 :                0   t1 :         8000a000   t2 :                0 
  s0 :                0   s1 :               10   a0 :         8000c000   a1 :               40 
  a2 :         8000a000   a3 :         8000b000   a4 :                0   a5 :                4 
  a6 :                0   a7 :                0   s2 :         8000c000   s3 :               10 
  s4 :               10   s5 :                0   s6 :                0   s7 :                0 
  s8 :                0   s9 :                0  s10 :                0  s11 :                0 
  t3 :         8000b000   t4 :                0   t5 :         8000c040   t6 :               40 
build/RISCV/cpu/base.cc:1347: warn: gem5-fRegsDisplay : 
 ft0 : ffffffff00000000  ft1 : ffffffff00000000  ft2 : ffffffff00000000  ft3 : ffffffff00000000 
 ft4 : ffffffff00000000  ft5 : ffffffff00000000  ft6 : ffffffff00000000  ft7 : ffffffff00000000 
 fs0 : ffffffff00000000  fs1 : ffffffff00000000  fa0 : ffffffff00000000  fa1 : ffffffff00000000 
 fa2 : ffffffff00000000  fa3 : ffffffff00000000  fa4 : ffffffff00000000  fa5 : ffffffff00000000 
 fa6 : ffffffff00000000  fa7 : ffffffff00000000  fs2 : ffffffff00000000  fs3 : ffffffff00000000 
 fs4 : ffffffff00000000  fs5 : ffffffff00000000  fs6 : ffffffff00000000  fs7 : ffffffff00000000 
 fs8 : ffffffff00000000  fs9 : ffffffff00000000 fs10 : ffffffff00000000 fs11 : ffffffff00000000 
 ft8 : ffffffff00000000  ft9 : ffffffff00000000 ft10 : ffffffff00000000 ft11 : ffffffff00000000 
build/RISCV/cpu/base.cc:1398: warn: gem5-CsrDisplay : 
pc :         800001b8      mstatus :        a00000000 mcause :                0 mepc    :                0
                           sstatus :        200000000 scause :                0 sepc    :                0
satp    :                0
mip     :                0 mie     :                0 mscratch:                0 sscratch:                0
mideleg :                0 medeleg :                0
mtval   :                0 stval   :                0 mtvec   :                0 stvec   :                0
privilege mode : 3
build/RISCV/cpu/base.cc:1423: warn: gem5-VectorDisplay : 
v00 : 0000000000000000_0000000000000000 v01 : 0000000000000000_0000000000000000
v02 : 0000000000000000_0000000000000000 v03 : 0000000000000000_0000000000000000
v04 : 0000000000000000_0000000000000000 v05 : 0000000000000000_0000000000000000
v06 : 0000000000000000_0000000000000000 v07 : 0000000000000000_0000000000000000
v08 : 0000000000000000_0000000000000000 v09 : 0000000000000000_0000000000000000
v10 : 0000000000000000_0000000000000000 v11 : 0000000000000000_0000000000000000
v12 : 0000000000000000_0000000000000000 v13 : 0000000000000000_0000000000000000
v14 : 0000000000000000_0000000000000000 v15 : 0000000000000000_0000000000000000
v16 : 0000000000000000_0000000000000000 v17 : 0000000000000000_0000000000000000
v18 : 0000000000000000_0000000000000000 v19 : 0000000000000000_0000000000000000
v20 : 0000000000000000_0000000000000000 v21 : 0000000000000000_0000000000000000
v22 : 0000000000000000_0000000000000000 v23 : 0000000000000000_0000000000000000
v24 : 0000000000000000_0000000000000000 v25 : 0000000000000000_0000000000000000
v26 : 4000000040000000_4000000040000000 v27 : 0000000000000000_0000000000000000
v28 : 0000000000000000_0000000000000000 v29 : 0000000000000000_0000000000000000
v30 : 0000000000000000_0000000000000000 v31 : 0000000000000000_0000000000000000
vtype   :               d0 vstart   :                0  vxsat   :                0
vxrm    :                0 vl       :                4  vcsr    :                0

build/RISCV/cpu/base.hh:702: warn: start dump last 20 committed msg
build/RISCV/cpu/base.hh:705: warn: V [sn:18198 pc:0x8000017c] c_li t4, 0, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18199 pc:0x8000017e] fsw fa3, 0(a0), paddr: 0x8000c000
build/RISCV/cpu/base.hh:705: warn: V [sn:18200 pc:0x80000182] bge zero, a1, 92
build/RISCV/cpu/base.hh:705: warn: V [sn:18212 pc:0x80000186] flw fa5, 0(a0), res: 0xffffffff00000000, paddr: 0x8000c000
build/RISCV/cpu/base.hh:705: warn: V [sn:18213 pc:0x8000018a] addiw a6, t4, 0, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18214 pc:0x8000018e] c_li a4, 0, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18215 pc:0x80000190] addw a2, a7, a4, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18216 pc:0x80000194] addw a3, a6, a4, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18217 pc:0x80000198] c_slli a2, 2, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18218 pc:0x8000019a] c_slli a3, 2, res: 0
build/RISCV/cpu/base.hh:705: warn: V [sn:18219 pc:0x8000019c] subw a5, a1, a4, res: 0x40
build/RISCV/cpu/base.hh:705: warn: V [sn:18220 pc:0x800001a0] c_add a2, t1, res: 0x8000a000
build/RISCV/cpu/base.hh:705: warn: V [sn:18221 pc:0x800001a2] c_add a3, t3, res: 0x8000b000
build/RISCV/cpu/base.hh:705: warn: V [sn:18222 pc:0x800001a4] vsetvli a5, a5, e32, m1, ta, ma, res: 0x4
build/RISCV/cpu/base.hh:705: warn: V [sn:18223 pc:0x800001a8] vle32_v_micro v24, 0(a2), zero, res: 3f8000003f800000_3f8000003f800000, paddr: 0x8000a000
build/RISCV/cpu/base.hh:705: warn: V [sn:18224 pc:0x800001ac] vle32_v_micro v26, 0(a3), zero, res: 4000000040000000_4000000040000000, paddr: 0x8000b000
build/RISCV/cpu/base.hh:705: warn: V [sn:18225 pc:0x800001b0] vmv_v_i_micro v25, v0, 0, res: 0000000000000000_0000000000000000
build/RISCV/cpu/base.hh:705: warn: V [sn:18226 pc:0x800001b4] vfmul_vv_micro v24, v24, v26, res: 4000000040000000_4000000040000000
build/RISCV/cpu/base.hh:705: warn: V [sn:18227 pc:0x800001b8] vfredusum_vs_micro v24, v24, res: 0000000000000000_0000000041200000
build/RISCV/cpu/base.hh:705: warn: V [sn:18228 pc:0x800001b8] vfredusum_vs_micro v24, v25,vtmp0, res: 0000000000000000_0000000000000000
build/RISCV/cpu/base.cc:1304: panic: Difftest failed!
tastynoob commented 1 month ago

It seem like vectorReduceFloatFormat has bug, because of add vectorOldVDElim May you avoid using vector reduce inst type?

DCliuzhe commented 1 month ago

Thanks for your reply, I will try not to use vectorReduce next time. Please mention it in your commit logs after you fix the bug.