camel-cdr / rvv-bench

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code
MIT License

Question about RVV instruction throughput #13

Open zhongjuzhe opened 6 months ago

zhongjuzhe commented 6 months ago

Hi, I saw each RVV instruction throughput result here: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html

If I want to measure the execution throughput of each RVV instruction on another RISC-V board, could you give me some guidance?

I also wonder how you measure the execution throughput.

Thanks,

camel-cdr commented 6 months ago

Hi @zhongjuzhe, if you click on "Example measurement code for vadd.vx" you can see an example of what code I use to measure throughput.
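
As a rough illustration of the arithmetic behind this kind of measurement (a minimal sketch, not the code this repo uses): read a cycle counter, run a long sequence of independent copies of the instruction, read the counter again, and divide. The helper name cycles_per_insn and its overhead parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not taken from rvv-bench.
 * start/end:  cycle counter readings taken before and after the sequence
 * overhead:   cycles measured for an empty sequence (loop and counter reads)
 * n_insns:    number of instructions executed between the two readings */
static double
cycles_per_insn(uint64_t start, uint64_t end, uint64_t overhead, uint64_t n_insns)
{
    assert(n_insns > 0);
    uint64_t elapsed = end - start;      /* unsigned math tolerates one wraparound */
    if (elapsed < overhead) return 0.0;  /* measurement noise, clamp to zero */
    return (double)(elapsed - overhead) / (double)n_insns;
}
```

For example, 1024 elapsed cycles with 24 cycles of overhead over 1000 instructions gives 1.0 cycle per instruction. Whether the sequence uses independent or dependent registers decides if the number reflects throughput or latency.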

To use this repo yourself, you need to:

Since Linux disabled user-level performance counter access in later versions, you need to re-enable it:

If you are on a more obscure platform you may need to modify ./nolibc.h to work for it.

I'll try to update the README soon, and add a wiki page for instructions on different configurations.

Please tell me if you still run into problems.

zhongjuzhe commented 6 months ago

Is it possible to run instructions/rvv on bare metal?

I tried the following command: clang -march=rv64gcv -O3 main.c

but it failed to compile, with several undefined references:

undefined reference to 'bench_types'. ....

etc

camel-cdr commented 6 months ago

Yes, it is. You'll have to replace each "rdcycle rd" instruction with "csrr rd, mcycle", and implement memwrite and the proper entry to main in nolibc.h.
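
In C, that counter switch could look roughly like the sketch below. The READ_MCYCLE macro matches the -DREAD_MCYCLE flag used for the bare-metal build in this thread; the function name read_cycles and the clock() fallback for non-RISC-V hosts are assumptions added so the sketch compiles off-target.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper: picks the counter at compile time. */
static inline uint64_t
read_cycles(void)
{
#if defined(__riscv) && defined(READ_MCYCLE)
    unsigned long c;
    /* machine-mode counter, for bare metal; on RV32 this is the low 32 bits */
    __asm__ volatile("csrr %0, mcycle" : "=r"(c));
    return c;
#elif defined(__riscv)
    unsigned long c;
    /* user-level counter; needs counter access enabled by the OS */
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
#else
    /* host fallback: processor time, NOT a cycle count */
    return (uint64_t)clock();
#endif
}
```

Take one reading before and one after the code under test and subtract the two.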

Your command doesn't work because you also need to preprocess (with m4) and build main.S; just look at how the Makefile does it.

I'll add some examples this weekend, including one for running baremetal on the t1 rtl simulation. That should help.

camel-cdr commented 6 months ago

I've updated the README, but didn't get to writing the wiki, because the new t1 image doesn't work as expected. I'll create it once that has been fixed.

For now, here is how I built the baremetal benchmark for it before. You probably need a different compiler configuration and memwrite implementation, but this should be roughly what you need to modify for a baremetal system.

You should already have a linker configuration and entry point if you run on bare metal, so use those instead of the t1 specific ones here.

# config.mk
WARN=-Wall -Wextra -Wno-unused-function -Wno-unused-parameter
CC=clang
CFLAGS=--target=riscv32 -march=rv32gc_zve32f -mabi=ilp32 -mno-relax -static -mcmodel=medany -fvisibility=hidden -nostdlib -fno-builtin -ffreestanding -fno-PIC ${WARN} -T /t1.ld /t1_main.S -DCUSTOM_HOST  -DREAD_MCYCLE

# t1_main.S
# from: https://github.com/chipsalliance/t1/blob/master/tests/t1_main.S
.globl _start
_start:
    li a0, 0x2200 # VS&FS
    csrs mstatus, a0
    csrwi vcsr, 0
    #csrwi mcounteren,7
    li a0, -8
    csrw  mcountinhibit,a0
    #csrr a0, mcycle

    la sp, __stacktop

    // no ra to save
    call nolibc_start

    // exit
    li a0, 0x10000000
    li a1, -1
    sw a1, 4(a0)
    csrwi 0x7cc, 0

    .p2align 2

// t1.ld
// from https://github.com/chipsalliance/t1/blob/master/tests/t1.ld
OUTPUT_ARCH(riscv)
ENTRY(_start)

MEMORY {
  SCALAR (RWX) : ORIGIN = 0x20000000, LENGTH = 512M /* put first to set it as default */
  MMIO   (RW)  : ORIGIN = 0x00000000, LENGTH = 512M
  DDR    (RW)  : ORIGIN = 0x40000000, LENGTH = 2048M
  SRAM   (RW)  : ORIGIN = 0xc0000000, LENGTH = 4M /* TODO: read from config */
}

SECTIONS {
  . = ORIGIN(SCALAR);
  .text           : { *(.text .text.*) }
  . = ALIGN(0x1000);

  .data           : { *(.data .data.*) }
  . = ALIGN(0x1000);

  .sdata          : { *(.sdata .sdata.*) }
  . = ALIGN(0x1000);

  .srodata          : { *(.srodata .srodata.*) }
  . = ALIGN(0x1000);

  .bss            : { *(.bss .bss.*) }
  _end = .; PROVIDE (end = .);

  . = ORIGIN(SRAM);
  .vdata : { *(.vdata .vdata.*) } >SRAM

  .vbss (TYPE = SHT_NOBITS) : { *(.vbss .vbss.*) } >SRAM

  __stacktop = ORIGIN(SCALAR) + LENGTH(SCALAR);  /* put stack on the top of SCALAR */
  __heapbegin = ORIGIN(DDR);  /* put heap on the begin of DDR */
}

// nolibc.h
...
#ifdef CUSTOM_HOST

#define IFHOSTED(...)
#define EXIT_FAILURE 1
#define EXIT_SUCCESS 0

/* customize me */

// output to t1 uart
static void
memwrite(void const *ptr, size_t len) {
    struct uartlite_regs {
        unsigned int rx_fifo;
        unsigned int tx_fifo;
        unsigned int status;
        unsigned int control;
    };
    volatile struct uartlite_regs *const ttyUL0 = (volatile struct uartlite_regs *)0x10000000;
    unsigned char const *p = ptr; /* keep the const qualifier of ptr */
    while (len--) {
        while (ttyUL0->status & (1<<3)); /* spin while the TX FIFO is full */
        ttyUL0->tx_fifo = *p++;
    }
}

// static size_t /* only needed for vector-utf/bench.c */
// memread(void *ptr, size_t len) { }

static void
exit(int x) { __asm volatile("unimp\n"); }

int main(void);
void nolibc_start(void) {
    (void)main(); /* return value is not used */
    flush();
}

#elif __STDC_HOSTED__
...
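
If you adapt memwrite to another board, one option is to pass the register block in as a parameter so the loop can be tried on a host against a fake struct first. This is only a sketch: uart_write and UARTLITE_TX_FULL are hypothetical names, and the layout assumes a Xilinx UART Lite style register block like the t1 one above.

```c
#include <stddef.h>

/* Register layout copied from the memwrite example above. */
struct uartlite_regs {
    unsigned int rx_fifo;
    unsigned int tx_fifo;
    unsigned int status;
    unsigned int control;
};

#define UARTLITE_TX_FULL (1u << 3) /* assumed: status bit 3 = TX FIFO full */

/* Hypothetical variant of memwrite that takes the UART as an argument. */
static void
uart_write(volatile struct uartlite_regs *uart, void const *ptr, size_t len)
{
    unsigned char const *p = ptr;
    while (len--) {
        while (uart->status & UARTLITE_TX_FULL)
            ; /* wait until the FIFO has room */
        uart->tx_fifo = *p++;
    }
}
```

On the target you would call it with the real base address, e.g. uart_write((volatile struct uartlite_regs *)0x10000000, ptr, len) for t1.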

zhongjuzhe commented 6 months ago

Is it possible to disable the FP16 vector test cases?

camel-cdr commented 6 months ago

Yes, they shouldn't be enabled by default.

rvv/config.h should exclude them with the mask by default, but maybe I've missed something. Can you share after which instruction you get an illegal instruction/where the problem is?