camel-cdr / rvv-bench-results

A collection of RISC-V Vector (RVV) benchmarks to help developers write portably performant RVV code. (Results)

Result for Banana Pi BPI-F3 #1

Open · Glavo opened this issue 1 month ago

Glavo commented 1 month ago
        #####           glavo@k1
       #######          --------
       ##O#O##          OS: Bianbu 1.0rc1 riscv64
       #######          Host: spacemit k1-x deb1 board
     ###########        Kernel: 6.1.15
    #############       Uptime: 1 min
   ###############      Packages: 782 (dpkg)
   ################     Shell: fish 3.6.1
  #################     Terminal: /dev/pts/0
#####################   CPU: Spacemit X60 (8) @ 1.600GHz
#####################   Memory: 187MiB / 3807MiB
  #################

out.json

I noticed that the results are weird. Does anyone know what could be the reason for this?

camel-cdr commented 1 month ago

Looks like the kernel doesn't expose rdcycle to user space. I think that was changed in recent kernels, and I have to look into how best to access it via the perf API. Thanks for your help. I'll get my BPI-F3 in a few days, and I also need to fix the instruction cycle count benchmark, as I've learned that processors that don't predict vl have a dependency on the destination register.
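One way to check whether user-space rdcycle is available on a given board is to execute it and catch SIGILL. The sketch below is a diagnostic only, not how rvv-bench itself reads the counter; it assumes GCC/Clang inline assembly and Linux signal semantics. Because it is not certain that disabled access always traps, it also treats a counter that does not advance as unusable. The function name `rdcycle_usable` is hypothetical.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>

#if defined(__riscv)
static sigjmp_buf probe_env;

/* SIGILL handler: jump back into rdcycle_usable() instead of crashing. */
static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}
#endif

/* Returns 1 if rdcycle executes from user space and the counter advances,
   0 if it raises SIGILL or does not advance, and -1 when not built for
   RISC-V or when the signal handler cannot be installed. */
int rdcycle_usable(void)
{
#if defined(__riscv)
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return -1;

    volatile int usable = 0;
    /* savemask = 1 so the signal mask is restored after the long jump. */
    if (sigsetjmp(probe_env, 1) == 0) {
        unsigned long c0, c1;
        __asm__ volatile("rdcycle %0" : "=r"(c0));
        for (volatile int i = 0; i < 1000; i++)
            ; /* burn a few cycles between the two reads */
        __asm__ volatile("rdcycle %0" : "=r"(c1));
        usable = (c1 != c0);
    }
    sigaction(SIGILL, &old, NULL); /* restore the previous handler */
    return usable;
#else
    return -1;
#endif
}
```

On a RISC-V board, a `main` could simply print `rdcycle_usable()` before starting a benchmark run.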

camel-cdr commented 1 month ago

@Glavo My BPI-F3 actually arrived today, so I was able to test a few things.

Apparently the kernel disabled user-space rdcycle access, but since kernel version 6.5 you can re-enable it using the perf_user_access sysctl, see: https://lwn.net/Articles/939436/
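The sysctl lives at /proc/sys/kernel/perf_user_access, so `sysctl kernel.perf_user_access=2` (as root) and writing to that file are equivalent. Per the kernel's sysctl documentation, value 2 selects the legacy mode with direct access to the cycle CSR; treat that value as an assumption to verify against your kernel version. A minimal C sketch of reading and writing such a file (the helper names are hypothetical):

```c
#include <stdio.h>

/* sysctl file discussed in the LWN article (kernel >= 6.5 on RISC-V). */
#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"

/* Read one integer from a sysctl-style file; returns 0 on success, -1 on error. */
int read_sysctl_int(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%d", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Write one integer to a sysctl-style file (needs root for /proc/sys).
   Errors from /proc writes may only show up when the stream is flushed,
   so the result of fclose() is checked as well. */
int write_sysctl_int(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, `write_sysctl_int(PERF_USER_ACCESS_PATH, 2)` would switch to the (assumed) legacy mode, and `read_sysctl_int` confirms the current setting.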

The BPI-F3 image, however, uses an older kernel. On that kernel, you can enable rdcycle access by enabling the PERF_COUNT_HW_CPU_CYCLES perf event (see SO post).
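Choosing between the two methods comes down to the kernel version. A small sketch of that decision, assuming the release string from `uname()` starts with `major.minor` (as in the `6.1.15` shown in the neofetch output above); both function names are hypothetical:

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if a release string such as "6.1.15" or "6.5.0-rc1" is at least
   major.minor, 0 if it is older, and -1 if it cannot be parsed. */
int kernel_at_least(const char *release, int major, int minor)
{
    int maj, min;
    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    return (maj > major || (maj == major && min >= minor)) ? 1 : 0;
}

/* 1: running kernel >= 6.5, so the perf_user_access sysctl applies;
   0: older kernel, so open a PERF_COUNT_HW_CPU_CYCLES perf event instead;
   -1: version unknown. */
int use_perf_user_access_sysctl(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return kernel_at_least(u.release, 6, 5);
}
```

Parsing only the first two numbers keeps suffixes such as `-rc1` or vendor tags from affecting the comparison.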

Using the perf event API directly would probably be cleaner; however, I need to support bare metal as well, so I think I'll keep the code as it is for now and provide instructions on how to run it on different kernel versions.

For kernel versions older than 6.5, I'll add a small utility program that starts a process with user-space rdcycle enabled, via the perf_event_attr.inherit flag.
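A minimal sketch of such a launcher, assuming (as described above) that an open PERF_COUNT_HW_CPU_CYCLES event is what enables user-space rdcycle on these older kernels, and that `inherit = 1` carries the event over to the child. The function names are hypothetical; this is not the utility that ships with rvv-bench.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fill in an attribute for a CPU-cycle counter that child processes inherit. */
void init_cycle_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->inherit = 1; /* children forked after this point get the event too */
}

/* Open the counter for the calling process on any CPU; returns an fd or -1.
   glibc has no wrapper for perf_event_open, hence the raw syscall. */
int open_cycle_counter(void)
{
    struct perf_event_attr attr;
    init_cycle_attr(&attr);
    return (int)syscall(SYS_perf_event_open, &attr, 0 /* this process */,
                        -1 /* any CPU */, -1 /* no group */, 0UL);
}

/* Open the counter, then run argv[0] (searched in PATH) while it is active.
   Returns the child's exit code, or -1 on error. */
int run_with_cycle_counter(char *const argv[])
{
    int fd = open_cycle_counter();
    if (fd < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }
    int status;
    pid_t w = waitpid(pid, &status, 0);
    close(fd);
    if (w < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A `main` for such a tool would call `run_with_cycle_counter(&argv[1])`, so that something like `./with-cycles ./bench` (a hypothetical binary name) runs the benchmark with the counter open.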

I still need to rewrite the instruction cycle count benchmark; once that's done, I'll upload the measurements. The performance looks quite good so far.
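For readers unfamiliar with instruction cycle count benchmarks, here is a generic harness sketch (not rvv-bench's actual code). `read_counter` stands in for rdcycle or a perf counter, and `body` runs the instruction sequence under test `iters` times. Taking the minimum over several samples is a common choice, because interrupts and cache misses can only add cycles to a sample, never remove them.

```c
#include <stdint.h>

/* Returns the smallest observed counter delta per iteration over `samples`
   runs, or -1.0 if samples == 0. `iters` must be greater than zero. */
double min_cycles_per_iter(uint64_t (*read_counter)(void),
                           void (*body)(unsigned iters),
                           unsigned samples, unsigned iters)
{
    double best = -1.0;
    for (unsigned s = 0; s < samples; s++) {
        uint64_t start = read_counter();
        body(iters);
        uint64_t end = read_counter();
        double per_iter = (double)(end - start) / iters;
        if (best < 0.0 || per_iter < best)
            best = per_iter;
    }
    return best;
}
```

On real hardware, `body` would be an unrolled loop of the instruction being measured, and the result divided by the unroll factor.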

MarekPikula commented 1 month ago

Another option is to disable PMU handling in the kernel altogether. I'm currently testing PULP Ara on an FPGA, and I had to disable CONFIG_RISCV_PMU in the kernel. The kernel then doesn't "own" the PMU, so applications can issue rdcycle directly.
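To check how a given kernel was built, one could scan its config for the option. A minimal sketch, assuming the config is available as a plain-text file; Debian-style distributions often install it as /boot/config-&lt;release&gt;, but that path is an assumption, and kernels that only expose /proc/config.gz would need it decompressed first. The helper name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `option` is set to y or m in a kernel .config-style file,
   0 if it is absent or "is not set", and -1 if the file can't be read. */
int kconfig_enabled(const char *config_path, const char *option)
{
    FILE *f = fopen(config_path, "r");
    if (!f)
        return -1;
    char line[512];
    size_t n = strlen(option);
    int result = 0;
    while (fgets(line, sizeof line, f)) {
        /* Match "CONFIG_FOO=y" or "CONFIG_FOO=m"; requiring '=' right after
           the name keeps CONFIG_FOO from matching CONFIG_FOO_BAR. */
        if (strncmp(line, option, n) == 0 && line[n] == '=' &&
            (line[n + 1] == 'y' || line[n + 1] == 'm')) {
            result = 1;
            break;
        }
    }
    fclose(f);
    return result;
}
```

For instance, `kconfig_enabled("/boot/config-6.1.15", "CONFIG_RISCV_PMU")` returning 0 would indicate a kernel built without the PMU driver.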

I think you might also need to disable the PMU handler in OpenSBI, as it might disable the cycle counter by default (I think that happened to me, but I don't have enough time to reproduce it).

This, of course, prevents you from using perf elsewhere, but for running the benchmark alone it shouldn't be a problem.

camel-cdr commented 1 month ago

@MarekPikula The README now has an overview of how to enable the counters on different kernel versions, but that could be another method.

Does Ara work for you? I had a lot of trouble with it when I tried it. I've been following the code changes since, or rather the lack thereof. From what I can tell, this hasn't been fixed yet, but it may also only occur on Verilator.

Also: How big an FPGA is needed to run it?

MarekPikula commented 1 month ago

Yeah, I tried the ENABLE_RDCYCLE_HACK approach, but it didn't work (i.e., it crashed with a kernel error; I should have a log somewhere, but I can't find it now). I'm running Ara under FireSim with a basic Buildroot [FireMarshal](https://github.com/firesim/FireMarshal) image with a 6.2 kernel. There's no reason not to upgrade to something newer, as there are no custom patches (besides two out-of-tree modules for the block device and network), but I wanted to have as few moving parts as possible for the initial tests.

Regarding issues with Ara: indeed, it seems somewhat buggy. I tried to run the rvv-bench tests on it, but after a few failed benchmarks (either a freeze or an illegal instruction error), I gave up. Right now, I'm running an instruction test to get at least a glimpse of the cycle performance of different instructions. Even on an FPGA, it runs rather slowly (80 MHz is the fmax in my configuration), so maybe I'll have some results tomorrow. Once I have anything of value, I'll open a PR with the results so far.

I'm running it on an AWS EC2 F1 instance with FireSim (so a Xilinx VU9P), and the complete design (including the AWS wrapper and FireSim components) takes 31% of LUTs, 12% of FFs, 19% of RAMB36, 5% of URAMs and 2% of DSP blocks, so it's not that bad 😛 But, granted, it's a pretty beefy FPGA. I have it in the default configuration, with 2 lanes and VLEN=2048 (so a 64-bit AXI bus, with no need for width conversion and such), but I'm planning to try building it in some other configurations as well.
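For context on the VLEN=2048 figure: in RVV, the number of elements a register group holds is VLMAX = LMUL × VLEN / SEW, so with VLEN=2048 a single register (LMUL=1) holds 64 32-bit elements. A hypothetical helper, not taken from Ara or rvv-bench, with LMUL passed as a fraction so that fractional settings such as mf2 also work:

```c
/* VLMAX = LMUL * VLEN / SEW, with LMUL = lmul_num / lmul_den
   (e.g. 8/1 for m8, 1/2 for mf2). Inputs are assumed to be a valid
   RVV configuration; integer division matches the spec's exact result. */
unsigned vlmax(unsigned vlen, unsigned sew, unsigned lmul_num, unsigned lmul_den)
{
    return (vlen * lmul_num) / (sew * lmul_den);
}
```

With `vlmax(2048, 8, 8, 1)`, a single vsetvli on this configuration can cover 2048 bytes at e8/m8.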

Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).

camel-cdr commented 1 month ago

@MarekPikula

> Yeah, I tried the ENABLE_RDCYCLE_HACK approach, but it didn't work

Interesting, I'll add your option to the README.

> Once I have anything of value, I'll open a PR with results so far. Right now, I'm running an instruction test to have at least a glimpse into the cycle performance of different instructions

Sounds like it runs for you now, but if it doesn't, try commenting out the call randomize line in rvv/main.S; that seemed to help me simulate on XiangShan, although it was too slow to do a full run.

> Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).

Oh, great, I guess we'll meet then. I'll also present a poster, coincidentally right next to yours: "Accelerating Unicode Conversions using the RISC-V Vector Extension". So we are poster buddies ^w^

mp-17 commented 1 week ago

Hello @MarekPikula and @camel-cdr,

I am now dedicating some time every week to fixing issues in Ara. If you want, we can schedule a brief call to discuss them. Let me know if you are interested :-)