Open Glavo opened 6 months ago
Looks like the kernel doesn't expose `rdcycle`. I think that was changed in recent kernels, and I have to look into how best to access it via the perf API.
Thanks for your help.
I'll get my BPI-F3 in a few days, and I also need to fix the instruction cycle count benchmark, as I've learned that processors that don't predict vl have a dependency on the destination register.
@Glavo My BPI-F3 actually arrived today, so I was able to test a few things.
Apparently the kernel disabled `rdcycle` userspace access, but since kernel version 6.5 you can re-enable it using the `perf_user_access` sysctl, see: https://lwn.net/Articles/939436/
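A minimal sketch of how a program could check the current `perf_user_access` setting before running the benchmark. It assumes the sysctl is exposed through procfs at `/proc/sys/kernel/perf_user_access`; the helper name `read_perf_user_access` is made up for illustration and is not part of rvv-bench.

```c
#include <stdio.h>

/* Assumed procfs path of the kernel.perf_user_access sysctl. */
#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"

/* Hypothetical helper: returns the integer stored in `path`, or -1 if the
 * file is missing or unreadable (for example on kernels older than 6.5). */
int read_perf_user_access(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;

    fclose(f);
    return mode;
}
```

Calling `read_perf_user_access(PERF_USER_ACCESS_PATH)` returns the current value. As far as I understand the kernel sysctl documentation, a value of 2 is the legacy mode that allows direct `rdcycle` from user space, and root can select it by writing 2 to that file; please verify this against the documentation for your kernel.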
The BPI-F3 image, however, is on an older kernel. On this kernel you can enable `rdcycle` access by enabling the `PERF_COUNT_HW_CPU_CYCLES` perf event (see SO post).
Using the perf event API directly would probably be cleaner, but I need to support bare metal as well, so I think I'll keep the code as it is for now and provide instructions on how to run it on different kernel versions.
For kernel versions <6.5, I'll add a small utility program that can be used to start a process with user-space `rdcycle` enabled, via the `perf_event_attr.inherit` flag.
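A rough sketch of what such a launcher could look like, not the actual utility. It opens a `PERF_COUNT_HW_CPU_CYCLES` event on the calling process with `inherit` set, then runs the target command as a child so that the child inherits the counter. The function names `cycle_counter_attr` and `run_with_cycle_counter` are hypothetical, and the `exclude_kernel`/`exclude_hv` settings are my assumption.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a CPU cycle counter that child processes inherit. */
struct perf_event_attr cycle_counter_attr(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;       /* start disabled, enable explicitly below */
    attr.exclude_kernel = 1; /* assumption: count user space only */
    attr.exclude_hv = 1;
    attr.inherit = 1;        /* children created later inherit the event */
    return attr;
}

/* Open and enable the counter on this process, then run argv as a child.
 * Returns the child's exit status, or -1 on any failure. */
int run_with_cycle_counter(char *const argv[])
{
    struct perf_event_attr attr = cycle_counter_attr();

    /* pid = 0: this process, cpu = -1: any CPU, no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    int status = -1;
    if (pid > 0)
        waitpid(pid, &status, 0);

    close(fd);
    return (pid > 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

A `main` that passes `argv + 1` to `run_with_cycle_counter` would turn this into a wrapper that takes the benchmark command line as its arguments. Whether opening the event is permitted depends on the system's perf settings, so the call can fail with -1.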
I still need to rewrite the instruction cycle count benchmark; once that's done, I'll upload the measurements. The performance looks quite good so far.
Another option is to disable PMU handling in the kernel altogether. I'm currently testing PULP Ara on FPGA, and I had to disable `CONFIG_RISCV_PMU` in the kernel. Then the kernel doesn't "own" the PMU, which lets applications issue `rdcycle` directly.
I think that you might also need to disable the PMU handler in OpenSBI, as it might disable the cycle counter by default (I think it happened to me, but I don't have enough time to reproduce it).
This, of course, prevents you from accessing perf in other places, but to run the benchmark alone, it shouldn't be a problem.
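For context, issuing `rdcycle` directly from an application can look roughly like the sketch below. The function name `read_cycles` is illustrative. The non-RISC-V branch is only there so the snippet compiles on other machines, and it returns processor time, not cycles.

```c
#include <stdint.h>
#include <time.h>

/* Read the cycle counter directly. On RV64 this executes rdcycle, which
 * traps with an illegal instruction if user-space access is not enabled. */
uint64_t read_cycles(void)
{
#if defined(__riscv) && __riscv_xlen == 64
    uint64_t cycles;
    __asm__ volatile("rdcycle %0" : "=r"(cycles));
    return cycles;
#else
    /* Fallback for other targets (including RV32, which would also need
     * rdcycleh): processor time from clock(), NOT a cycle count. */
    return (uint64_t)clock();
#endif
}
```

A benchmark would call `read_cycles()` before and after the code under test and subtract the two values.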
@MarekPikula The README now has an overview of how to enable the counters on different kernel versions, but that could be another method.
Does Ara work for you? I had a lot of trouble with it when I tried it. I've been following the code changes since, or rather the lack thereof. From what I can tell this hasn't been fixed yet, but it may also only occur on Verilator.
Also: How big of an FPGA is needed to run it?
Yeah, I tried the `ENABLE_RDCYCLE_HACK` approach, but it didn't work (i.e., it crashed with a kernel error – I should have a log somewhere, but I can't find it now). I'm running Ara under FireSim with a basic Buildroot [FireMarshal](https://github.com/firesim/FireMarshal) image with a 6.2 kernel. There's no reason not to upgrade to something newer, as there are no custom patches (besides two out-of-tree modules for block device and network), but I wanted to have as few moving parts as possible for the initial tests.
Regarding issues with Ara, indeed, it seems somewhat buggy. I tried to run rvv-bench tests on it, but after a few failed benches (either freeze or illegal instruction error), I let go. Right now, I'm running an instruction test to have at least a glimpse into the cycle performance of different instructions. Even on FPGA, it's running rather slowly (80 MHz is the fmax in my configuration), so maybe I'll have some results tomorrow. Once I have anything of value, I'll open a PR with results so far.
I'm running it on an AWS EC2 F1 instance with FireSim (so Xilinx VU9P), and the complete design (including the AWS wrapper and FireSim stuff) takes 31% of LUTs, 12% of FFs, 19% of RAMB36, 5% of URAMs and 2% of DSP blocks, so it's not that bad :stuck_out_tongue: But, granted, it's a pretty beefy FPGA. I have it configured in the most default, 2-lane, 2048 VLEN configuration (so 64b AXI, with no need for width conversion and such), but I'm planning to try to build it in some other configurations as well.
Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).
@MarekPikula
> Yeah, I tried the `ENABLE_RDCYCLE_HACK` approach, but it didn't work
Interesting, I'll add your option to the README.
> Once I have anything of value, I'll open a PR with results so far. Right now, I'm running an instruction test to have at least a glimpse into the cycle performance of different instructions
Sounds like it runs for you now, but if it doesn't, try commenting out the `randomize` call in rvv/main.S. That seemed to help me simulate on XiangShan, although it was too slow to do a full run.
> Besides, I'll be presenting a poster about this project at the upcoming RISC-V Summit Europe this month (title: Accelerating software development for emerging ISA extensions with cloud-based FPGAs: RVV case study).
Oh, great, I guess we'll meet then. I'll also present a poster, coincidentally right next to yours: "Accelerating Unicode Conversions using the RISC-V Vector Extension". So we are poster buddies ^w^
Hello @MarekPikula and @camel-cdr,
I am now dedicating some time every week to fixing issues in Ara. If you want, we can schedule a brief call to discuss them. Let me know if you are interested :-)
Hi @mp-17, sorry for the late reply. You can find my poster from RV Summit and benchmark results here: https://github.com/MarekPikula/RISC-V-Summit-Europe-2024
This week, I'm planning to revisit my setup, rebase onto the latest Ara sources, and see what has changed. I'll keep you posted :smiley:
BTW, I'm coming to this year's ORConf, and this time, I will give a full talk on the same topic as on RV Summit, but hopefully, this time with better results :wink:
@MarekPikula
BTW, I got RVV on XiangShan working on a specific commit: https://github.com/OpenXiangShan/XiangShan/issues/3200
AFAIK the vsetvli performance should be better now, but I couldn't test it, because the simulation started hanging again.
There is also now another open source RVV implementation: https://github.com/ucb-bar/saturn-vectors
Last time I tried, most things worked but some didn't (e.g. strlen). I still have to report that, but I'm quite occupied this month.
out.json
I noticed the results are weird. Does anyone know what could be the reason for this?