OpenXiangShan / XiangShan

rvv-bench: XiangShan performance problems #3200

Open · camel-cdr opened this issue 1 month ago

camel-cdr commented 1 month ago

Before start

Describe your problem

XiangShan performs unexpectedly badly in some of the cases described below.

What did you do before

There isn't much I could do; see Additional context for the full details.

Environment

Additional context

Hi, I've finally got most of the code from my benchmark to run on the XiangShan RTL simulation.

While the overall performance is promising, XiangShan is quite slow compared to other processors in some of the benchmarks.

You can view the results here and compare them to the XuanTie C910 here.

The benchmarks that didn't run aren't included in the results; I'll try to create separate issues for those once I've looked at them in more detail. Build instructions are on the benchmark page. I built the DefaultConfig with DRAMsim3 from the master branch as of 2024-07-13.

Note for future readers: once the website updates, you can still find the older results under this commit.

Performance comparison to C910

Let's start with the good results: in the byteswap, LUT4, and *ascii to utf16/utf32 benchmarks, XiangShan cleanly outperforms the C910 in both scalar and vector code, as would be expected.

*On ascii to utf16/utf32, the segmented load/store implementation is a lot slower than on the C910, but AFAIK complex loads/stores aren't optimized on XiangShan yet.
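
For reference, here is a minimal sketch of what the segmented variant does, written from the description above rather than taken from rvv-bench: ASCII is widened to UTF-16LE by storing each byte as a two-field segment of {ascii, 0x00} with vsseg2e8.v.

```asm
# Sketch only, not the exact rvv-bench code.
# a0 = src (ASCII), a1 = dst (UTF-16LE), a2 = number of characters
ascii_to_utf16_seg:
1:  vsetvli    t0, a2, e8, m1, ta, ma
    vle8.v     v8, (a0)              # load up to vl ASCII bytes
    vmv.v.i    v9, 0                 # high bytes are all zero for ASCII
    vsseg2e8.v v8, (a1)              # interleave v8/v9 -> UTF-16LE code units
    add        a0, a0, t0
    sub        a2, a2, t0
    slli       t1, t0, 1             # two output bytes per character
    add        a1, a1, t1
    bnez       a2, 1b
    ret
```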

memcpy and memset are slow for LMUL<8

For memset, the fastest RVV implementation on XiangShan is about 2x faster than the fastest one for the C910. On memcpy the fastest XiangShan RVV implementation is actually a bit slower than the fastest C910 implementation.

Note: you can toggle individual results by clicking on their entries in the graph legends.

However, XiangShan performs very badly with smaller LMUL, on both memcpy and memset. LMUL=1 memcpy (rvv_m1) is 5x slower on XiangShan than on the C910, and LMUL=1 memset is ~1.8x slower.

Compare the memset rvv_m1 and rvv_tail_m1 implementations, and notice that rvv_tail_m1 matches the optimal performance of rvv_m8. rvv_m1 is just a simple, non-unrolled LMUL=1 vse8.v strip-mining loop; rvv_tail_m1 is equivalent, but moves the vsetvli outside the loop and only operates on vlmax inside it (see the sketch below). The performance difference indicates that XiangShan currently handles vsetvli instructions very inefficiently.
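
To make the difference concrete, here are minimal sketches of the two variants, written from the description above rather than copied from rvv-bench:

```asm
# a0 = dst, a1 = fill byte, a2 = byte count

# rvv_m1 style: plain strip-mining loop, one vsetvli per iteration
memset_rvv_m1:
    vsetvli t0, x0, e8, m1, ta, ma    # vl = VLMAX, just for the broadcast
    vmv.v.x v8, a1
1:  vsetvli t0, a2, e8, m1, ta, ma    # recompute vl from the remaining count
    vse8.v  v8, (a0)
    add     a0, a0, t0
    sub     a2, a2, t0
    bnez    a2, 1b
    ret

# rvv_tail_m1 style: vsetvli hoisted out, the body always stores VLMAX bytes,
# a single trailing vsetvli handles the sub-VLMAX tail
memset_rvv_tail_m1:
    vsetvli t0, x0, e8, m1, ta, ma    # vl = VLMAX, once
    vmv.v.x v8, a1
    bltu    a2, t0, 2f                # less than one full chunk left?
1:  vse8.v  v8, (a0)                  # full-VLMAX store, no vsetvli in the loop
    add     a0, a0, t0
    sub     a2, a2, t0
    bgeu    a2, t0, 1b
2:  vsetvli zero, a2, e8, m1, ta, ma  # tail (vl may be 0, then the store is a no-op)
    vse8.v  v8, (a0)
    ret
```

The only real difference is where the vsetvli sits, so the large gap between the two points directly at vsetvli handling.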

strlen and utf8 count: anything involving masks is slow

I'm not sure why the RVV strlen implementations, even the one that isn't using vle8ff.v, are slower than a SWAR (musl) implementation. Both RVV implementations are about 2.5x slower than on the C910.
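
For context, this is roughly what the vle8ff.v-based strlen loop looks like (a sketch written from the description, not the exact rvv-bench code):

```asm
# a0 = pointer to a NUL-terminated string, returns the length in a0
strlen_rvv_ff:
    mv       a1, a0                   # remember the start
1:  vsetvli  t0, x0, e8, m8, ta, ma   # request VLMAX bytes
    vle8ff.v v8, (a0)                 # fault-only-first: vl may shrink at a page boundary
    csrr     t0, vl                   # bytes actually loaded
    vmseq.vi v0, v8, 0                # mask of NUL bytes
    vfirst.m t1, v0                   # index of the first NUL, -1 if none
    bgez     t1, 2f
    add      a0, a0, t0
    j        1b
2:  add      a0, a0, t1
    sub      a0, a0, a1               # length = &first NUL - start
    ret
```

Every iteration produces a mask with vmseq.vi and immediately consumes it with vfirst.m, so slow mask handling would hit exactly this kind of loop.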

Similarly, in utf8 count the RVV implementation is surprisingly slow compared to the C910, which is >3x faster. This doesn't make much sense to me, since changing LMUL, unrolling the loop, or moving the vsetvli outside the loop doesn't impact performance at all, which is the opposite of what I observed in memset/memcpy. The only difference I can see that could explain the performance problem is that both strlen and utf8 count operate on vector masks. Maybe that introduces a weird dependency in XiangShan?
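
For reference, the usual mask-based way to count UTF-8 characters is to count the non-continuation bytes; a minimal sketch (not the exact rvv-bench code):

```asm
# a0 = buffer, a1 = length in bytes; returns the character count in a0
utf8_count_rvv:
    li       a2, 0                    # running total
    li       a3, -65                  # (int8_t)b > -65  <=>  b is not 0b10xxxxxx
1:  vsetvli  t0, a1, e8, m8, ta, ma
    vle8.v   v8, (a0)
    vmsgt.vx v0, v8, a3               # mask of character-start bytes
    vcpop.m  t1, v0                   # count them
    add      a2, a2, t1
    add      a0, a0, t0
    sub      a1, a1, t0
    bnez     a1, 1b
    mv       a0, a2
    ret
```

The loop body is essentially just a compare producing a mask and a vcpop.m consuming it, which would explain why LMUL and unrolling don't matter if the mask path is the bottleneck.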

no idea why these are slow, might be a mixture of the above

The C910 outperforms XiangShan in scalar code for the mergelines 2/3 benchmark, where 2/3 of the characters are detected and removed; for the cases where removal is less frequent, XiangShan performs better. On the vectorized code the C910 always beats XiangShan; since the code makes heavy use of masks, that is probably the explanation (the sketch below shows the general pattern).
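
The typical pattern such a detect-and-remove kernel relies on is mask + vcompress; an illustrative sketch (not the exact mergelines code), dropping every byte equal to a hypothetical scalar in a3:

```asm
# a0 = src, a1 = dst, a2 = length, a3 = byte value to remove
remove_char_rvv:
1:  vsetvli      t0, a2, e8, m2, ta, ma
    vle8.v       v8, (a0)
    vmsne.vx     v0, v8, a3           # mask: bytes to keep
    vcpop.m      t1, v0               # how many survive
    vcompress.vm v12, v8, v0          # pack the kept bytes to the front
    vsetvli      zero, t1, e8, m2, ta, ma
    vse8.v       v12, (a1)            # store only the survivors
    add          a1, a1, t1
    add          a0, a0, t0
    sub          a2, a2, t0
    bnez         a2, 1b
    ret
```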

For mandelbrot I again have no idea what's going on in scalar: XiangShan is almost 2x slower than the C910, and only slightly faster than the X60. The vectorized versions are also about 2x slower than the C910, and even slower than the in-order X60 with its VLEN of 256, although XiangShan should be beating both of those given its performance target. The inner loop uses multiple vsetvlis and a vector mask, which could again be the cause of the slow performance (see the sketch below).
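
For illustration, the mask-driven part of a vectorized mandelbrot inner loop typically looks like this (a sketch written from the description, not the rvv-bench kernel; the register assignments are assumptions):

```asm
# Assumes a prior vsetvli with e32, m2, ta, mu (mu so masked-off counters stay untouched).
# v8/v10 = z.re/z.im, v12/v14 = c.re/c.im, v16 = per-lane iteration counts,
# fa0 = 4.0, t2 = remaining max iterations
1:  vfmul.vv v20, v8, v8              # re*re
    vfmul.vv v22, v10, v10            # im*im
    vfadd.vv v24, v20, v22            # |z|^2
    vmflt.vf v0, v24, fa0             # mask: lanes still inside the escape radius
    vcpop.m  t0, v0
    beqz     t0, 2f                   # all lanes escaped -> done
    vadd.vi  v16, v16, 1, v0.t        # bump counters of the active lanes only
    vfmul.vv v26, v8, v10             # re*im (uses the old re/im)
    vfsub.vv v8, v20, v22             # new re = re*re - im*im ...
    vfadd.vv v8, v8, v12              # ... + c.re
    vfadd.vv v10, v26, v26            # 2*re*im ...
    vfadd.vv v10, v10, v14            # ... + c.im
    addi     t2, t2, -1
    bnez     t2, 1b
2:  # v16 now holds the per-lane iteration counts
```

Every iteration goes compare -> mask -> vcpop -> masked add, so a slow mask path hits this loop on top of whatever the real kernel pays for its vsetvlis.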

XiangShan outperforms the C910 on scalar poly1305 as expected; however, the vectorized implementation is once again about 2x slower than on the C910. Here, the hot loop uses neither vector masks nor vsetvli. It does use one vlseg4e32, but that should be overshadowed by the other vector operations.

Conclusion

Please take a look at the benchmark results yourself, and maybe reproduce them for further investigation.

I think that XiangShan currently has a big problem with handling vsetvli and operations on vector masks efficiently. This should be investigated, and once it's fixed it's probably best to redo the measurements, since it will have an impact on almost all vectorized implementations.

The two cases where the scalar code is slower are quite weird; the mandelbrot one especially should be investigated. I've attached the scalar assembly code for both, since I used a different compiler version to compile for the C910.

Anzooooo commented 1 month ago

@camel-cdr We appreciate your testing and the problems you've found. We will investigate the causes and optimize them as soon as possible. At present, we plan to modify the execution logic of the segment instructions and to solve the blocking problem of vsetvli instruction decoding, which should bring some performance improvement. Thank you again for your attention to and support of XiangShan. We will reply here when there is new progress.