wangpc-pp opened 7 months ago
I'll look into it; this could be a new load/store benchmark under the `instructions` folder. I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.
The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? After all, it's repeatedly accessing the same few addresses.
IIRC, you could adjust the prefetch mode in the C920, so the C908 might support that as well.
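If the C908 inherits T-Head's custom control registers, that adjustment would go through the MHINT CSR. A minimal sketch, assuming the C906/C910 layout carries over (MHINT at CSR address 0x7c5 with the D-cache prefetch enable in bit 2, per the XuanTie C906 manual; verify against the C908 documentation before relying on it):

```c
// Hedged sketch: toggle the D-cache hardware prefetcher on a XuanTie core.
// Assumptions (double-check against the C908 manual):
//   - the custom MHINT CSR sits at address 0x7c5, as on the C906/C910
//   - bit 2 (DPLD) is the D-cache prefetch enable
// MHINT is an M-mode CSR, so this has to run in machine mode (e.g. patched
// into OpenSBI), not from a Linux user process.
static inline void set_dcache_prefetch(int enable) {
    unsigned long mhint;
    __asm__ volatile("csrr %0, 0x7c5" : "=r"(mhint));
    if (enable)
        mhint |= 1UL << 2;    // set DPLD
    else
        mhint &= ~(1UL << 2); // clear DPLD
    __asm__ volatile("csrw 0x7c5, %0" : : "r"(mhint));
}
```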
> The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? After all, it's repeatedly accessing the same few addresses.
Currently, this is just a guess (the L1D cache misses increase a lot), and I have sent feedback to T-Head.
Just found an issue on the K230 while doing some auto-vectorization tests with https://github.com/UoB-HPC/TSVC_2: the vectorized s1115 kernel performs poorly. For context, the s1115 source loop in TSVC_2 looks roughly like this (quoted from memory, so check the repository for the exact code):
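```c
// s1115 from TSVC_2 (tsc.c), quoted from memory. The cc[j][i] access walks
// cc column-wise, so the vectorizer turns it into a strided load.
for (int i = 0; i < LEN_2D; i++) {
    for (int j = 0; j < LEN_2D; j++) {
        aa[i][j] = aa[i][j] * cc[j][i] + bb[i][j];
    }
}
```

With `LEN_2D = 256` and 4-byte `real_t` (the TSVC_2 defaults, as far as I remember), that column-wise access has a 256 * 4 = 1024-byte stride.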
It seems that strided loads/stores with strides in the [1024, 4096] byte range perform noticeably worse. A simple probe code shows the same behavior.
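As a sketch of what such a probe can look like (my own illustration, not the exact code from the report; the original presumably used RVV strided loads such as `vlse8.v`, while this scalar C version only reproduces the access pattern): it repeatedly loads N bytes spaced `stride` bytes apart, so the working set stays at N cache lines no matter the stride, and only the access pattern changes between runs.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 16, PASSES = 1000000 };

int main(void) {
    const size_t max_stride = 1 << 16;
    volatile uint8_t *buf = calloc(N, max_stride); // 1 MiB backing buffer
    if (!buf)
        return 1;

    for (size_t stride = 64; stride <= max_stride; stride *= 2) {
        uint8_t sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long p = 0; p < PASSES; p++)
            for (size_t i = 0; i < N; i++)
                sink ^= buf[i * stride]; // the strided access under test
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("stride %6zu B: %6.2f ns/pass (sink=%u)\n",
               stride, ns / PASSES, (unsigned)sink);
    }
    free((void *)buf);
    return 0;
}
```

On the K230, going by the report, the ns/pass figure would be expected to jump for strides between 1024 and 4096 and drop again above that.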
It's weird that we get better performance when the stride is larger than 4096, so this issue may not be related to crossing cache lines or pages. It may be an issue with the hardware prefetcher.
So my request is: can we add some benchmarks for this kind of scenario?