Zhao-Dongyu / sgemm_riscv

This project records the process of optimizing SGEMM (single-precision floating-point General Matrix Multiplication) on the RISC-V platform.
MIT License

Measurements from C920 core #1

Open camel-cdr opened 6 months ago

camel-cdr commented 6 months ago

Hi, I just read your lovely article.

I tried running the measurements on the more powerful C920 cores, and everything worked out of the box. :+1:

Here are the results:

[Image: result]

[Attachment: sgemm_riscv.csv]

Zhao-Dongyu commented 6 months ago

I haven’t verified it on other platforms yet, so happy to see your results!

There are still some optimizations that can be done, and we look forward to higher performance!

Javipove commented 2 months ago

Which system and hardware were used for this testing? I imagine the SG2042 running Fedora? Thanks for sharing the csv.

camel-cdr commented 2 months ago

@Javipove I ran it on the SG2042 server from perfXlab. I'm not sure which exact distro was on the system, but iirc it was Debian-based.

Javipove commented 2 months ago

@Zhao-Dongyu Hello, I am also trying to run the bandwidth test for the floating-point and vector versions on the SG2042, but I don't really understand how the calculation is done. Where does the number that you subtract from the result of the test program come from? Thanks (and sorry if it's a dumb question)

Zhao-Dongyu commented 2 months ago

This is a good question. I did not explain it in detail in the article or in the code, because this part of the code is rather messy...

The memory bandwidth test is explained here: https://github.com/Zhao-Dongyu/sgemm_riscv/tree/main/prepare (Sorry, there is no English version of that page. It is written mostly in Chinese, which makes it hard for you to read.)

For example, when I tested with the flw method, the result was: flw: 4000 MB / (2678.298 − 1171.154) ms = 2.592 GB/s. You must be wondering where the number 1171.154 comes from.

[Screenshot 2024-06-11, 10:26:34 AM]

In this file, you can see that the test code contains not only flw instructions but also addi, slti, and bnez instructions. I don't think those instructions do any memory-transfer work, so I commented out all the flw instructions in the assembly code, ran it again, and got 1171.154 ms. To be honest, I am not sure whether this calculation method is correct or whether it matches how the hardware actually behaves. If you have a better answer, please let me know.
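As a rough sketch of the arithmetic described above, the calculation can be reproduced in a few lines of Python. The function name `net_bandwidth` and its parameters are hypothetical and not taken from the repository. The sketch also assumes 1 GB = 1024 MB, since that is the convention under which the quoted 2.592 GB/s comes out; with 1 GB = 1000 MB the same timings give about 2.654 GB/s.

```python
def net_bandwidth(data_mb, total_ms, baseline_ms, mb_per_gb=1024):
    """Bandwidth in GB/s after subtracting the loop-only baseline time.

    total_ms    : run time of the loop with the flw loads
    baseline_ms : run time of the same loop with the flw loads commented out
    mb_per_gb   : assumption, 1024 reproduces the figure quoted in the thread
    """
    net_ms = total_ms - baseline_ms
    if net_ms <= 0:
        raise ValueError("baseline must be shorter than the full run")
    mb_per_s = data_mb / (net_ms / 1000.0)  # convert ms to s
    return mb_per_s / mb_per_gb


# Numbers from the flw example above
flw_gbps = net_bandwidth(4000, 2678.298, 1171.154)
print(f"flw: {flw_gbps:.3f} GB/s")
```

The subtraction treats the time spent in addi, slti, and bnez as strictly additive to the time spent in the loads. On a core that can overlap loop-control instructions with outstanding loads, that may not hold exactly, so the result is probably best read as an approximation rather than an exact bandwidth.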