MegEngine / MegPeak

Apache License 2.0

LoongArch: Add LoongArch support #29

Closed XiWeiGu closed 1 year ago

XiWeiGu commented 1 year ago

Obtained the following data on Loongson 3A5000:

there are 4 cores, currently use core id :0

bandwidth: 12.974849 Gbps
xvld throughput: 0.201758 ns 39.651489 GFlops latency: 0.200846 ns :
xvst throughput: 0.401607 ns 19.919973 GFlops latency: 0.401008 ns :
xvldx throughput: 0.401515 ns 19.924532 GFlops latency: 0.403114 ns :
xvstx throughput: 0.401397 ns 19.930397 GFlops latency: 0.401173 ns :
xvldrepl.b throughput: 0.200757 ns 39.849121 GFlops latency: 0.200332 ns :
xvldrepl.h throughput: 0.201017 ns 39.797703 GFlops latency: 0.200278 ns :
xvldrepl.w throughput: 0.200713 ns 39.857933 GFlops latency: 0.204638 ns :
xvldrepl.d throughput: 0.203766 ns 39.260719 GFlops latency: 0.200469 ns :
xvstelm.b throughput: 0.402616 ns 19.870066 GFlops latency: 0.401276 ns :
xvstelm.h throughput: 0.406103 ns 19.699427 GFlops latency: 0.401022 ns :
xvstelm.w throughput: 0.402913 ns 19.855394 GFlops latency: 0.402339 ns :
xvstelm.d throughput: 0.403451 ns 19.828911 GFlops latency: 0.401863 ns :
xvfmadd.d throughput: 0.200378 ns 39.924519 GFlops latency: 2.017237 ns :

Unfortunately, I haven't come up with a good method to measure the latency of memory access instructions.
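A note on reading the GFlops column in the results above: the figures are consistent with dividing a fixed number of operations per instruction by the measured throughput in nanoseconds. The sketch below checks that relationship for three rows copied from the results. The function names are hypothetical and this is an inference from the numbers, not a confirmed description of how MegPeak computes the column. For loads and stores the "GFlops" value is presumably just an element rate, not floating-point work.

```python
def implied_ops_per_instruction(throughput_ns: float, gflops: float) -> int:
    """Infer the per-instruction operation count behind a reported row.

    ns per instruction multiplied by operations per ns (GFlops) gives
    operations per instruction. Assumption: the tool uses an integer weight.
    """
    return round(throughput_ns * gflops)


def gflops_from_throughput(throughput_ns: float, ops_per_instruction: int) -> float:
    """Recompute GFlops from throughput under the assumed relationship."""
    return ops_per_instruction / throughput_ns


# (throughput in ns, reported GFlops), copied from the results above.
rows = {
    "xvld": (0.201758, 39.651489),
    "xvst": (0.401607, 19.919973),
    "xvfmadd.d": (0.200378, 39.924519),
}

for name, (ns, reported) in rows.items():
    ops = implied_ops_per_instruction(ns, reported)
    recomputed = gflops_from_throughput(ns, ops)
    print(f"{name}: implied ops/instr = {ops}, recomputed = {recomputed:.3f} GFlops")
```

All three rows come out at 8 operations per instruction, which matches eight 32-bit lanes in a 256-bit register for the memory instructions and four 64-bit lanes times two operations (multiply and add) for xvfmadd.d.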

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

chenqy4933 commented 1 year ago

👍

chenqy4933 commented 1 year ago

I don't have a LoongArch machine. Could you paste some MegPeak results? I'm very curious about its performance.

chenqy4933 commented 1 year ago

LGTM

XiWeiGu commented 1 year ago

Test environment

Architecture: loongarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Model name: Loongson-3A5000HV
CPU family: Loongson-64bit
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
BogoMIPS: 5000.00
Flags: cpucfg lam ual fpu lsx lasx complex crypto lvz lbt_x86 lbt_arm lbt_mips
Caches (sum of all):
  L1d: 256 KiB (4 instances)
  L1i: 256 KiB (4 instances)
  L2: 1 MiB (4 instances)
  L3: 16 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-3

Test results

there are 4 cores, currently use core id :0

bandwidth: 14.390625 Gbps
xvld throughput: 0.200999 ns 39.801094 GFlops latency: 0.200839 ns :
xvst throughput: 0.400933 ns 19.953476 GFlops latency: 0.755138 ns :
xvldx throughput: 0.400810 ns 19.959572 GFlops latency: 0.400648 ns :
xvstx throughput: 0.400989 ns 19.950676 GFlops latency: 0.401052 ns :
xvldrepl.b throughput: 0.200373 ns 39.925491 GFlops latency: 0.204004 ns :
xvldrepl.h throughput: 0.201367 ns 39.728470 GFlops latency: 0.200366 ns :
xvldrepl.w throughput: 0.220293 ns 36.315331 GFlops latency: 0.200159 ns :
xvldrepl.d throughput: 0.202074 ns 39.589371 GFlops latency: 0.200383 ns :
xvstelm.b throughput: 0.402302 ns 19.885559 GFlops latency: 0.401266 ns :
xvstelm.h throughput: 0.402501 ns 19.875740 GFlops latency: 0.401449 ns :
xvstelm.w throughput: 0.400831 ns 19.958517 GFlops latency: 0.436825 ns :
xvstelm.d throughput: 0.402018 ns 19.899622 GFlops latency: 0.400972 ns :
xvfadd.s throughput: 0.200156 ns 39.968861 GFlops latency: 2.009491 ns :
xvfadd.d throughput: 0.200160 ns 19.984056 GFlops latency: 2.016037 ns :
xvfsub.s throughput: 0.200219 ns 39.956272 GFlops latency: 2.009061 ns :
xvfsub.d throughput: 0.200226 ns 19.977432 GFlops latency: 2.009032 ns :
xvfadd.s throughput: 0.207230 ns 38.604401 GFlops latency: 2.009103 ns :
xvfadd.d throughput: 0.200218 ns 19.978233 GFlops latency: 2.010268 ns :
xvfmul.s throughput: 0.200244 ns 39.951244 GFlops latency: 2.016110 ns :
xvfmul.d throughput: 0.200240 ns 19.975992 GFlops latency: 2.008815 ns :
xvfdiv.s throughput: 2.416477 ns 3.310605 GFlops latency: 4.426337 ns :
xvfdiv.d throughput: 1.810069 ns 2.209861 GFlops latency: 3.217365 ns :
xvfmadd.s throughput: 0.200183 ns 79.926743 GFlops latency: 2.008960 ns :
xvfmadd.d throughput: 0.207318 ns 38.588062 GFlops latency: 2.008897 ns :
xvfmsub.s throughput: 0.200228 ns 79.909088 GFlops latency: 2.009030 ns :
xvfmsub.d throughput: 0.207286 ns 38.593990 GFlops latency: 2.009614 ns :
xvfnmadd.s throughput: 0.200218 ns 119.869255 GFlops latency: 2.009093 ns :
xvfnmadd.d throughput: 0.200299 ns 59.910290 GFlops latency: 2.016122 ns :
xvfnmsub.s throughput: 0.200186 ns 119.888283 GFlops latency: 2.008883 ns :
xvfnmsub.d throughput: 0.200372 ns 59.888592 GFlops latency: 2.021650 ns :
xvfmax.s throughput: 0.200176 ns 39.964741 GFlops latency: 0.800971 ns :
xvfmax.d throughput: 0.200248 ns 19.975222 GFlops latency: 0.807695 ns :
xvfmin.s throughput: 0.200313 ns 39.937408 GFlops latency: 0.807854 ns :
xvfmin.d throughput: 0.200143 ns 19.985729 GFlops latency: 0.801426 ns :
xvfmaxa.s throughput: 0.200231 ns 39.953815 GFlops latency: 0.807750 ns :
xvfmaxa.d throughput: 0.200206 ns 19.979404 GFlops latency: 0.808101 ns :
xvfmina.s throughput: 0.200171 ns 39.965832 GFlops latency: 0.800680 ns :
xvfmina.d throughput: 0.200221 ns 19.977919 GFlops latency: 0.807889 ns :
xvflogb.s throughput: 0.200209 ns 39.958298 GFlops latency: 1.608566 ns :
xvflogb.d throughput: 0.200250 ns 19.975023 GFlops latency: 1.608510 ns :
xvfclass.s throughput: 0.207207 ns 38.608650 GFlops latency: 0.800772 ns :
xvfclass.d throughput: 0.200166 ns 19.983364 GFlops latency: 0.807992 ns :
xvfsqrt.s throughput: 2.409462 ns 3.320243 GFlops latency: 4.425754 ns :
xvfsqrt.d throughput: 1.808659 ns 2.211583 GFlops latency: 3.219351 ns :
xvfrecip.s throughput: 2.417677 ns 3.308962 GFlops latency: 4.427438 ns :
xvfrecip.d throughput: 1.808760 ns 2.211459 GFlops latency: 3.217133 ns :
xvfrsqrt.s throughput: 3.624846 ns 4.413981 GFlops latency: 6.834527 ns :
xvfrsqrt.d throughput: 2.416098 ns 3.311124 GFlops latency: 4.424533 ns :
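
The nanosecond figures above are easier to compare across machines when expressed in clock cycles. The sketch below does that conversion for a few rows. It assumes a 2.5 GHz core clock, which is not stated anywhere in this thread (the lscpu output only reports BogoMIPS), so treat `ASSUMED_CLOCK_GHZ` and the resulting cycle counts as illustrative only. The helper name is hypothetical.

```python
# Assumption: the core runs at 2.5 GHz. This value is NOT given in the thread.
ASSUMED_CLOCK_GHZ = 2.5


def ns_to_cycles(ns: float, clock_ghz: float = ASSUMED_CLOCK_GHZ) -> float:
    """Convert a duration in nanoseconds to clock cycles at the given frequency."""
    return ns * clock_ghz


# (throughput ns, latency ns), copied from the test results above.
rows = {
    "xvfmadd.s": (0.200183, 2.008960),
    "xvfmax.s": (0.200176, 0.800971),
    "xvfdiv.s": (2.416477, 4.426337),
}

for name, (throughput_ns, latency_ns) in rows.items():
    per_cycle = 1.0 / ns_to_cycles(throughput_ns)  # instructions issued per cycle
    latency_cycles = ns_to_cycles(latency_ns)
    print(f"{name}: {per_cycle:.2f} instr/cycle, latency {latency_cycles:.2f} cycles")
```

Under that assumed clock, the 0.2 ns throughput rows correspond to roughly two instructions per cycle, and the 2.0 ns latencies to roughly five cycles.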