Open Artoria2e5 opened 3 years ago
If you set a proper `-march` and `-O3`, gcc and clang should autovectorize suitable loops for SVE, and they are pretty good at it. Try: `-march=armv8-a+sve -O3`.
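For illustration, here is a minimal sketch of the kind of loop the compilers can usually vectorize on their own. The function name `axpy` and the file name below are made up for this example and are not taken from the codebase. `__restrict` is a GCC/Clang extension that tells the compiler the two arrays do not overlap.

```cpp
#include <cstddef>

// Hypothetical kernel (not from this repository): y[i] += a * x[i].
// A plain counted loop with no aliasing between x and y is a typical
// candidate for autovectorization.
void axpy(float a, const float* __restrict x, float* __restrict y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Assuming the function is saved as `axpy.cpp`, something like `g++ -O3 -march=armv8-a+sve -S axpy.cpp` emits assembly you can inspect for SVE `z` registers to check whether the loop was actually vectorized.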
This is the conclusion I came to too. I mean, the same can be said for a lot of the other asm code here, and most of the difference is really due to the hard-coded elempack sizes.
The main problem is really about making the elempack stuff more flexible. And for the RISC arches, plumbing through alignment.
I think the issue is that there is no real hardware to test hand-written code on at the moment. There is the Fujitsu A64FX, but it is very hard to get a development system for it. There is also an ARM simulator, but using it is a slow and tedious process.
ARM SVE and RISC-V Vector are similar in spirit, and qemu supports both. Maybe we could write a simple layer over them?
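As a rough sketch of what such a layer could look like, the function below keeps one vector-length-agnostic loop shape and selects the backend at compile time. The name `scale_add` is hypothetical and this is not an existing ncnn API. Only an SVE branch and a scalar fallback are shown. A RISC-V Vector branch would presumably follow the same strip-mining pattern but is left out here.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper (not an ncnn API): y[i] += a * x[i] for i in [0, n).
// The loop never assumes a fixed vector width or pack size.
void scale_add(float a, const float* x, float* y, std::size_t n)
{
#if defined(__ARM_FEATURE_SVE)
    // svcntw() is the number of 32-bit lanes per vector on this hardware.
    for (std::size_t i = 0; i < n; i += svcntw())
    {
        // The predicate enables only lanes with i + lane < n, so the tail
        // needs no separate scalar loop.
        svbool_t pg = svwhilelt_b32(static_cast<uint64_t>(i), static_cast<uint64_t>(n));
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a); // vy + vx * a on active lanes
        svst1_f32(pg, y + i, vy);
    }
#else
    // Scalar fallback for targets without SVE.
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
#endif
}
```

Because the SVE branch is guarded by `__ARM_FEATURE_SVE`, the same source builds on other targets, and an SVE build can be exercised under qemu user-mode emulation when no hardware is available.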