flame / blis

BLAS-like Library Instantiation Software Framework

Poor DGEMM performance for armsve build on Neoverse N2 #641

Open chrisgoodyer opened 2 years ago

chrisgoodyer commented 2 years ago

Hi.

Whilst doing some comparative benchmarking on the Alibaba Cloud g8m instances I've run into some BLIS performance issues. g8m is based on Arm's Neoverse N2 technology and has 2x128-bit SVE vectors.

When I build for the "armsve" target I get a peak performance of between 5 and 6 GFLOPS on a single core, rather than the ~20 GFLOPS I get from the NEON implementation.

There seems to be an awful lot of time spent in the function "bli_dpackm_mrxk_armsve_ref", which makes me think it is packing incorrectly for the 128-bit vector length. Running on AWS Graviton3 instances (with a 256-bit vector length) does not show these issues.

Thanks.

Chris

devinamatthews commented 2 years ago

I think, of the currently-available configs, that ThunderX2 should perform best on N2. The SVE kernels are tuned for 256+ bit so I think you really want a neon kernel. A "real" Neoverse N1 kernel/configuration should be in master shortly.
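For anyone trying this suggestion: selecting an alternative configuration happens at configure time (standard BLIS build procedure; the `-j` parallelism flag is optional):

```shell
# Build BLIS with the thunderx2 (NEON) configuration instead of armsve.
./configure thunderx2
make -j
make check
```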

jlinford commented 2 years ago

Good to hear about the N1 kernel coming to master. I also suggest building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For GEMM, I don't see SVE128 having a significant advantage over NEON128. If you build a kernel that can feed four NEON SIMD units it should run very well on all known Arm server-class CPUs, even if they don't have wide SVE units.

jdiamondGitHub commented 2 years ago

FWIW, the existing NEON kernel we have was originally designed for dual-issue NEON, but manages to issue to four 128-bit NEON units on the Apple M1-series cores and achieves over 99% of peak flops. It seems that cores designed for issuing to 4 pipes scale up older code pretty well, although I haven't tested a Graviton3 yet. :)


xrq-phys commented 1 year ago

Apologies for this late response.

For Graviton 3, 2xSVE256 does better than 4xNEON by about 2% or so.

armsve is not well suited to a 128-bit vector length: SVE's lack of indexed FMA reduces the register capacity available for hiding instruction latency. Even so, 5~6 GFLOPS is unexpectedly low (it should be ~15). A possible reason is that your Neoverse N2 core does not implement the hardware prefetching that kernels/armsve assumes. I do not know how Alibaba Cloud differs from, say, Amazon C7g or Oracle Ampere, but using the NEON kernels should be good for your machine.