flame / blis

BLAS-like Library Instantiation Software Framework

Performance testing on AMD7502 (zen2) #548

Open bartoldeman opened 3 years ago

bartoldeman commented 3 years ago

Hi,

FYI, I was curious how BLIS fares now against slightly newer versions of MKL/OpenBLAS and also AMD's fork.

Zen2

Zen2 experiment details

Zen2 results

[Figure: inline PNG performance plots. Legend: black = BLIS, green = AMD BLIS, red = OpenBLAS, blue = MKL]

devinamatthews commented 3 years ago

@bartoldeman thanks for collecting this data, it looks great! For level-3 BLAS operations, "vanilla" BLIS and AMD BLIS should be essentially the same, with perhaps slightly better performance from AMD BLIS for small non-GEMM operations (although I think the only one not ported back so far is GEMMT, which you didn't test). It seems like there might be thermal throttling issues in some of these, especially the complex operations. IIRC the test driver runs from large to small problems, so the dips on the multi-threaded complex gemm may be the processor lowering its frequency as it heats up. The thermal issue can also show up in other ways, e.g. if AMD BLIS is always tested after "vanilla" BLIS then it may get throttled more.

devinamatthews commented 3 years ago

@fgvanzee this gives me an idea: what if we modified the test driver to also count cycles (e.g. rdtsc on x86) and print FLOPS/cycle in addition to GFLOPs?
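To make the two metrics concrete, here is a minimal sketch in C of how GFLOPS and FLOPS/cycle could be derived from one measurement. The function names are hypothetical and are not part of the BLIS test driver; the flop count uses the usual 2·m·n·k convention for real GEMM.

```c
#include <assert.h>
#include <stdint.h>

/* Flop count for a real GEMM with an (m x k) times (k x n) product:
   one multiply and one add per inner-product term. */
static inline double gemm_flops(uint64_t m, uint64_t n, uint64_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* GFLOPS from a flop count and elapsed wall-clock seconds. */
static inline double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1.0e9;
}

/* Flops per clock cycle. This is only meaningful if `cycles` comes from
   a counter that follows the actual core clock (assumption: the caller
   has such a counter available). */
static inline double flops_per_cycle(double flops, uint64_t cycles)
{
    assert(cycles > 0);
    return flops / (double)cycles;
}
```

GFLOPS drops when the processor lowers its frequency, whereas FLOPS/cycle should stay roughly constant for a given kernel, which is what would make it useful for separating throttling from real performance differences.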

bartoldeman commented 3 years ago

Thanks for the feedback! I'm not sure that explains what affects single-threaded *trsm (m,n,k<1000) and all-threaded zherk, though.

I also tested on Skylake-X (Intel 6148, dual socket, 2x20 cores). I'll attach the pictures now and post details later. IMHO, no big surprises here versus your tests: MKL wins overall but not everywhere, and BLIS and AMD BLIS are pretty much the same everywhere; the differences look quite noisy.

1 thread: l3_perf_blg_nt1

1 socket (jc2ic10jr1_nt20): l3_perf_blg_jc2ic10jr1_nt20

2 sockets (jc4ic10jr1_nt40): l3_perf_blg_jc4ic10jr1_nt40

devinamatthews commented 3 years ago

Yeah our SKX performance is not stellar. I wrote that kernel so it's totally my fault! :smile:

tlrmchlsmth commented 3 years ago

> @fgvanzee this gives me an idea: what if we modified the test driver to also count cycles (e.g. rdtsc on x86) and print FLOPS/cycle in addition to GFLOPs?

rdtsc is actually a measurement of time, not clock cycles! (It returns the number of nominal clock cycles elapsed, regardless of turbo boost, throttling, etc.)
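One way to see this for yourself is to compare the TSC against a wall clock. The sketch below is an illustration rather than anything taken from BLIS; `wall_seconds` and `estimate_tsc_hz` are hypothetical names. It assumes GCC or Clang on x86, where `__rdtsc()` is provided by `<x86intrin.h>`, and a processor with an invariant TSC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>            /* __rdtsc() with GCC and Clang */
#endif

/* Wall-clock seconds from a monotonic clock. */
static inline double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Estimate TSC ticks per second by busy-waiting for about `interval`
   seconds. Returns 0.0 on non-x86 builds, where no TSC is read. */
static inline double estimate_tsc_hz(double interval)
{
#if defined(__x86_64__) || defined(__i386__)
    assert(interval > 0.0);
    double t0 = wall_seconds();
    uint64_t c0 = __rdtsc();
    double t1;
    do {
        t1 = wall_seconds();      /* spin so the core stays busy */
    } while (t1 - t0 < interval);
    uint64_t c1 = __rdtsc();
    return (double)(c1 - c0) / (t1 - t0);
#else
    (void)interval;
    return 0.0;
#endif
}
```

On a machine with an invariant TSC, the estimate should come out close to the same fixed rate whether the core is boosting or throttled, which is why dividing a flop count by TSC ticks gives a rescaled GFLOPS figure rather than true FLOPS/cycle.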

devinamatthews commented 3 years ago

That is an insane design decision. Isn't there another easy instruction to read the actual number of cycles elapsed? It would be a pain to have to hook into PAPI or something.

devinamatthews commented 3 years ago

Looks like you have to use the PMU, so PAPI or similar is the only portable way. Too bad.
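For reference, reading the PMU cycle counter directly on Linux might look like the sketch below. It uses the `perf_event_open` system call and is Linux-only, so it is not the portable route discussed above. The helper names (`cycle_counter_open`, `count_cycles`, `demo_work`) are hypothetical, and the counter covers only the calling thread, so a multi-threaded run would need one counter per thread or inherited counters.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for syscall() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__linux__)
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Open a user-space CPU-cycle counter for the calling thread.
   Returns a file descriptor, or -1 if unavailable (non-Linux, no PMU,
   or insufficient permissions, e.g. a strict perf_event_paranoid). */
static inline int cycle_counter_open(void)
{
#if defined(__linux__)
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;            /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
#else
    return -1;
#endif
}

/* Count cycles spent in fn(arg). Returns 0 on success, -1 otherwise. */
static inline int count_cycles(void (*fn)(void *), void *arg, uint64_t *cycles)
{
    assert(fn != NULL && cycles != NULL);
    int fd = cycle_counter_open();
    if (fd < 0)
        return -1;
#if defined(__linux__)
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t got = read(fd, cycles, sizeof *cycles);
    close(fd);
    return got == (ssize_t)sizeof *cycles ? 0 : -1;
#else
    return -1;
#endif
}

/* Hypothetical stand-in workload: sums the first *arg integers. */
static inline void demo_work(void *arg)
{
    volatile uint64_t acc = 0;
    uint64_t n = *(uint64_t *)arg;
    for (uint64_t i = 0; i < n; ++i)
        acc += i;
}
```

A caller would pass the routine under test in place of `demo_work` and treat a return value of -1 as "no cycle count available", falling back to wall-clock GFLOPS only.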

rvdg commented 3 years ago

Great project for an undergrad?


tlrmchlsmth commented 3 years ago

Sorry for being the bearer of bad news. I don't know a better way.

BhaskarNallani commented 2 years ago

gemm, trsm, and gemmt have been improved on zen2/3, and the improvements will be released as part of the upcoming AMD-BLIS release.