amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

Support AVX512F #16

Open GabeAl opened 7 years ago

GabeAl commented 7 years ago

Since raxml-ng relies on libpll, maybe this feature request is best lodged there rather than here. A quick look shows a similar feature request that has been open since 2013, but it was presumably for the KNC add-in card, and the conclusion was "we should not do this on the current code from the master branch." https://app.assembla.com/spaces/phylogenetic-likelihood-library/tickets/72-port-likelihood-function-xeon-phi/details#

So I think here might be the best place?

Rationale for this request is both personal and general -- personally, I have observed ExaML-KNL is up to 5x faster than NG on my KNL machine; generally, we will all soon have AVX512 anyway! :)

amkozlov commented 7 years ago

@GabeAl: you're right, it mostly concerns libpll, and we do have it on our radar anyway. Still, please feel free to open a corresponding issue in the (new) libpll repository:

https://github.com/xflouris/libpll

GrassW commented 1 year ago

May I know how exactly to compile with AVX512? E.g., cmake -DENABLE_AVX512=TRUE? Thanks.

amkozlov commented 1 year ago

Despite extensive experiments, we never managed to get reasonable speedups with AVX512 compared to AVX2. Hence, AVX512 support was never integrated into RAxML-NG.

GabeAl commented 1 year ago

This certainly made sense at the time the testing was likely done. A few things have changed recently, including much more RAM bandwidth (in cases where that was the bottleneck) and much more efficient AVX512 hardware (including more on-die resources and optimizations), which might warrant re-running some of those experiments. I guess it's a next-next generation RAxML idea, perhaps? I know ExaML was at least 3 times faster on KNL back in the day, but since then no system has treated double-precision AVX512 as a first-class citizen again until Zen 4 and Ice Lake/Sapphire Rapids.

https://www.phoronix.com/review/intel-sapphirerapids-avx512

The above link shows the profound differences over time (including reduced penalties and proportionately greater performance) for essentially the same AVX512 code. A modern 12-channel 4800 MHz DDR5 Genoa system or a new 60-core Sapphire Rapids system seems to net proportionately more performance from AVX512 in double-precision floating-point math than the sad old Skylake/Cascade Lake Xeons (which were indeed bottlenecked in many ways).
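
For anyone wanting to re-run such comparisons, a first step is confirming that the machine actually reports AVX512F. Below is a minimal sketch, assuming a Linux x86 system where `/proc/cpuinfo` has a `flags` line; the helper names `cpu_flags` and `has_avx512f` are made up for illustration and are not part of RAxML-NG or libpll.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text.

    Assumption: the text uses the Linux x86 layout, with a line of the
    form "flags : fpu vme ...". Other platforms name this line differently.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512f(cpuinfo_text):
    """True if the AVX512 Foundation flag is listed among the CPU flags."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        # Not Linux, or the file is unreadable: treat as "no information".
        text = ""
    print("AVX512F reported:", has_avx512f(text))
```

This only says what the kernel reports for the CPU; it says nothing about whether a given build of any program uses those instructions.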

amkozlov commented 1 year ago

@GabeAl thanks for the heads up!

Good to know that AMD added AVX512 support as well, and Genoa benchmark results look really impressive!

GrassW commented 1 year ago

@amkozlov @GabeAl thanks a lot, both of you. I see. I will consider an AMD CPU next time if my budget can cover it. :)