long double and quad precision tests with --enable-amd-opt are failing

migueldiascosta commented 5 years ago

The tests pass without --enable-amd-opt for all precisions, and they pass with --enable-amd-opt for float and double, but they fail with --enable-amd-opt for long double and quad precision. Are these not supported?

(This is on a server with AMD EPYC 7601 CPUs, using EasyBuild's FFTW easyblock, https://github.com/easybuilders/easybuild-easyblocks/blob/master/easybuild/easyblocks/f/fftw.py, on the sources from https://github.com/amd/amd-fftw/archive/2.0.tar.gz)

migueldiascosta commented 5 years ago

On the bright side, I do see an average ~15% performance boost using the gearshifft benchmark. I hope you're planning to contribute these optimizations back to FFTW.

pradeeptrgit commented 5 years ago

Hi Miguel Dias Costa,

Thanks for the feedback. The --enable-amd-opt option currently works only for float and double and has not been tested for other datatypes. We missed mentioning this in the documentation. Apologies for the same. We will fix this soon.

Regarding the gearshifft benchmark, it is nice to know that you are seeing improved performance. It would be great if you could provide more details on the types of FFTs and the sizes where you are observing the improvement. We will also try to run the benchmark on our end and analyze the results.

migueldiascosta commented 5 years ago

@pradeeptrgit thanks for the clarification on the long double and quad precisions

Regarding the benchmarks, I used the default gearshifft tests (I'll submit a PR with the AMD FFTW variant and gearshifft using it to the easybuild easyconfigs repository soon).

Those were single-threaded tests. Using 2 threads, the average speedup is a bit higher, ~20%, and it then goes down to ~10% for 4 threads. For threaded runs, however, there are other factors (e.g., processor affinity, NUMA effects), and I'm not as interested in those (I mostly use one thread per MPI process of whatever application).

These were also without --enable-amd-trans. The tests fail for me if I use it, so I'll open a separate issue on that.

migueldiascosta commented 5 years ago

https://github.com/easybuilders/easybuild-easyconfigs/pull/8783

BiplabRaut commented 5 years ago

Hi Miguel Dias Costa, thank you for providing details on the configuration of the benchmark tests you performed. Regarding the flag "--enable-amd-trans", I will provide more details in my reply under the separate issue you reported.

However, let me share a few important details about the AMD-optimized FFTW library (amd-fftw).

The improvements released with amd-fftw can be enabled by the option "--enable-amd-opt".
The improvements made under "--enable-amd-opt" are tested and supported only for the float and double data types as of now. We will support the same improvements for long double and quad in the coming months.
The option "--enable-amd-trans" is an optional improvement feature that may be enabled for very large sizes. However, this feature is currently supported only in single-threaded mode. There is a known issue when using it in threaded/hybrid mode. We suggest not using the "--enable-amd-trans" option for your multi-threaded or hybrid benchmarking modes. For threaded/hybrid modes, please use only the "--enable-amd-opt" option, which is the main AMD performance switch.
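
The guidance above can be sketched as a small shell helper that picks configure flags for a given precision and threading mode. This is only an illustration: `amd_fftw_flags` is a hypothetical function, not part of amd-fftw. The precision and threading switches are the standard FFTW 3 configure options, and restricting "--enable-amd-trans" to builds that also use "--enable-amd-opt" is a conservative assumption that the thread does not confirm.

```shell
# Hypothetical helper (not part of amd-fftw): print configure flags that
# follow the guidance in this thread.
# Usage: amd_fftw_flags <float|double|long-double|quad> <threaded: yes|no> [very_large_sizes: yes|no]
amd_fftw_flags() (
    precision="$1"
    threaded="$2"
    large="${3:-no}"
    flags=""
    amd="no"

    # Standard FFTW 3 precision switches; double is the default build.
    case "$precision" in
        float)       flags="--enable-float"; amd="yes" ;;
        double)      amd="yes" ;;
        long-double) flags="--enable-long-double" ;;
        quad)        flags="--enable-quad-precision" ;;
        *) echo "unknown precision: $precision" >&2; exit 1 ;;
    esac

    # --enable-amd-opt is tested and supported only for float and double.
    if [ "$amd" = "yes" ]; then
        flags="${flags:+$flags }--enable-amd-opt"
    fi

    if [ "$threaded" = "yes" ]; then
        # Threaded builds: never add --enable-amd-trans (known issue).
        flags="${flags:+$flags }--enable-threads"
    elif [ "$large" = "yes" ] && [ "$amd" = "yes" ]; then
        # Optional, single-threaded only, intended for very large sizes.
        # Tying it to --enable-amd-opt is an assumption made here.
        flags="${flags:+$flags }--enable-amd-trans"
    fi

    printf '%s\n' "$flags"
)
```

For example, `./configure $(amd_fftw_flags double no)` would configure a sequential double-precision build with only "--enable-amd-opt".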

BiplabRaut commented 4 years ago

> on the bright side, I do see an average ~15% performance boost using the gearshifft benchmark, I hope you're planning to contribute these back to FFTW

Hi Miguel Dias Costa, can you tell us which test dataset you ran? We would like to reproduce the test results on our machines. Since there are multiple "extents" files in gearshifft, we want to know which one is most widely used and referred to. Secondly, gearshifft seems to log its output only after all the test cases have completed. Is there any configuration in gearshifft to enable output after each individual test completes? (This would allow us to stop the gearshifft tests partway through when they take a long time, without losing the results so far.)

migueldiascosta commented 4 years ago

@BiplabRaut my initial tests were with gearshifft's "default fallback" extents, which are indeed too small. The speedups are more modest as the size increases (and there is a large variation...)

Regarding the long runs, what I did was loop (e.g. in bash) over the extents (e.g., in one of the provided files) in order to have a separate run and output per extent (using the -e and -o arguments)
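
A minimal sketch of such a loop, under stated assumptions: `run_per_extent` is a hypothetical helper, the default binary name `gearshifft_fftw` is assumed, and the extents file is assumed to hold one extent per line with `#` marking comment lines. Only the -e and -o arguments come from the description above.

```shell
# Hypothetical helper: run the benchmark once per extent so that every
# extent gets its own output file and an interrupted run keeps earlier results.
# Usage: run_per_extent <extents_file> <output_dir> [benchmark_command]
run_per_extent() (
    extents_file="$1"
    out_dir="$2"
    bench="${3:-gearshifft_fftw}"   # assumed binary name

    mkdir -p "$out_dir" || exit 1

    # Read one extent per line (assumed file layout); also handle a last
    # line that lacks a trailing newline.
    while IFS= read -r extent || [ -n "$extent" ]; do
        # Skip blank lines and comment lines.
        case "$extent" in
            ''|'#'*) continue ;;
        esac

        # Build a file-system-safe output name from the extent.
        safe=$(printf '%s' "$extent" | tr -c 'A-Za-z0-9' '_')
        out="$out_dir/result_$safe.csv"

        # One run and one output file per extent; stdin is detached so the
        # benchmark cannot consume the rest of the extents file.
        "$bench" -e "$extent" -o "$out" </dev/null ||
            echo "run failed for extent $extent" >&2
    done < "$extents_file"
)
```

Calling `run_per_extent extents_small.conf results` would then leave one result file per extent in `results/`, so a long run can be stopped at any point without losing the extents already completed.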

I'm afraid I don't have any particular insight on gearshifft or how it is widely used - you may want to refer to https://arxiv.org/abs/1702.00629, if you haven't already

BiplabRaut commented 4 years ago

Hi Miguel Dias Costa, AMD-FFTW 2.1 has been released with more optimizations. Can you please repeat your gearshifft tests with this new release and let us know your results?

Thank you.

migueldiascosta commented 4 years ago

@BiplabRaut I'll likely only look at this more carefully when we get our Rome nodes, but in a quick run on Naples using gearshifft's extents_small.conf I got an average speedup of 20% with AMD-FFTW 2.1 (sequential) compared to FFTW 3.3.8. Again, I hope these optimizations make it upstream.
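
The thread does not say how the average was computed. One possible definition, sketched here as an assumption, is the mean over extents of (baseline time / optimized time - 1). `mean_speedup` is a hypothetical helper, and the three-column input layout (extent, baseline time, optimized time) is likewise assumed rather than taken from gearshifft's output format.

```shell
# Hypothetical helper: read "extent baseline_time optimized_time" lines on
# stdin and print the mean per-extent speedup in percent.
mean_speedup() {
    awk '
        # Use only rows whose third field is a positive number
        # (this also skips a header row, if one is present).
        NF >= 3 && ($3 + 0) > 0 {
            sum += $2 / $3 - 1   # relative speedup for this extent
            n++
        }
        END {
            if (n == 0) exit 1   # no usable rows
            printf "%.1f\n", 100 * sum / n
        }
    '
}
```

With this definition, an extent that takes 3 s on the baseline and 2 s on the optimized build counts as a 50% speedup. Other definitions, such as the ratio of total times, can give a different average from the same data.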

BiplabRaut commented 4 years ago

Hi Miguel Dias Costa, thank you for checking our 2.1 release and sharing the new Naples results. We look forward to your Rome results, and we hope you will have your Rome nodes soon.

Thank you.

amd / amd-fftw

long double and quad precision tests with --enable-amd-opt are failing #1