amd / amd-fftw

FFTW code optimized for AMD based processors
GNU General Public License v2.0

Any plan for pre-optimized wisdom files for AMD cpus? #4

Closed ahnitz closed 3 years ago

ahnitz commented 4 years ago

It would be a huge convenience if there were some pre-optimized wisdom files (targeting a range of FFT configurations and sizes) for each AMD CPU. I do scientific computing that requires very fast FFTs, and in my experience the FFT planning (training) is extremely important (I've sometimes seen a 2x speedup going from plan level 1 to 3), but it is often very costly in wall time to do properly.
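For context, a minimal sketch (not from this thread) of how FFTW wisdom can be generated once with a higher planning effort and then reused across runs; the sizes, planner flag, and file name below are placeholders, not a recommended configuration:

```c
/* Sketch: pre-train FFTW wisdom for a set of sizes so the planning cost is
 * paid once and later runs can import it. Sizes and file name are
 * illustrative only. Compile with: gcc wisdom.c -lfftw3f -lm */
#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    const int sizes[] = {1 << 18, 1 << 19, 1 << 20};   /* hypothetical sizes */
    const char *wisdom_file = "fftwf_wisdom.dat";      /* hypothetical name  */

    /* Reuse existing wisdom if present; planning below is then nearly free. */
    fftwf_import_wisdom_from_filename(wisdom_file);

    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * sizes[i]);
        fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * sizes[i]);
        /* FFTW_PATIENT is a higher planning effort than FFTW_MEASURE. */
        fftwf_plan p = fftwf_plan_dft_1d(sizes[i], in, out,
                                         FFTW_BACKWARD, FFTW_PATIENT);
        fftwf_destroy_plan(p);
        fftwf_free(in);
        fftwf_free(out);
    }

    /* Save the accumulated wisdom so later runs skip the training step. */
    if (!fftwf_export_wisdom_to_filename(wisdom_file))
        fprintf(stderr, "could not write %s\n", wisdom_file);
    return 0;
}
```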

BiplabRaut commented 4 years ago

Dear Alex, Thank you for the suggestion. We are currently working with a few scientific and HPC applications to identify their commonly used FFT sizes. We will make pre-optimized wisdom files for such sizes available in the near future.

However, it would be great to know more about the applications you work on. I could see that you are the lead developer of PyCBC. We would be happy to see AMD-FFTW making a big difference to your applications.

Can you let us know more details about your application?

1) What FFT problem sizes are used in your application?
2) Is the FFT computed in single-precision or double-precision?
3) Is the FFT performed in-place or out-of-place?
4) Is the FFT computation single-threaded, multi-threaded, MPI, or hybrid (MPI+OpenMP)? Let us know the number of threads and/or MPI ranks used.
5) What is your test system configuration: AMD CPU model, OS, compiler, and MPI framework details?

Thank you. S. Biplab Raut

ahnitz commented 4 years ago

Everything we do uses single-precision, out-of-place transforms, and we don't use MPI for FFTs. We have a couple of applications that differ somewhat in the FFTs they use.

Offline analysis: this is our basic archival search for gravitational waves. One of its more expensive operations is an FFT, and the entire analysis is embarrassingly parallel, so our target is generally overall machine throughput.

Low-latency analysis: this is the system used for rapid identification of gravitational waves. The FFT sizes are more varied here, and we use both OpenMP and the batch FFT interface.
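For illustration only (this is not the production code; the transform length, batch count, and thread count are placeholders), a minimal sketch of a threaded, batched, single-precision, out-of-place inverse transform using FFTW's advanced interface:

```c
/* Sketch of a threaded, batched, out-of-place single-precision inverse FFT.
 * Length, batch size, and thread count are illustrative only.
 * Compile with: gcc batch.c -fopenmp -lfftw3f_omp -lfftw3f -lm */
#include <fftw3.h>

int main(void)
{
    const int n = 15 * (1 << 14);   /* hypothetical transform length */
    const int howmany = 30;         /* hypothetical batch size */

    fftwf_init_threads();
    fftwf_plan_with_nthreads(4);    /* e.g. 4 OpenMP threads */

    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * (size_t)n * howmany);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * (size_t)n * howmany);

    /* One plan for the whole batch: contiguous 1D transforms of length n. */
    fftwf_plan p = fftwf_plan_many_dft(1, &n, howmany,
                                       in,  NULL, 1, n,
                                       out, NULL, 1, n,
                                       FFTW_BACKWARD, FFTW_MEASURE);
    fftwf_execute(p);               /* input left uninitialized in this sketch */

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    fftwf_cleanup_threads();
    return 0;
}
```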

Details of our computing

Site 1:
- CPU: EPYC 7452 in dual-socket configuration
- OS: Debian Buster
- Compiler: gcc 8.3.0

BiplabRaut commented 4 years ago

Dear Alex, Thank you for providing information about your applications and test system.

I would like to know a few more things, as listed below.

1) I understand that only 1D FFTs are used by your program? Are the inputs to the FFT complex or real?
2) For both “offline analysis” and “low-latency analysis”, I understand that multiple instances of the program are run based on the available cores? Do you run with hyperthreading (SMT) on or off?
3) In the case of “low-latency analysis”, as per your example, would the 4 threads perform the batch of 30 FFTs of size 15 * 2^14?
4) Can you tell us the FFT's share of the total wall time for “offline analysis” and for “low-latency analysis”?
5) I hope both FFTW and MKL are supported in your software. Can you tell us the current performance difference between MKL and FFTW with your software?

Thank you.

ahnitz commented 4 years ago

1) Yes, these are all 1D FFTs. I should have clarified that they are actually inverse FFTs only, and always complex-to-complex transforms.
2) We have been running with hyperthreading enabled.
3) Yes, that's right. Though I imagine this might not be easy to optimize, since there are really many (similar) sizes used at the same time; the number of possible permutations is somewhat large. This is one reason we've generally stuck with MKL for this application in production.
4) Under ideal conditions, FFTs are ~70% of the compute time for "offline" and ~80% for low-latency.
5) Yes, indeed we support both MKL and FFTW. MKL is generally faster for us for the "offline" analysis by ~30%, though I am comparing to the "measure" planning level here (my initial testing with amd-fftw didn't yield much improvement moving to "patient", and I haven't completed an "exhaustive" run yet to benchmark it). For the low-latency analysis, I don't have recent numbers on AMD CPUs, but I can say that on the older Intel CPUs we used for previous production analyses, we generally found comparable performance between MKL and FFTW using "patient" wisdom files. It was more convenient to use MKL in this case, as it allowed more flexibility in tuning the FFT sizes the analysis uses.

BiplabRaut commented 4 years ago

Dear Alex, Thank you for all the details. Appreciate it.

We want to run PyCBC ourselves to evaluate and benchmark it with pre-optimized wisdom files for amd-fftw. Please guide us through the detailed steps for installing, setting up, compiling, and running PyCBC and its associated applications for both the "offline" and "low-latency" analyses.

In all our standalone tests so far, amd-fftw has shown superior performance for sizes 2^18 - 2^22. Is the planning (training) time included in your application's total compute time? FFTW's PATIENT planner mode is generally known to produce significantly better-optimized plans than MEASURE mode. How are you setting and changing these planner modes?
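For reference only (this is not how PyCBC configures its planner; the size and mode name below are placeholders), a minimal sketch of selecting the FFTW planner mode by name and timing planning separately from execution, so the training cost can be reported apart from the transform time:

```c
/* Sketch: map a mode name to an FFTW planner flag and time planning vs.
 * execution separately. Size and mode are illustrative only.
 * Compile with: gcc plan_timing.c -lfftw3f -lm */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <fftw3.h>

static unsigned flag_from_name(const char *name)
{
    if (strcmp(name, "patient") == 0)    return FFTW_PATIENT;
    if (strcmp(name, "exhaustive") == 0) return FFTW_EXHAUSTIVE;
    return FFTW_MEASURE;                 /* default "measure" level */
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    const int n = 1 << 20;               /* hypothetical size */
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * n);
    memset(in, 0, sizeof(fftwf_complex) * n);

    double t0 = now();
    fftwf_plan p = fftwf_plan_dft_1d(n, in, out, FFTW_BACKWARD,
                                     flag_from_name("patient"));
    double t1 = now();
    fftwf_execute(p);
    double t2 = now();

    printf("planning: %.3f s, execution: %.6f s\n", t1 - t0, t2 - t1);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}
```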

Can you also share details on the MKL version you are using? This will help in our benchmarking tests of PyCBC.

Thank you.

ahnitz commented 4 years ago

@BiplabRaut I'll get back to you with some detailed instructions for running the offline executable. It might be a few days before I can get to that, but I'm very interested to see what you all come up with.

BiplabRaut commented 3 years ago

Dear Alex, Did you get a chance to look into this? Even initial steps for one of the two analysis modes would get us started.

Thank you.

BiplabRaut commented 3 years ago

Since no response has been received, we are closing this issue. Please feel free to contact us if you want to pursue it again.