LBNL-AMO-MCTDHF / V1

LBNL-AMO-MCTDHF v1.36
https://commons.lbl.gov/display/csd/LBNL-AMO-MCTDHF

INTEL MPI 3D FFT #13

Closed djhaxton closed 8 years ago

djhaxton commented 9 years ago

Folks,

This is a double post -- it also appears on the TO-DO wiki, because it is a big, important issue.

Surely very little work needs to be done to get this working. It is not difficult if you know what you are doing. I don't.

I have not yet had success with this. I expect a significant speedup, perhaps 3x, from the Intel MPI 3D FFT, as opposed to the current method: Intel 1D FFT + transpose + all-to-all, performed three times over. Subroutine mytranspose would no longer be called.

Note

COMPDIRS/notbin.ecs.hermnorm.law.mpifft

and the -D MPIFFTW flag that occurs in its Makefile.header.

djhaxton commented 9 years ago

An alternative, something that I will try first -- though I will not try it for a while, I don't think -- is to implement MPI Cooley-Tukey. Cooley-Tukey is the FFT algorithm in which the Fourier transform of a vector of length NM is decomposed into N length-M FTs, which are then assembled into the length-NM FT using what are called "twiddle factors."

https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm
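For reference, the textbook general-radix identity (standard Cooley-Tukey, nothing specific to this code): writing the length-NM index as n = N n2 + n1 and k = M k1 + k2, with 0 <= n1, k1 < N and 0 <= n2, k2 < M,

$$
X_{M k_1 + k_2} \;=\; \sum_{n_1=0}^{N-1} e^{-2\pi i\, n_1 k_1 / N}
\left[\, e^{-2\pi i\, n_1 k_2 / (NM)}
\sum_{n_2=0}^{M-1} x_{N n_2 + n_1}\, e^{-2\pi i\, n_2 k_2 / M} \right].
$$

The inner sums are the N length-M transforms, the bracketed exponential is the twiddle factor, and the outer sum is the reassembly (itself M length-N transforms).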

Doing the recursive Cooley-Tukey algorithm with MPI seems natural, but this is not the way Khaled and Thorsten suggested. They suggested the current method: three all-to-alls, with three batches of one-dimensional FFTs.

Getting the Intel MPI 3D FFT going would be great. But we might be able to do just as well and have a code that compiles on machines without the Intel MPI FFT. The ultimate algorithm might speed up the code substantially: a near-optimal 1D MPI FFT in the z direction for the kinetic energy operator might enable the Toeplitz trick (1D) to beat the usual matvec. That would mean that MPI Cooley-Tukey might speed up both the two-electron operator (the triple Toeplitz matvec) and the kinetic energy (via a single Toeplitz matvec). Hypothetically, this could mean a big speedup for huge problems. Replacing all-to-all with sendrecv seems like a great idea.

The algorithm for MPI Cooley-Tukey would presumably follow closely the kinetic energy matrix-vector multiplication routine mult_circ_z(), with N = nprocs. The twiddle factors, an array of size NM x NM x M, are raised to successive powers each time deltaproc is incremented; they multiply the local 3D FFT (also NM x NM x M); the result is communicated via sendrecv; and the results are summed to produce the local NM x NM x M block of the NM x NM x NM Fourier transform.
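Roughly, the loop I have in mind has this shape (a sketch only -- the names ring_accumulate, nloc, twiddle, work are made up here, and the twiddle line is schematic placeholder bookkeeping, not the correct factors; the real routine would sit next to mult_circ_z() in the Fortran):

```c
#include <complex.h>
#include <mpi.h>
#include <stdlib.h>

/* Sketch: ring-style sendrecv accumulation.  Each rank holds a local block
 * "work" of nloc complex values that has already been through the local FFT
 * stage.  For deltaproc = 1..N-1 we apply one more power of the twiddle
 * factors, pass the block one step around the ring with MPI_Sendrecv, and
 * accumulate into "result".  "work" is used as the traveling buffer and is
 * overwritten.  All names are hypothetical. */
void ring_accumulate(double complex *work, const double complex *twiddle,
                     double complex *result, int nloc, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double complex *recvbuf = malloc((size_t)nloc * sizeof *recvbuf);

    /* deltaproc = 0 contribution: the local block itself */
    for (int i = 0; i < nloc; i++)
        result[i] = work[i];

    int dest = (rank + 1) % nprocs;          /* ring neighbors */
    int src  = (rank - 1 + nprocs) % nprocs;

    for (int deltaproc = 1; deltaproc < nprocs; deltaproc++) {
        /* schematic: one more power of the twiddle factors */
        for (int i = 0; i < nloc; i++)
            work[i] *= twiddle[i];

        /* pass the twiddled block one step around the ring */
        MPI_Sendrecv(work,    nloc, MPI_C_DOUBLE_COMPLEX, dest, 0,
                     recvbuf, nloc, MPI_C_DOUBLE_COMPLEX, src,  0,
                     comm, MPI_STATUS_IGNORE);

        /* accumulate the contribution that just arrived */
        for (int i = 0; i < nloc; i++) {
            work[i]    = recvbuf[i];
            result[i] += recvbuf[i];
        }
    }
    free(recvbuf);
}
```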

At the very least, just with the two-electron operator (triple Toeplitz matvec), I think this will give us a speedup of three, going from:

 3 x ( 1D intel fft , all-to-all permutation of indices ) 

to:

 1 x [ 3D intelfft, (N-1) x (multiply twiddle factors then sendrecv) ]
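For reference, the "all-to-all permutation of indices" in the first line above has roughly this shape (a generic slab-transpose sketch in C, assuming an n x n x n grid with n divisible by nprocs; mytranspose itself is Fortran and is not this code):

```c
#include <complex.h>
#include <mpi.h>
#include <stdlib.h>

/* Sketch of one all-to-all index-permutation step.  On entry each rank owns
 * z planes [r*m, (r+1)*m), m = n/P; on exit it owns y planes [r*m, (r+1)*m)
 * and the z index is fully local, so 1D FFTs along z need no communication.
 * Hypothetical helper, not the code's mytranspose routine. */
void transpose_zy(const double complex *a, double complex *b,
                  int n, MPI_Comm comm)
{
    int P, r;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &r);
    int m = n / P;                                 /* planes per rank */

    size_t chunk = (size_t)m * m * n;              /* block sent to one rank */
    double complex *sendbuf = malloc(chunk * P * sizeof *sendbuf);
    double complex *recvbuf = malloc(chunk * P * sizeof *recvbuf);

    /* pack: destination rank d gets the portion with y in [d*m, (d+1)*m) */
    for (int d = 0; d < P; d++)
        for (int zl = 0; zl < m; zl++)
            for (int yl = 0; yl < m; yl++)
                for (int x = 0; x < n; x++)
                    sendbuf[chunk * d + ((size_t)zl * m + yl) * n + x] =
                        a[((size_t)zl * n + (d * m + yl)) * n + x];

    MPI_Alltoall(sendbuf, (int)chunk, MPI_C_DOUBLE_COMPLEX,
                 recvbuf, (int)chunk, MPI_C_DOUBLE_COMPLEX, comm);

    /* unpack: source rank s delivered z in [s*m, (s+1)*m); store as b[yl][z][x] */
    for (int s = 0; s < P; s++)
        for (int zl = 0; zl < m; zl++)
            for (int yl = 0; yl < m; yl++)
                for (int x = 0; x < n; x++)
                    b[((size_t)yl * n + (s * m + zl)) * n + x] =
                        recvbuf[chunk * s + ((size_t)zl * m + yl) * n + x];

    free(sendbuf);
    free(recvbuf);
}
```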

Programming MPI Cooley Tukey is on my list, but I am not going to get to it soon. At the least, it is interesting to think about!

ZBWalters commented 9 years ago

I believe the intel mkl FFT is mostly an interface to FFTW, which is free software. FFTW implements Cooley Tukey using mpi.

See http://www.fftw.org/ and http://www.fftw.org/parallel/parallel-fftw.html

Zach


djhaxton commented 9 years ago

Hi Zach,

This is incorrect:

I believe the intel mkl FFT is mostly an interface to FFTW

The situation is as follows. There are MKL FFT routines, which have a complicated syntax. There are also wrappers available within Intel MKL that are written in the FFTW style: FFTW wrappers for the MKL routines. So the situation is closer to the opposite of what you wrote.

The FFTW wrappers are easy to use, and that is why I am using them. FFTW was the standard; people want to switch to Intel MKL without changing their code, and having the FFTW wrappers available within Intel MKL makes that possible.
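For concreteness, this is the FFTW3-style call sequence involved (a generic serial 3D example, not the actual fftw3dfftsub_mpi code); whether these calls resolve to libfftw3 or to MKL's wrapper layer is decided at link time:

```c
#include <complex.h>   /* include before fftw3.h so fftw_complex is C99 complex */
#include <fftw3.h>     /* either FFTW itself or the fftw3 header from the MKL wrappers */

/* Generic serial 3D complex-to-complex transform in the FFTW3 style. */
void fft3d_example(int n)
{
    fftw_complex *data = fftw_alloc_complex((size_t)n * n * n);

    /* plan before filling: with flags other than FFTW_ESTIMATE,
     * planning may overwrite the array */
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, data, data,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n * n * n; i++)
        data[i] = 0.0;           /* ... fill with the real input here ... */

    fftw_execute(plan);          /* in-place forward transform */

    fftw_destroy_plan(plan);
    fftw_free(data);
}
```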

BUT. MAJOR BUT. There are functions and macros and things that need to be declared, and intel DOES NOT PROVIDE THE FORTRAN FILES TO INCLUDE IN YOUR CODE. They have to be made by the sysadmin, Yong Qin in this case. He has done this, and placed the wrappers in

    [n0004 09:27:49 ~] ls $MKL_DIR/include/fftw
    fftw.h  fftw3.f  fftw3_mkl.h  fftw_mpi.h  rfftw_mpi.h  etc.

I have copied this directory into MCTDH.SRC/FFTW_INCLUDE for those people compiling on machines with MKL, but without a sysadmin kind enough to compile the wrappers for them. I think I am including these files on NERSC but I forget.

I do not believe we, on Lawrencium or NERSC, have the Intel MKL 3D FFT wrappers. WE NEED THEM, and we need to understand how to use them in subroutine fftw3dfftsub_mpi. To reiterate, this subroutine can be used whether you have MKL or FFTW installed on your machine, because of the MKL FFTW wrappers.

Zach, please do some research on this and tell me what you find! If you can find the 3D MKL FFTW wrappers and understand how to use them, it would be a great help. Or, if we can use the 3D FFTW routines from the FFTW library itself rather than Intel, perhaps that would work just as well.

djhaxton commented 9 years ago

Argh, corrections... we do not have 3D MPI FFTW wrappers, it seems. I left out MPI.

I do not believe we, on lawrencium or NERSC, have intel MKL MPI 3D FFT wrappers.

Zach, please do some research on this and tell me what you find! If you can find the MPI 3D MKL FFTW wrappers and understand how to use them, it would be a great help. Or, if we can use the MPI 3D FFTW routines from the FFTW library itself rather than Intel for the MPI FFT, perhaps that would work just as well.
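For reference, the FFTW MPI interface (the library's documented C API; a generic example, not wired into fftw3dfftsub_mpi) looks like this:

```c
#include <complex.h>    /* before the fftw headers, so fftw_complex is C99 complex */
#include <fftw3-mpi.h>
#include <mpi.h>

/* Generic 3D distributed complex transform with FFTW's MPI interface.
 * FFTW slab-decomposes the n x n x n grid along the first dimension. */
int main(int argc, char **argv)
{
    const ptrdiff_t n = 64;               /* example grid size */
    ptrdiff_t local_n0, local_0_start;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* how much of the grid this rank owns, and how much to allocate */
    ptrdiff_t alloc_local = fftw_mpi_local_size_3d(n, n, n, MPI_COMM_WORLD,
                                                   &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc_local);

    fftw_plan plan = fftw_mpi_plan_dft_3d(n, n, n, data, data, MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

    /* fill the local slab: planes local_0_start .. local_0_start + local_n0 - 1 */
    for (ptrdiff_t i = 0; i < local_n0 * n * n; i++)
        data[i] = 0.0;

    fftw_execute(plan);                   /* distributed in-place transform */

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}
```

This would compile with mpicc and link against -lfftw3_mpi -lfftw3; whether the MKL wrapper layer provides an equivalent of this MPI interface is exactly the open question above.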

djhaxton commented 9 years ago

Folks,

I do not have a working Cooley-Tukey subroutine. I would like to get the following working; my twiddle factors are wrong, and I'd like to fix it by changing subroutine gettwiddle() only. Any help would be much appresh.

https://github.com/djhaxton/CooleyTukeyMPI
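For what it's worth, in the textbook factorization quoted earlier the twiddle factors are exp(-2 pi i n1 k2 / (NM)); a gettwiddle()-style routine in that convention would look something like this (hypothetical C, not the Fortran in that repository, and the sign/indexing convention may not match it):

```c
#include <complex.h>
#include <math.h>

/* Fill twid[n1*M + k2] = exp(-2*pi*i * n1*k2 / (N*M)) for the forward
 * transform (flip the sign of the exponent for the inverse).  N is the number
 * of length-M sub-transforms; twid has N*M entries.  Hypothetical helper,
 * not the gettwiddle() in the CooleyTukeyMPI repository. */
void fill_twiddle(double complex *twid, int N, int M)
{
    const double pi = acos(-1.0);
    for (int n1 = 0; n1 < N; n1++)
        for (int k2 = 0; k2 < M; k2++)
            twid[(size_t)n1 * M + k2] =
                cexp(-2.0 * pi * I * (double)n1 * (double)k2 / (double)(N * M));
}
```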

djhaxton commented 9 years ago

This is fascinating... I am almost there, but not quite... Somehow I translated my even-parity vector by one unit... I am trying to do an inverse Fourier transform of out-of-place data... see the README.

https://github.com/djhaxton/CooleyTukeyMPI/releases/tag/v1.11

djhaxton commented 9 years ago

MPI Cooley Tukey done. Now for some timings.

https://github.com/LBNL-AMO-MCTDHF/V1/commit/f3a8eb707a1df250ed210e4348c02bf08772e244

djhaxton commented 9 years ago

See also

https://github.com/LBNL-AMO-MCTDHF/V1/commit/24350914e60c3e8272894f8e1fd7bafbdba80edd

regarding the Cooley-Tukey implementation.

djhaxton commented 9 years ago

Hi Zach,

If you could follow up on what you wrote

FFTW implements Cooley Tukey using mpi

it would be great... I am reading the link you gave,

http://www.fftw.org/parallel/parallel-fftw.html

and I am seeing a disagreement. Reading the section "Multi-dimensional parallel FFT," that link seems to indicate that FFTW implements the multidimensional FFT using the all-to-all transposition-of-indices method -- the first method that was programmed up, what Thorsten and Khaled told me to do, and what is still performing better than MPI Cooley-Tukey for the 3D FFT. I have not yet posted timings, but they are not good, for the obvious reason. From the link you gave:

"The hardest part of the multi-dimensional, distributed FFT is the transpose."

So, it seems that FFTW does it the way we have been doing it from the start per Khaled and Thorsten, with three all-to-alls and completely local one dimensional FFTs, not MPI Cooley-Tukey.

The downside to MPI Cooley-Tukey is the bulk of communication. The messages are bigger than with the 3 x (1D FFT, all-to-all index permutation) method; there are the same number of them, but they are not all sent at once. The same amount of information has to be communicated at any one time, but with MPI Cooley-Tukey it can be exchanged just between neighbors. So I believe it is only suited to a ring topology -- every node connected to two neighbors, making a circle -- whereas we have all the connections between all the processors. So maybe with some specific hardware, or with a more sophisticated MPI setup (there are things called Cartesian communicators that might do the trick), MPI Cooley-Tukey could beat three all-to-alls. One of us can go ask Thorsten or Khaled about this.
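The Cartesian-communicator machinery does exist in plain MPI; a minimal ring setup (generic MPI C, nothing from this code) looks like this:

```c
#include <mpi.h>
#include <stdio.h>

/* Build a periodic 1D (ring) Cartesian communicator and find the two
 * neighbors each rank would talk to in a nearest-neighbor scheme. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[1]    = { nprocs };
    int periods[1] = { 1 };          /* periodic: last rank wraps to first */
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1 /* allow reorder */,
                    &ring);

    int rank, left, right;
    MPI_Comm_rank(ring, &rank);
    MPI_Cart_shift(ring, 0, 1, &left, &right);   /* neighbors in the ring */

    printf("rank %d: left neighbor %d, right neighbor %d\n", rank, left, right);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```

Allowing reorder gives the MPI implementation a chance to map ring neighbors onto physically nearby nodes, which would be the whole point of the exercise.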

But for now, I am leaving the default fft_mpi_inplaceflag=1. This invokes the original all-to-all method, whereas fft_mpi_inplaceflag=0 selects the new MPI Cooley Tukey subroutines (which do an out-of-place fourier transform and out-of-place inverse). With faster communication, the code could run perhaps ten times faster for large calculations, in which the communication does appear to be taking about 90% of the time. But perhaps we will yet iron out some more issues related to heaps and stacks and process affinity and NUMA or not NUMA and whatnot that come into play in an OpenMP/MPI implementation, and get the factor of five speedup that will be satisfactory, when communication and computation are equally balanced, making MPI Cooley-Tukey or Intel MPI FFT superfluous, at which time we can close this issue.

djhaxton commented 9 years ago

Oh... more fundamentally... since MPI Cooley Tukey only has to communicate with neighbors, does it possess what I have been calling "strong scaling"?

Checking Wikipedia, I see that I might have a nomenclature problem with this "strong scaling." I have been using the term for the situation in which the problem size per processor is fixed, the number of processors increases linearly with the total size of the problem (here n^3, where n is numpoints, the number of points on a side), and communication never takes more time than computation.

With an all-to-all method, this is not likely to be the case. Because of the number of different processors that must be reached by any given processor (nprocs-1) at one time, the communication always overwhelms computation when the number of processors gets large.
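A back-of-envelope latency/bandwidth estimate (the usual alpha-beta model; my own rough accounting, not a measurement) makes this concrete. With P processors, V = n^3/P points per processor, per-message latency alpha, time per transferred word beta, and time per flop gamma:

$$
T_{\text{comm}} \;\approx\; 3\,[\,\alpha (P-1) + \beta V\,],
\qquad
T_{\text{comp}} \;\approx\; \gamma\, V \log n .
$$

At fixed V, the alpha (P-1) term keeps growing with P while the computation per processor does not, so the three-all-to-all method eventually becomes communication-bound.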

But with mpi Cooley-Tukey, done right, I wonder, does the fact that each processor only needs to communicate with two others at any one time mean that the number of processors can be increased arbitrarily?

In other words, does 3D MPI Cooley-Tukey have what I have been calling "strong scaling," whereas the 3 x all-to-all method for the 3D FFT does not? If so, does Intel MKL figure this out when it plans the 3D MPI FFT? Does FFTW figure this out? (FFTW_EXHAUSTIVE)

Now, my problem with nomenclature:

http://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling

What I have been calling "strong scaling," because I thought it is what Thorsten calls "strong scaling," is what Wikipedia would call favorable "weak scaling" in terms of communication versus computation.

djhaxton commented 9 years ago

MPI Cooley-Tukey with MPI recursion is installed and working, and will be tested for the v1.03 tag. Toepflag=2 is not working with it yet, and might not be for v1.03. The kinetic energy operator with toepflag=2 is hardwired for this MPI Cooley-Tukey option. Perhaps it won't give a speedup and toepflag=2 will be deprecated, but I'm going to give toepflag=2 an honest try with this new method.

With toepflag=1 or 2, MPI Cooley-Tukey (rather than the all-to-all method) is selected for the triple Toeplitz matvec (the two-electron operator) by setting fft_mpi_inplaceflag=0; the default, fft_mpi_inplaceflag=1, is all-to-all.

Consider 3^5=243 processors, with four steps or five levels of Cooley-Tukey MPI recursion.

Each processor sends its whole block to two other processors, five times.

There is a greater bulk of communication relative to three all-to-alls (10:3). But that bulk is sent in a total of 10 x 243 messages, not 3 x 242 x 243.

This has got to win in some situations... certainly for a small primitive DVR basis size and a large number of processors (good for the coming 1D version), but I am hoping that it will work in the large-DVR-basis limit too, avoiding the stochastic behavior that I have observed when attempting to pass many large messages on Lawrencium. Here is a 399-point calculation on methane:

    F.T.     2mpi  dot   ALL      ft_copy  ft_conjg  ft_ft1d  ft_tr  ft_mpi
    310470   2605  2724  325025   3749     3422      54034    25186  202093
    277780   2568  2910  292468   3697     3373      53085    25033  170697
    663402   2387  2921  677923   3698     3345      53037    24972  556445
    1043979  2284  2920  1058177  3703     3351      53169    25022  936763
    229926   2391  2930  244503   3486     3355      53360    25017  122745

djhaxton commented 9 years ago

Okay. Here is the situation on Lawrencium. It appears that, using small values (zero or one, NOT TWO) for fft_circbatchdim and fft_batchdim -- just make them equal; choose 1 usually, or 2 for small calculations -- the all-to-all method (fft_mpi_inplaceflag=1) can perform as advertised, without stochastic screwups like those for which there is evidence immediately above.

And, on Lawrencium, when things are working fine, the time taken in communication is roughly proportional to the bulk of communication required. In other words, there is a certain bandwidth on Lawrencium and a certain processor speed, and with the Fourier transform done in parallel, most of the time taken to do the Fourier transform is communication. Because the time is proportional to the bulk of communication when things are running well (which for large calculations means fft_batchdims = 0 or 1), the number of messages doesn't matter, and all-to-all (fft_mpi_inplaceflag=1) beats MPI Cooley-Tukey (fft_mpi_inplaceflag=0). With 128 points and 7 layers of MPI Cooley-Tukey, the ratio in bulk of communication is 7:3, Cooley-Tukey to all-to-all.

For the moment, all-to-all appears to beat MPI Cooley-Tukey in both the small and large primitive basis size limits, but I bet we can turn the tables in the small-basis limit, the limit in which the size of the messages is irrelevant and only their number matters -- here 7 x 128 versus 3 x 128 x 127, Cooley-Tukey versus all-to-all. Here are timings for cubane Hartree-Fock, 28 orbitals. It is close: a lot closer than 7:3.

    ==> timing.neutral.alltoall/twoematel.time.dat <==
    F.T.   2mpi  dot    ALL    ft_copy  ft_conjg  ft_ft1d  ft_tr  ft_mpi  ft_copy  ft_circ
    19024  195   10412  30434  315      510       1655     3114   11134   1971     323
    17085  261   10392  28536  301      477       2009     3358   8605    2001     332
    17006  329   10427  28569  303      441       2434     3334   8281    1888     323
    17002  330   10412  28556  292      442       2422     3312   8313    1887     326

    ==> timing.neutral.ct/twoematel.time.dat <==
    F.T.   2mpi  dot    ALL    ft_conjg  ft_twid  ft_mpict  ft_raise  ft_tmult  ft_ft3d  ft_circ
    31631  150   10189  42933  0         0        0         0         0         0        323
    32341  153   10161  43536  0         0        0         0         0         0        326
    32579  151   10147  43676  0         0        0         0         0         0        322
    32534  148   10136  43634  0         0        0         0         0         0        327
    31593  147   10171  42760  0         0        0         0         0         0        332