Today I had some time to look into this issue. The problem was the memory buffers allocated by the cufft plans.
First things first, I removed the creation of fplan_x and iplan_x. These plans are not used when I only perform forward and backward 3D FFTs.
For example, I run on 4 P100 GPUs: srun -n 4 ./step2_gpuf 1023 1023 511
If we look at memory usage on one GPU:
Memory allocated before plan creation: 805.000000
Memory allocated after fplan0 = 4925.000000
Memory allocated after iplan0 = 9017.000000
Memory allocated after fplany = 9033.000000
Memory allocated after iplany = 9049.000000
Memory allocated after fplan1 = 9051.000000
Memory allocated after fplan2 = 9563.000000
Memory allocated after 1d transp = 11101.000000
These are all in MB. Had I allowed the creation of the plans along x, an abort would have been triggered by the GPUs running out of memory. After experimenting with multiple domain sizes, the amount of workspace memory claimed by cufft seems to vary a lot.
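For reference, numbers like the ones above can be obtained by querying free device memory around each plan creation. A minimal sketch, assuming cudaMemGetInfo is used for the probe (the helper name report_used_MB is mine, not part of accfft):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Print the currently used device memory in MB: used = total - free.
static void report_used_MB(const char* label)
{
  size_t free_b = 0, total_b = 0;
  cudaMemGetInfo(&free_b, &total_b);
  std::printf("Memory allocated %s = %f\n",
              label, (double)(total_b - free_b) / (1024.0 * 1024.0));
}

// Usage, e.g. around plan creation:
//   report_used_MB("before plan creation");
//   /* create fplan0 with cufftPlanMany(...) */
//   report_used_MB("after fplan0");
```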
With this setup I get the following timings:
L1 Error of iFF(a)-a: 1.86982
Relative L1 Error of iFF(a)-a: 1.28264e-06
Results are CORRECT! (upto single precision)
GPU Timing for FFT of size 1023*1023*511
Setup 3.04621
FFT 0.324645
IFFT 0.228852
I pushed some changes to my fork that replace all cufft calls batched over two dimensions with cufft calls batched over one dimension only. In practice, in accfft_execute_gpu, every cufftExec call is now wrapped in a for loop over the slowest index.
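A rough sketch of the idea (not the exact code in my fork; the sizes n0, n1, n2 and the in-place, contiguous-z layout are assumptions): instead of one plan with batch = n0*n1, create a plan batched over n1 only and loop over the slowest index n0.

```cpp
#include <cufft.h>
#include <cstddef>

// 1D complex-to-complex FFT along the contiguous z axis of an n0 x n1 x n2
// array, batched over n1 only, with an explicit loop over the slowest index n0.
void fft_z_batched_over_y(cufftComplex* data, int n0, int n1, int n2)
{
  cufftHandle plan1d;
  int n[1] = {n2};                  // transform length along z
  // batch = n1 (instead of n0*n1) keeps the cufft workspace much smaller
  cufftPlanMany(&plan1d, 1, n,
                NULL, 1, n2,        // inembed, istride, idist
                NULL, 1, n2,        // onembed, ostride, odist
                CUFFT_C2C, n1);

  for (int i0 = 0; i0 < n0; ++i0) { // loop over the slowest index
    cufftComplex* slab = data + (size_t)i0 * n1 * n2;
    cufftExecC2C(plan1d, slab, slab, CUFFT_FORWARD);  // in-place transform
  }
  cufftDestroy(plan1d);
}
```

(In the real code the plan would of course be created once during setup and reused at every execution.)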
The result is the following:
Memory allocated before plan creation: 805.000000
After fplan0: 849.000000
After iplan0: 865.000000
After fplany: 881.000000
After iplany: 897.000000
After fplan1: 899.000000
After fplan2: 901.000000
After 1d transp: 2439.000000
The timings I get are:
L1 Error of iFF(a)-a: 1.86982
Relative L1 Error of iFF(a)-a: 1.28264e-06
Results are CORRECT! (upto single precision)
GPU Timing for FFT of size 1023*1023*511
Setup 3.03647
FFT 0.331879
IFFT 0.237666
A bit slower, as expected, but for most purposes it's negligible.
In the end, this way I can solve much larger problems on the same hardware. The main remaining memory bottleneck is the buffer allocated for the transposition.
Hi again,
I'm not opening a pull request, but my recent changes to accfft may help other users. If you have the time, you might want to have a look at my fork.
By avoiding 2D-batched cufft plans and changing how memory is allocated for the transpose, I reduce the memory allocated by accfft on the GPU by a factor of 8. On our local machines, for example, this means I can double the size of the domain in each direction without running out of memory.
Thanks for sharing this library, btw.
Hi Novatig,
I am not sure I understood your question. We need three plans, for the x, y, and z directions, along with the corresponding backward ones. So when you say you removed fplan_x and iplan_x, how do you perform the FFTs along the x direction?
An efficient method to solve Poisson equations with unbounded boundary conditions was proposed by Hockney and Eastwood and can be found in their book Computer Simulation Using Particles. At its core, the method requires padding the domain and performing a convolution in Fourier space. If the initial domain is Nx x Ny x Nz, the padded domain has size (2Nx-1) x (2Ny-1) x (2Nz-1).
The problem is that, for example when the domain sizes are powers of two, the padded domains have awkward sizes (i.e. 3, 7, 15, 31, 63, 127, 255, 511, 1023).
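Just to make the connection with the sizes tested below explicit, a minimal sketch of the arithmetic (nothing accfft-specific here):

```cpp
#include <cstdio>

// Padded size per direction for the free-space (Hockney-Eastwood) solver: 2N-1.
int main()
{
  for (int N = 2; N <= 1024; N *= 2)
    std::printf("N = %4d  ->  padded size 2N-1 = %4d\n", N, 2 * N - 1);
  return 0;
}
```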
When I try to create plans with these domain sizes with accfft, I get unreliable results:
mpirun -n 4 step1_gpuf 1024 1024 1024
output: Results are correct
mpirun -n 4 step1_gpuf 1023 1023 1023
output: "CUFFT error: iplan_x creation failed 2", followed by a seg fault (same for 1024x1024x1023 and any permutation)
mpirun -n 4 step1_gpuf 1023 1023 511
output: "CUFFT error: iplan_x creation failed 2", followed by a seg fault
mpirun -n 4 step1_gpuf 1023 1023 255
output: Results are correct
And so on, it should be easy to replicate this error.
Would it be possible to fix accfft to support transforms on such domains?
(Also, possibly related: when I run with larger domain sizes, e.g. 2048 1024 1024, I get a seg fault, probably because the total number of grid points exceeds 2^31-1.)
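For what it's worth, 2048*1024*1024 = 2^31, which no longer fits in a signed 32-bit integer. A minimal sketch of the suspected problem, assuming some index or size inside the library is computed in plain int arithmetic:

```cpp
#include <climits>
#include <cstdio>

int main()
{
  int n0 = 2048, n1 = 1024, n2 = 1024;
  long long npoints = (long long)n0 * n1 * n2;   // 2^31 = 2147483648
  std::printf("grid points = %lld, INT_MAX = %d\n", npoints, INT_MAX);
  // If the same product were formed in 32-bit int arithmetic (e.g. int idx = i*n1*n2 + ...),
  // it would overflow, which could explain the seg fault for this size.
  return 0;
}
```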