eth-cscs / SpFFT

Sparse 3D FFT library with MPI, OpenMP, CUDA and ROCm support
BSD 3-Clause "New" or "Revised" License
48 stars 11 forks source link

Problems to call spfft_multi_transform_* routines #58

Closed fstein93 closed 3 months ago

fstein93 commented 3 months ago

Dear developers, I am currently attempting to use SpFFT (version 1.0.6) to offload FFTs to GPUs in a Fortran code (CP2K). In our code, there is the need to perform several FFTs at once, usually 1D and 2D FFTs which I map to 3D FFTs. I am able to employ the routines spfft_transform_forward and spfft_transform_backward with something like

Now, I am migrating to the spfft_multi_transform_*routines. Currently, I am only trying the CPU version for testing. A boiled down version of my new code for the forward transform looks as followed

! Input:
INTEGER, INTENT(IN) :: number_of_ffts
INTEGER(3), INTENT(IN) :: fft_dimension
COMPLEX(8), DIMENSION(PRODUCT(fft_dimension)*number_of_ffts), TARGET :: input, output

! Variables for FFTs with SpFFT
TYPE(C_PTR), DIMENSION(:), ALLOCATABLE, TARGET :: spfft_transform_multi, outputPointers
INTEGER(C_INT), DIMENSION(:), ALLOCATABLE, TARGET :: spFFTLocations
INTEGER :: number_of_threads
INTEGER(C_INT) :: errorCode
INTEGER, DIMENSION(:), ALLOCATABLE :: indices
TYPE(C_PTR) :: spfft_ptr
COMPLEX(C_DOUBLE_COMPLEX), DIMENSION(:), POINTER :: complex_ptr

! Setup index maps in indices
...

! Determine number_of_threads
...

! Setup array of transformations
ALLOCATE(spfft_transform_multi(0:number_of_ffts-1)
DO i = 0, plan%m - 1
  errorcode = spfft_transform_create_independent(spfft_transform_multi(i), number_of_threads, SPFFT_PU_HOST, &
                                                           SPFFT_TRANS_C2C, fft_dimension(3), fft_dimension(2), fft_dimension(1), &
                                                           PRODUCT(fft_dimension), SPFFT_INDEX_TRIPLETS, indices)
  IF (errorcode /= SPFFT_SUCCESS) CPABORT("Error in SPFFT!")
END DO

DO i = 0, number_of_ffts- 1
  ! Collect pointers to first elements of the output
  outputPointers(i) = C_LOC(output(i*number_of_elements_to_be_transformed))

  ! set space domain array to use memory allocted by the library
  errorCode = spfft_transform_get_space_domain(spfft_transform_multi(i), spfftLocations(i), spfft_ptr)
  IF (errorCode /= SPFFT_SUCCESS) ERROR STOP

  CALL C_F_POINTER(spfft_ptr, complex_ptr, fft_dimension)

  ! the input array has to be transposed first
  complex_ptr(:) = input(i::number_of_ffts)
END DO

! Perform actual FFT
! Crash occurs here inside of the routine
errorCode = spfft_multi_transform_forward(number_of_ffts , C_LOC(spfft_transform_multi), &
                                                    C_LOC(spfftLocations), C_LOC(outputPointers), C_LOC(spfftScalings)
IF (errorCode /= SPFFT_SUCCESS) ERROR STOP

I compiled the code turning on several debug flags with GCC13 and I obtain a segfault in SpFFT (spfft.F is the file in which I am actually working)

LeakSanitizer:DEADLYSIGNAL ==358529==ERROR: LeakSanitizer: SEGV on unknown address 0x000000000060 (pc 0x0000037714d0 bp 0x7ffe17ffeba0 sp 0x7ffe17ffeb68 T0) ==358529==The signal is caused by a READ memory access. ==358529==Hint: address points to the zero page. LeakSanitizer:DEADLYSIGNAL ==358530==ERROR: LeakSanitizer: SEGV on unknown address 0x000000000060 (pc 0x0000037714d0 bp 0x7ffcd8239f90 sp 0x7ffcd8239f58 T0) ==358530==The signal is caused by a READ memory access. ==358530==Hint: address points to the zero page.

0 0x37714d0 in spfft::ExecutionHost::space_domain_data() /home/fstein/cp2k/cp2k/tools/toolchain/build/SpFFT-1.0.6/src/execution/execution_host.cpp:356

#⁠⁠1 0x376a6bc in spfft::TransformInternal<double>::space_domain_data(SpfftProcessingUnitType) /home/fstein/cp2k/cp2k/tools/toolchain/build/SpFFT-1.0.6/src/spfft/transform_internal.cpp:355
#⁠2 0x376b45b in spfft::MultiTransformInternal<spfft::Transform>::forward(int, spfft::Transform*, SpfftProcessingUnitType const*, double* const*, SpfftScalingType const*) /home/fstein/cp2k/cp2k/tools/toolchain/build/SpFFT-1.0.6/src/spfft/multi_transform_internal.hpp:56
#⁠3 0x376b45b in spfft::multi_transform_forward(int, spfft::Transform*, SpfftProcessingUnitType const*, double* const*, SpfftScalingType const*) /home/fstein/cp2k/cp2k/tools/toolchain/build/SpFFT-1.0.6/src/spfft/multi_transform.cpp:40
#⁠4 0x376b45b in spfft_multi_transform_forward /home/fstein/cp2k/cp2k/tools/toolchain/build/SpFFT-1.0.6/src/spfft/multi_transform.cpp:72
#⁠5 0x30a8f89 in __spfft_lib_MOD_spfft_2d /home/fstein/cp2k/cp2k/src/pw/fft/spfft_lib.F:317
#⁠6 0x30a22b2 in __fft_lib_MOD_fft_2d /home/fstein/cp2k/cp2k/src/pw/fft/fft_lib.F:355
#⁠7 0x2f6bd0f in __fft_tools_MOD_fft3d_ps /home/fstein/cp2k/cp2k/src/pw/fft_tools.F:801
#⁠8 0x2fdc7da in __pw_methods_MOD_fft_wrap_pw1pw2_r3d_c1d_rs_gs /home/fstein/cp2k/cp2k/src/pw/pw_methods.F:1395
#⁠9 0xb03c39 in __qs_collocate_density_MOD_calculate_rho_core_c1d_gs /home/fstein/cp2k/cp2k/src/qs_collocate_density.F:916
#⁠10 0x17dc6e7 in __qs_update_s_mstruct_MOD_qs_env_update_s_mstruct /home/fstein/cp2k/cp2k/src/qs_update_s_mstruct.F:104
#⁠11 0xa68b99 in qs_energies_init_hamiltonians /home/fstein/cp2k/cp2k/src/qs_energy_init.F:316
#⁠12 0xa694e0 in __qs_energy_init_MOD_qs_energies_init /home/fstein/cp2k/cp2k/src/qs_energy_init.F:111
#⁠13 0xb9968d in __qs_energy_MOD_qs_energies /home/fstein/cp2k/cp2k/src/qs_energy.F:95
#⁠14 0x1e35a86 in qs_forces /home/fstein/cp2k/cp2k/src/qs_force.F:200
#⁠15 0x1e39bb1 in __qs_force_MOD_qs_calc_energy_force /home/fstein/cp2k/cp2k/src/qs_force.F:112
#⁠16 0xfbe5df in __force_env_methods_MOD_force_env_calc_energy_force /home/fstein/cp2k/cp2k/src/force_env_methods.F:259
#⁠17 0xb3b328 in __f77_interface_MOD_calc_energy_force /home/fstein/cp2k/cp2k/src/f77_interface.F:1355
#⁠18 0x6ae3d0 in cp2k_calc_energy_force /home/fstein/cp2k/cp2k/src/start/libcp2k.F:377
#⁠19 0x6ace7d in main /home/fstein/cp2k/cp2k/src/start/libcp2k_unittest.c:79
#⁠20 0x7fab4927cd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#⁠21 0x7fab4927ce3f in __libc_start_main_impl ../csu/libc-start.c:392
#⁠22 0x6ac954 in _start (/home/fstein/cp2k/cp2k/exe/local/libcp2k_unittest.pdbg+0x6ac954)```

I am running with two MPI ranks and a single thread per rank.

The actual code is found at here for the transformation creation and here for the actual transformation. The preceding commit where I used a loop over spfft_transform_forward worked in the CPU version.

I have also tried variants of the calls to the transformation routine in which I indexed the first element directly (C_LOC(pointer(0)) instead of C_LOC(pointer)) or where I dropped the C_LOC function where possible. I haven even added my own interface to the C routine where I used arrays instead of plain TYPE(C_PTR).

Do you have an idea what is going on or do you need more information?

AdhocMan commented 3 months ago

I've looked into the issue and found at least one bug in the C and Fortran interface of the multi-transform functions. It's ultimately an internal pointer issue when converting from C / Fortran transform pointers to C++. This fix is in #59 and has been merged into the develop branch. There could be other issues, but hopefully it will work for you.

Some general notes for your use case: It looks like you are computing dense 2D transforms. Using SpFFT for this use case will come with some overhead, as it really hasn't been designed for it. You might be better off writing a thin wrapper around cuFFT for memory transfer to / from host memory. Also, FFTs are typically so fast, that copying to / from device memory can easily outweigh any benefit you might get from using GPUs. So if your input and output are not already located in device memory, I'd expect there to be little benefit unless you have very large input sizes.

fstein93 commented 3 months ago

Thank you. I will try it later even if it may not lead to an acceleration. Regarding our use case, I was attempting to accelerate local FFTs first. Several 1D and all 2D FFTs will become obsolete in favor of distributed 3D FFTs and sparse FFTs.

fstein93 commented 3 months ago

I have just tried it locally on CPU and the tests pass.

AdhocMan commented 3 months ago

Great, Thanks for reporting the issue!

In case it helps, here are some details about the multi-transform implementation: It's designed to achieve better performance through overlap of asynchronous operations. This includes memory transfers to / from device memory and non-blocking MPI calls. Therefore, it heavily depends on the MPI implementation on your system, the specific problem size and the type of memory (host, pinned or device memory) . You may use it to utilize CPU and GPU at the same time, by using transforms with different processing units set (assuming input / output is in host memory). We've measured up to a 20% runtime reduction compared to processing each transform individually, but there are also cases, where it can cause a slight performance penalty (potentially due to cache effects).

I'll close this for now, but feel free to reopen it if there are still related problems.