Improve performance of FFTDirichletExpanded

This PR improves the performance of FFTDirichletExpanded by combining the GPU kernels that run between the FFTs into a single kernel. This removes the temporary field, which reduces memory usage and the memory bandwidth spent writing and re-reading that field. It also reduces kernel launch overhead, which matters most at small resolutions.
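
To illustrate the idea, here is a minimal CPU-side sketch of kernel fusion in plain C++. It is not the code from this PR: `step_a`, `step_b`, `two_pass`, and `fused` are hypothetical names, and the two elementwise steps are placeholders for whatever work is actually done between the FFTs. The sketch only shows why merging the passes removes the temporary buffer and one full read and write of it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical elementwise steps standing in for the work between the FFTs.
inline double step_a (double x, double factor) { return x * factor; }
inline double step_b (double x) { return -x; }

// Unfused version: two passes (two kernel launches on a GPU) and a temporary
// buffer that is written in full by the first pass and read by the second.
std::vector<double> two_pass (const std::vector<double>& in,
                              const std::vector<double>& factor) {
    assert(in.size() == factor.size());
    std::vector<double> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        tmp[i] = step_a(in[i], factor[i]);
    }
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = step_b(tmp[i]);
    }
    return out;
}

// Fused version: one pass (one kernel launch), no temporary buffer. The
// intermediate value stays in a local variable instead of going through memory.
std::vector<double> fused (const std::vector<double>& in,
                           const std::vector<double>& factor) {
    assert(in.size() == factor.size());
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double intermediate = step_a(in[i], factor[i]);
        out[i] = step_b(intermediate);
    }
    return out;
}
```

Both functions produce the same output; the fused one allocates one buffer fewer and touches each element's intermediate value only in a local variable.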

[ ] Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
[ ] Tested (describe the tests in the PR description)
[ ] Runs on GPU (basic: the code compiles and runs well with the new module)
[ ] Contains an automated test (checksum and/or comparison with theory)
[ ] Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
[ ] Constified (All that can be const is const)
[ ] Code is clean (no unwanted comments)
[ ] Style and code conventions (listed at the bottom of https://github.com/Hi-PACE/hipace) are respected
[ ] Proper label and GitHub project, if applicable

Improve performance of FFTDirichletExpanded #1111