Fix issues with cmake and merge single/double precision codes.

These changes replace the accfft_plan/accfft_plan_gpu structs with new AccFFT/AccFFT_gpu classes and use templates to combine the float and double precision code. The result is a smaller codebase that is easier to maintain. All tests appear to pass.
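To illustrate the general idea of merging two precision-specific code paths into one template, here is a minimal sketch. The class `PlanSketch` and its members are hypothetical and are not taken from the AccFFT sources. They only show how a single class body can be instantiated for both `float` and `double`.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a templated plan class. The name and members are
// illustrative only and do not reflect the actual AccFFT class.
template <typename T>
class PlanSketch {
    static_assert(std::is_floating_point<T>::value,
                  "T must be a floating-point type");
public:
    // Allocate n elements, each initialized to 1.
    explicit PlanSketch(std::size_t n) : data_(n, T(1)) {}

    // Written once, compiled separately for each precision.
    void scale(T factor) {
        for (T& x : data_) x *= factor;
    }

    // Sum of all elements, accumulated in precision T.
    T sum() const {
        T s = T(0);
        for (T x : data_) s += x;
        return s;
    }

    // Storage size depends on the precision chosen at instantiation.
    std::size_t bytes() const { return data_.size() * sizeof(T); }

private:
    std::vector<T> data_;
};

// Aliases keep short names available for each precision (assumed naming).
using PlanSketchF = PlanSketch<float>;
using PlanSketchD = PlanSketch<double>;
```

With this layout, a fix made in `scale` or `sum` applies to both precisions at once, which is the maintenance benefit described above.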

Note that the diffs between accfft_gpu and accfft_gpuf show several calls to cudaDeviceSynchronize that appeared only in accfft_gpu. These have been commented in the combined code and marked "DOUBLE ONLY".
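For readers wondering how a precision-specific call could live inside a templated code path, the following is one possible sketch and not what this PR does. It assumes a policy of synchronizing only for `double`. `device_synchronize_stub` is a hypothetical placeholder standing in for cudaDeviceSynchronize so the example runs without a GPU, and `finish_step_sketch` is likewise an invented name.

```cpp
#include <type_traits>

// Counts calls to the stand-in below so the behavior can be observed.
int g_sync_calls = 0;

// Hypothetical placeholder for cudaDeviceSynchronize(); no GPU is involved.
inline void device_synchronize_stub() { ++g_sync_calls; }

// Hypothetical end-of-step hook shared by both precisions.
template <typename T>
void finish_step_sketch() {
    // DOUBLE ONLY: synchronize only when T is double (assumed policy).
    if (std::is_same<T, double>::value) {
        device_synchronize_stub();
    }
}
```

Calling `finish_step_sketch<float>()` leaves the counter untouched, while `finish_step_sketch<double>()` increments it once. Whether the synchronization is actually required for either precision is a separate question that this sketch does not answer.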

amirgholami / accfft

Fix issues with cmake and merge single/double precision codes. #24