"Free" (or at least cheap) hardware acceleration via OpenMP

The hybrid OpenMP/MPI branch (#1628) uses OpenMP directives (e.g. #pragma omp parallel for) to parallelize the computation.

OpenMP has supported offloading to hardware accelerators (e.g. GPUs) since version 4.0 (2013), with substantial extensions in 4.5 and 5.0 (2018), so the same computation can be offloaded with very little modification to the existing compiler directives. We would just have to make sure that data meant to live on the accelerator stays there long enough to amortize the cost of host-device transfers.
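To make the "very little modification" point concrete, here is a minimal sketch of the same loop written for the host and for an accelerator. The update rule, the function names update_e_host and update_e_offload, and the 1D layout are hypothetical illustrations, not Meep's actual kernels. Built without offload support (or without OpenMP at all), the second version simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D update, NOT Meep's kernel: e[i] += c * (h[i+1] - h[i]).
// Host version, in the style of a plain "parallel for" loop.
void update_e_host(std::vector<double> &e, const std::vector<double> &h, double c) {
  assert(h.size() == e.size() + 1);
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(e.size());
  double *ep = e.data();
  const double *hp = h.data();
#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) ep[i] += c * (hp[i + 1] - hp[i]);
}

// Offload version: same loop body, different directive plus explicit map clauses.
// map(tofrom: ...) copies e to the device and back; map(to: ...) copies h in only.
void update_e_offload(std::vector<double> &e, const std::vector<double> &h, double c) {
  assert(h.size() == e.size() + 1);
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(e.size());
  double *ep = e.data();
  const double *hp = h.data();
#pragma omp target teams distribute parallel for map(tofrom: ep[0:n]) map(to: hp[0:n + 1])
  for (std::ptrdiff_t i = 0; i < n; ++i) ep[i] += c * (hp[i + 1] - hp[i]);
}
```

Note that this version transfers both arrays on every call, which is exactly the communication cost that would need to be amortized by keeping the fields resident on the device.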

For example, we could create a function called run_until(n) that timesteps for n steps without interruption (currently, run(until=n) calls back into Python on each iteration). All of the timestepping, DFT accumulation, etc. can be performed on the accelerator, and even convergence checks can run there. The main benefit of using an accelerator for FDTD would, of course, be its very high memory bandwidth (FDTD is generally memory-bound, not compute-bound).
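A rough sketch of what such a run_until could look like is given here, assuming a toy 1D leapfrog update and a running sum at one probe point as a stand-in for DFT accumulation. The signature, the update rule, and the probe argument are all hypothetical and do not reflect Meep's API. The point is only the structure: one target data region around the whole time loop, so the fields are copied to the device once and copied back once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, NOT Meep's API: advance a toy 1D scheme n_steps times
// and return the running sum of e[probe] (a stand-in for DFT accumulation).
double run_until(std::vector<double> &e, std::vector<double> &h, double c,
                 int n_steps, std::size_t probe) {
  assert(h.size() == e.size() + 1 && probe < e.size());
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(e.size());
  double *ep = e.data();
  double *hp = h.data();
  double acc = 0.0;

  // Fields and accumulator are transferred once here and once at the closing brace.
#pragma omp target data map(tofrom: ep[0:n], hp[0:n + 1], acc)
  {
    for (int t = 0; t < n_steps; ++t) {
      // The inner map clauses refer to data that is already present on the
      // device, so they do not trigger additional transfers.
#pragma omp target teams distribute parallel for map(tofrom: ep[0:n], hp[0:n + 1])
      for (std::ptrdiff_t i = 0; i < n; ++i) ep[i] += c * (hp[i + 1] - hp[i]);

      // Interior h points only; h[0] and h[n] act as fixed boundary values.
#pragma omp target teams distribute parallel for map(tofrom: ep[0:n], hp[0:n + 1])
      for (std::ptrdiff_t i = 1; i < n; ++i) hp[i] += c * (ep[i] - ep[i - 1]);

      // Accumulate on the device; acc must be mapped explicitly, since scalars
      // are otherwise firstprivate in a target region.
#pragma omp target map(tofrom: acc, ep[0:n])
      acc += ep[probe];
    }
  }
  return acc;
}
```

A convergence check could follow the same pattern as the accumulator: compute a small scalar on the device and copy only that scalar back every so often, rather than the full fields.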

In the past, pursuing hardware acceleration was rather undesirable because it required a custom kernel written against a proprietary API. While directive-level shortcuts have existed for a long time (e.g. OpenACC), there wasn't enough motivation to justify the time sink. However, since we are already working with OpenMP, it might be worth extending (or at least exploring) that functionality to also support basic accelerators.
