Libsharp / libsharp

Library for fast spherical harmonic transforms, see http://arxiv.org/abs/1303.4945
GNU General Public License v2.0

GPU support #18

Closed zonca closed 5 years ago

zonca commented 5 years ago

Is there any plan to have a CUDA version of libsharp?

mreineck commented 5 years ago

I'm not planning to implement this - very big effort, and I don't see sufficient payback in terms of performance.

Are you asking because you are encountering any concrete bottlenecks?

mreineck commented 5 years ago

To be a bit more specific: libsharp SHTs reach > 100 GFlop/s on modern 4-core CPUs; it should be quite hard to beat this by much on GPUs, since we need double precision.

nschaeff commented 5 years ago

Hi, just to give some actual numbers: the CUDA version of the SHTns spherical harmonic transform library achieves around 400+ GFlop/s on a K80 GPU (Kepler), and 1000+ GFlop/s on a P100 GPU (Pascal). These numbers were measured with lmax=1023 and include the transfer from CPU to GPU and back. If your data stays on the GPU, the performance is much higher (numbers not available).

In my opinion, the GPU can achieve good performance for SHT, but once you include the transfer cost it is still comparable to what you can get from current high-end servers (e.g. SkylakeX). If your data stays on the GPU, it may be interesting.

mreineck commented 5 years ago

Thanks for the data points!

This is roughly in the range that I expected. Depending on one's goals, going to GPU can definitely be advantageous, if hardware and development (and maintenance) costs are dominated by other factors. For libsharp, this is not the case; the goal is to stay as minimalistic and portable as possible, without any external dependencies. GPU support is fairly vendor-specific, I think (please correct me if that has changed!), comes with a lot of dependencies, and most GPU owners will not benefit from it, because they only have consumer cards with locked double precision support.

An additional note: if the GFlop numbers you mention are measured with shtns (which I think is highly likely :), they have to be reduced by 30 or 40 percent before comparing them to libsharp numbers, because the two libraries have slightly different methodologies for measuring performance.
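As a rough illustration of that adjustment, the sketch below applies a 30 to 40 percent reduction to the two figures quoted earlier in the thread. The function name is made up for this example, and the quoted figures are lower bounds ("400+", "1000+"), so the results are only indicative.

```python
def libsharp_comparable_range(shtns_gflops):
    """Scale an SHTns-reported GFlop/s figure down by 40% and by 30%.

    Returns (low, high). The 30-40% range comes from the comment above;
    this helper is hypothetical and not part of either library.
    """
    low = shtns_gflops * (100 - 40) / 100   # after a 40% reduction
    high = shtns_gflops * (100 - 30) / 100  # after a 30% reduction
    return low, high


# Figures quoted in the thread (lmax=1023, including CPU<->GPU transfers).
for gpu, reported in [("K80", 400), ("P100", 1000)]:
    low, high = libsharp_comparable_range(reported)
    print(f"{gpu}: {reported}+ reported, roughly {low:.0f} to {high:.0f} after adjustment")
```

Under this assumption the K80 figure lands at roughly 240 to 280 GFlop/s and the P100 figure at roughly 600 to 700 GFlop/s in libsharp's counting.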

mreineck commented 5 years ago

One additional thing to consider:

For grids that have the same number of pixels in all rings, the FFT part of the SHT can be carried out fairly efficiently on GPUs. Healpix has varying pixel numbers per ring (4,8,12,16,... 4*nside), which is probably a performance nightmare for graphics hardware.
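To make the ring structure concrete, here is a minimal sketch that lists the number of pixels per ring for a given nside, following the standard Healpix layout (polar caps growing in steps of 4, equatorial rings at 4*nside). The function name is hypothetical; this is not libsharp or Healpix API code.

```python
def healpix_ring_sizes(nside):
    """Return the pixel count of each iso-latitude ring, north to south."""
    sizes = []
    for ring in range(1, 4 * nside):      # rings are numbered 1 .. 4*nside-1
        if ring < nside:                  # north polar cap: 4, 8, 12, ...
            sizes.append(4 * ring)
        elif ring <= 3 * nside:           # equatorial belt: constant 4*nside
            sizes.append(4 * nside)
        else:                             # south polar cap: mirror of the north
            sizes.append(4 * (4 * nside - ring))
    return sizes


sizes = healpix_ring_sizes(4)
print(sizes)                 # ring lengths for nside=4
print(len(set(sizes)))       # number of distinct FFT lengths needed
print(sum(sizes))            # total pixel count, equals 12*nside**2
```

The point for GPUs is that the number of distinct ring lengths equals nside, so the per-ring FFTs cannot all be handled as one uniform batch of identical length, unlike on a grid with the same number of pixels in every ring.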

nschaeff commented 5 years ago

Yes, all your comments and assumptions are correct.

zonca commented 5 years ago

thanks