flatironinstitute/cufinufft

Nonuniform fast Fourier transforms of types 1 and 2, in 1D, 2D, and 3D, on the GPU

Add support for AMD GPUs via HIP #116

Open · elliottslaughter opened 2 years ago

elliottslaughter commented 2 years ago

This is a work-in-progress branch to add support for AMD GPUs via HIP, AMD's very nearly CUDA-compatible API. While this isn't ready to go yet, it's far enough along that I wanted to get feedback on the approach I'm taking.

The basic idea of HIP is to provide CUDA-compatible APIs that are identical except for naming. In most cases you can find-and-replace cuda with hip and things will work.

Rather than literally applying that transformation to the code (the way AMD seems to want you to do it), I'm using a header wrapper around cuda.h / hip/hip_runtime.h, plus translations of the form #define cudaX hipX for each API call X used in the code. The advantage of going this route is that the vast majority of the code does not need to change at all, at least for a functionally correct version. I haven't looked at performance yet, and there may be more work to do there.
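For illustration, here is a minimal sketch of the wrapper-header idea. The file name and the USE_HIP guard macro are hypothetical, and the real branch covers many more API calls; this just shows the pattern:

```c
// cuda_or_hip.h -- illustrative sketch only; file name and guard macro are hypothetical.
#ifndef CUDA_OR_HIP_H
#define CUDA_OR_HIP_H

#ifdef USE_HIP
  #include <hip/hip_runtime.h>
  // Map the CUDA spellings used throughout the code onto their HIP equivalents.
  #define cudaError_t            hipError_t
  #define cudaSuccess            hipSuccess
  #define cudaMalloc             hipMalloc
  #define cudaFree               hipFree
  #define cudaMemcpy             hipMemcpy
  #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
  #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
  #define cudaDeviceSynchronize  hipDeviceSynchronize
#else
  #include <cuda_runtime.h>
#endif

#endif // CUDA_OR_HIP_H
```

Source files then include this wrapper instead of the CUDA runtime header directly, and the existing cuda-prefixed calls resolve to HIP when the flag is set.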

I also made a few minor changes to the build system, mostly to provide enough configuration flexibility that site files are sufficient to build against HIP. I provide an example site file in sites/make.inc.olcf_spock for Spock.

Status:

I'd appreciate a review of the basic approach I'm using, and if anyone has advice on how to address the remaining test failures, that would be helpful. Thanks!

P.S. This may also be sufficient to get Intel GPUs working via HIPCL, though I haven't tested that.

elliottslaughter commented 2 years ago

Update: I upgraded to ROCm/HIP 4.5.0 (from 4.2.0) and that resolved the issue with symbols. The remaining test failures seem to be some sort of numerical issue.

ahbarnett commented 2 years ago

This seems like a worthy goal. I do not have an AMD GPU to test on, and am frankly out of my depth here, so we will need others on the dev team to test!

Your macro-based approach looks good, at least as a way to get HIP started with minimal code changes. It unifies the headers for the two architectures. (Note that you now always #include the FFT and complex headers where they weren't included previously; I hope this isn't the cause of the math errors.) Certainly better than the search-and-replace approach recommended in blog posts like https://www.lumi-supercomputer.eu/preparing-codes-for-lumi-converting-cuda-applications-to-hip/ !

To complete a PR you would want to add a doc section in README.md, or a linked doc file, about compiling with HIP, so everyone knows how to test what you've added.

Try to keep the makefile changes minimal, enabled by a simple flag, e.g. make check -DHIP.

Now, regarding the math: if you can post a single test command line that fails (you say it's a wrong-answer failure rather than a crash?), maybe someone with an AMD GPU can help us out?

Thanks, Alex

elliottslaughter commented 2 years ago

I've been doing some digging and think I've found the root cause, which seems to be related to the host-side code and not to the GPU at all.

My debugging shows that we're not taking the following branch because MAX_NF is 0.

https://github.com/flatironinstitute/cufinufft/blob/46a8e8a90049f07e9d35826c80db0ff6bac548bd/contrib/common.cpp#L31

MAX_NF, meanwhile, is defined as (BIGINT)1e11, as you can see below (note this is master, not my branch):

https://github.com/flatironinstitute/cufinufft/blob/46a8e8a90049f07e9d35826c80db0ff6bac548bd/contrib/common.h#L12

And BIGINT is defined as int, which is 32 bits in most implementations.

https://github.com/flatironinstitute/cufinufft/blob/46a8e8a90049f07e9d35826c80db0ff6bac548bd/contrib/dataTypes.h#L16

To be honest, I'm not sure how this has ever worked. Compare 1e11 with 1<<32 (which is one past the largest value an unsigned int can hold; a signed int tops out at about half that):

```
>>> 1e11
100000000000.0
>>> 1<<32
4294967296
```

You can see that 1e11 clearly does not fit in 32 bits. You need to get down to about 1e9 (INT_MAX is roughly 2.1e9) before the value actually fits.

I have confirmed that changing 1e11 to 1e9 does make the test pass. So it never had anything to do with the GPU in the first place.

I think the potential solutions are either decreasing MAX_NF to INT_MAX (or similar), or changing BIGINT to a 64-bit type if you really want to be able to represent 1e11 in that variable. Either way, it's not really a HIP issue, so it should probably be fixed in a separate PR.
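To make the problem concrete, here is a small self-contained check (a sketch only; BIGINT32/BIGINT64 are stand-ins for the typedef in contrib/dataTypes.h, and the "observed to yield 0" remark comes from the debugging described above):

```cpp
// overflow_demo.cpp -- sketch of why (BIGINT)1e11 breaks when BIGINT is a 32-bit int.
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // With a 32-bit BIGINT, 1e11 is out of range for the type:
    using BIGINT32 = int32_t;
    std::printf("INT32_MAX = %d, does 1e11 fit? %s\n",
                (int)std::numeric_limits<BIGINT32>::max(),
                1e11 <= (double)std::numeric_limits<BIGINT32>::max() ? "yes" : "no");
    // The cast (BIGINT32)1e11 is therefore undefined behavior; in this case it was
    // observed to produce 0, so the nf < MAX_NF branch was never taken.

    // With a 64-bit BIGINT (as in FINUFFT, which uses int64_t), 1e11 is representable:
    using BIGINT64 = int64_t;
    BIGINT64 max_nf = (BIGINT64)1e11;
    std::printf("64-bit MAX_NF = %lld\n", (long long)max_nf);
    return 0;
}
```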

MelodyShih commented 2 years ago

Hi @elliottslaughter, thanks for reporting this bug and proposing solutions. This is my mistake. In FINUFFT (the CPU version), BIGINT is defined as int64_t, so everything there makes sense.

I will create a PR to fix it.

elliottslaughter commented 2 years ago

For what it's worth, this fix is sufficient to get the entire test suite working on Spock. I'm still seeing issues on Crusher (similar system with newer AMD hardware) that need investigating.

ahbarnett commented 2 years ago

Hi Elliott, that's exciting news regarding getting HIP going. It looks like pulling in the 1D code failed... sorry... but it's a worthwhile part of the code to include. Good luck & thanks for your collaboration on this. -A

elliottslaughter commented 2 years ago

I'm currently working on diagnosing the issue that happens on Crusher with newer AMD hardware. As of my most recent push, the entire test suite works on Spock (with previous-generation AMD hardware).

Once AMD works in this branch, I'll get back to NVIDIA and the other issues identified in this thread.

elliottslaughter commented 2 years ago

@ahbarnett I wanted to follow up on your comment https://github.com/flatironinstitute/cufinufft/pull/134#pullrequestreview-883277905:

> make check ... does not have official pass/fail detection; it's a matter of whether the tests compile and run (no math check is actually done).

Do the Python tests have such checks? I ask because so far I've been using make check for my AMD/HIP tests, and if those aren't checked, it means I haven't actually validated correctness. If the Python tests do check, that may be a path forward, though I'll have to work around the lack of PyCUDA on AMD.
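For reference, a pass/fail math check in this sense is just a relative-error comparison of the computed transform against a trusted reference. A generic sketch of that idea (not cufinufft's actual test code; the arrays and tolerance are made-up examples):

```cpp
// accuracy_check.cpp -- generic sketch of a pass/fail relative-error check.
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

// True if computed matches reference to within rel_tol in the relative l2 sense.
bool passes(const std::vector<std::complex<double>> &computed,
            const std::vector<std::complex<double>> &reference,
            double rel_tol) {
    double err2 = 0.0, ref2 = 0.0;
    for (std::size_t i = 0; i < computed.size(); ++i) {
        err2 += std::norm(computed[i] - reference[i]);  // squared magnitude of difference
        ref2 += std::norm(reference[i]);
    }
    return std::sqrt(err2) <= rel_tol * std::sqrt(ref2);
}

int main() {
    std::vector<std::complex<double>> ref = {{1.0, 0.0}, {0.0, 1.0}};
    std::vector<std::complex<double>> out = {{1.0, 1e-9}, {1e-9, 1.0}};
    std::printf("%s\n", passes(out, ref, 1e-6) ? "PASS" : "FAIL");
    return 0;
}
```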

janden commented 2 years ago

> Do the Python tests have such checks?

They do. I would run python3 -m pytest python/cufinufft/tests to verify correctness.