ROCm / rocFFT

Next generation FFT implementation for ROCm
https://rocm.docs.amd.com/projects/rocFFT/en/latest/

using rocFFT in an OpenCL application #120

Closed pszi1ard closed 2 years ago

pszi1ard commented 6 years ago

What is the expected behavior

What actually happens

How to reproduce

Environment

| Hardware | description |
| -------- | ----------- |
| Any      |             |

| Software | version |
| -------- | ------- |
| ROCK     | 1.7.137 |
| ROCR     | 1.1.7-12-gf0de514 |
| HCC      | 1.2.18063 |
| Library  | git master |

Note that this is a critical dependency for our work on bringing feature-parity with CUDA in the next GROMACS release.

bragadeesh commented 6 years ago

At present, there are no OpenCL bindings for rocFFT, because that would require some form of interop functionality between HIP and OpenCL, which we don't have. There is no way to translate a HIP-created memory buffer into a cl_mem object and vice versa, for example.

The way I see it, there are 2 options.

  1. Since you have CUDA-based code, you could try to hipify it and use HIP as the solution on AMD. This way you could switch to rocFFT, which has a hipFFT interface similar to cuFFT (see the sketch after this list).
  2. If you want to keep the OpenCL infrastructure, clFFT is your best bet, but it is in maintenance mode with known failures; you can try using it on the ROCm stack and let us know what doesn't work for you.
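
For illustration, a minimal sketch of what option 1 could look like through the hipFFT interface for a single-precision 3D R2C transform; the sizes, the buffer handling, and the header location are assumptions made for the example, not GROMACS or rocFFT sample code.

```cpp
// Hedged sketch only: a single-precision 3D real-to-complex transform via hipFFT.
// Sizes and buffer handling are placeholders chosen for the example.
#include <hip/hip_runtime.h>
#include <hipfft.h> // header location assumed; it may differ between ROCm versions

int main()
{
    const int nx = 64, ny = 64, nz = 64;

    // Device buffers: the R2C output needs nz/2 + 1 complex elements in the
    // fastest-varying (last) dimension.
    float*         d_in  = nullptr;
    hipfftComplex* d_out = nullptr;
    hipMalloc(&d_in,  sizeof(float) * nx * ny * nz);
    hipMalloc(&d_out, sizeof(hipfftComplex) * nx * ny * (nz / 2 + 1));

    // Plan and execute a forward real-to-complex 3D transform.
    hipfftHandle plan;
    hipfftPlan3d(&plan, nx, ny, nz, HIPFFT_R2C);
    hipfftExecR2C(plan, d_in, d_out);
    hipDeviceSynchronize();

    hipfftDestroy(plan);
    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```
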
whchung commented 6 years ago

@pszi1ard @bragadeesh I'd like to point out that it's not entirely impossible, but it does require some work in the existing ROCm stack. To get enough attention, perhaps a better place for the ticket would be: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime

No matter which programming language you use for GPU computing on AMD hardware, on the ROCm platform it is eventually compiled into "HSA code objects". The HIP runtime has no trouble loading kernels compiled by the OpenCL compiler, as long as the kernel arguments are properly prepared to match the API and ABI requirements of the HIP runtime. That is basically how the ROCm ports of DNN frameworks such as Caffe / TensorFlow / PyTorch / Caffe2 / MXNet / CNTK are able to load and run the computation kernels in MIOpen, which are mostly written in OpenCL.

On the other hand, as @pszi1ard pointed out, we would probably need to extend the OpenCL runtime on ROCm to be able to load such kernels, prepare their arguments per the HIP application ABI, and invoke them.
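
To make the code-object point above concrete, here is a rough, hypothetical sketch of the direction that already works: the HIP module API loading a kernel from a pre-built code object. The file name, kernel name, and argument layout are made-up placeholders; the point is only the hipModuleLoad / hipModuleLaunchKernel path and the need to pack arguments to match the kernel's ABI.

```cpp
// Hypothetical sketch: loading an externally compiled HSA code object through the
// HIP module API and launching one of its kernels. "kernel.co" and "scale_buf"
// are placeholders, not real rocFFT artifacts.
#include <hip/hip_runtime.h>

int main()
{
    hipModule_t   module;
    hipFunction_t function;
    hipModuleLoad(&module, "kernel.co");                  // placeholder file name
    hipModuleGetFunction(&function, module, "scale_buf"); // placeholder kernel name

    const unsigned int n = 1024;
    float* d_buf = nullptr;
    hipMalloc(&d_buf, n * sizeof(float));

    // Pack the kernel arguments exactly as the kernel's ABI expects them.
    struct { float* buf; unsigned int n; } args = { d_buf, n };
    size_t args_size = sizeof(args);
    void*  config[]  = { HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,
                         HIP_LAUNCH_PARAM_BUFFER_SIZE,    &args_size,
                         HIP_LAUNCH_PARAM_END };

    hipModuleLaunchKernel(function,
                          n / 256, 1, 1, // grid
                          256, 1, 1,     // block
                          0, nullptr,    // shared memory, stream
                          nullptr, config);
    hipDeviceSynchronize();

    hipFree(d_buf);
    hipModuleUnload(module);
    return 0;
}
```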

pszi1ard commented 6 years ago

Thanks for the quick feedback!

To the two options @bragadeesh suggested:

  1. Can't/won't do; we want standards-based, portable code, which is why we're working on OpenCL.
  2. clFFT is broken on Vega, and has questionable performance (from the benchmarks I've seen before).

@whchung

> To get enough attention, perhaps a better place for the ticket would be: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime

What should the ticket state? I assumed that filing an issue/RFE against rocFFT was the right thing to do, as it's rocFFT that should have the bindings to take cl_mem objects as arguments.

To @whchung's further points: what the platform compiles code to does not help us, because we want to develop the application, not the compilers/toolchain, and we want to use OpenCL. From that point of view, while it's just a nuance, the ideal would be for rocFFT to actually have an OpenCL API rather than for the OpenCL runtime to gain some special HIP-capable/compatible extensions.

whchung commented 6 years ago

@pszi1ard What I proposed was to retrieve the kernels within rocFFT and load/launch them with clCreateProgramWithBinary; that should be possible with perhaps just a few tweaks in https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime , which is why I proposed raising the ticket there.
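
For reference, a rough sketch of that route on the OpenCL side: loading a pre-built binary with clCreateProgramWithBinary and creating a kernel from it. The file name and kernel name are hypothetical placeholders, and whether a rocFFT kernel binary can actually be consumed this way, and with what argument ABI, is exactly the open question; error handling is mostly omitted.

```cpp
// Rough sketch only: loading a pre-built binary (e.g. an HSA code object) into an
// OpenCL program with clCreateProgramWithBinary. "kernel.co" and "some_fft_kernel"
// are hypothetical placeholders.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main()
{
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int     err;
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Read the pre-compiled kernel binary from disk.
    std::FILE* f = std::fopen("kernel.co", "rb"); // hypothetical file name
    if (f == nullptr)
        return 1;
    std::fseek(f, 0, SEEK_END);
    size_t length = static_cast<size_t>(std::ftell(f));
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> binary(length);
    if (std::fread(binary.data(), 1, length, f) != length)
        return 1;
    std::fclose(f);

    const unsigned char* bins[] = { binary.data() };
    cl_int               binary_status;
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &length,
                                                   bins, &binary_status, &err);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);

    // The kernel name and its argument layout must match what the compiler that
    // produced the binary generated -- that is the "plumbing" discussed above.
    cl_kernel kernel = clCreateKernel(program, "some_fft_kernel", &err); // hypothetical

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseContext(context);
    return 0;
}
```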

Since it seems your desired goal is to change the rocFFT API so that it takes cl_mem objects as arguments, that nullifies my proposal, and I'll leave it to @bragadeesh and @gstoner to decide the priority of that.

tingxingdong commented 6 years ago

> Can't/won't do; we want standards-based, portable code, which is why we're working on OpenCL.

If by "portable code" you only care about AMD and NVIDIA GPUs (i.e., not Intel, FPGAs, etc.), you can hipify your code. Your HIP code will run on NVIDIA as well, so you do not need to maintain a separate CUDA version; the HIP code will automatically call CUDA for you. But you must have the source code.

https://gpuopen.com/hip-to-be-squared-an-introductory-hip-tutorial/

==================== Like here, you only maintain the *.cpp HIP code. HIP runs it on an NVIDIA GeForce TITAN:

    TITAN1:~/ben/hip/samples/square$ hipcc square.cpp -o square.hip.out
    TITAN1:~/ben/hip/samples/square$ ./square.hip.out
    info: running on device GeForce GTX TITAN X
    info: allocate host mem ( 7.63 MB)
    info: allocate device mem ( 7.63 MB)
    info: copy Host2Device
    info: launch 'vector_square' kernel
    info: copy Device2Host
    info: check result PASSED!

==================== The identical *.cpp code recompiles and runs on an AMD Fiji as well:

    Fiji1:~/hip/samples/square$ hipcc square.cpp -o square.hip.out
    Fiji1:~/hip/samples/square$ ./square.hip.out
    info: running on device Fiji
    info: allocate host mem ( 7.63 MB)
    info: allocate device mem ( 7.63 MB)
    info: copy Host2Device
    info: launch 'vector_square' kernel
    info: copy Device2Host
    info: check result PASSED!
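
For completeness, a minimal single-source sketch in the spirit of the square sample linked above (not the exact tutorial source); the same .cpp compiles with hipcc for both the ROCm and CUDA back ends.

```cpp
// Minimal single-source HIP sketch, illustrative only.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vector_square(float* out, const float* in, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

int main()
{
    const size_t n = 1 << 20;
    std::vector<float> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i)
        h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc(&d_in, n * sizeof(float));
    hipMalloc(&d_out, n * sizeof(float));
    hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const unsigned int threads = 256;
    const unsigned int blocks  = static_cast<unsigned int>((n + threads - 1) / threads);
    hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threads), 0, 0, d_out, d_in, n);

    hipMemcpy(h_out.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("h_out[2] = %f\n", h_out[2]); // expect 4.0

    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```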

pszi1ard commented 6 years ago

@whchung Thanks for clarifying; actually, it was not entirely clear what you suggested, but now that I understand it better, I do think using clCreateProgramWithBinary to load precompiled rocfft kernels could be a sensible route too -- especially if that allows earlier OpenCL support. Hence, I'll file an RFE.

As a side-note, the main question is whether, in the short run, we should just hope that clFFT gets fixed (for Vega) and that its performance is not too bad, or whether there is a chance there will be some form of rocFFT support that is also competitive in performance. I know this is a broader question, but it is the original question that led me here. Note that we plan to release GROMACS code that relies on these FFTs this fall (and we hope it will be more than just functional, but also competitive).

@tingxingdong Short answer: we don't want to hipify because, to be frank, until there is major traction around HIP, it is just technical debt that we would be adding to our code-base. Additionally, we want portability beyond just NVIDIA and AMD.

bragadeesh commented 6 years ago

@pszi1ard just to be clear, what @whchung is suggesting is not directly usable by you. Even if we get such support, considerable rework would still be needed in the rocFFT library to support an OpenCL interface. To use @whchung's idea directly, you would have to take the individual 1D kernels and do all the transposing and copying of data yourself, essentially rewriting about half of the FFT functionality.

We are discussing internally what the best way forward is; we will let you know.

Can you give more info on the problems you are interested in? Is it all 3D FFTs? Single precision? Real or complex? What factors for the sizes (powers of 2, 3, 5, etc.)?

yupinov commented 6 years ago

Hi @bragadeesh and everyone, I'm working on the GROMACS OpenCL implementation together with @pszi1ard. What we are interested in is indeed 3D FFT, real to and from hermitian interleaved, single precision. Each dimension can realistically be from 24 up to 192. We can scale the grid dimensions, so large prime factor support is not very important, though nice to have. Here is the issue I filed recently against clFFT, asking about the ROCm support status: https://github.com/clMathLibraries/clFFT/issues/218
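
For concreteness, this is roughly how such a transform looks through the clFFT plan API today; a sketch assuming an existing OpenCL context, queue, and device buffers, with placeholder lengths from the range above and error checking omitted.

```cpp
// Sketch of the transform described above (single precision, 3D, real <-> hermitian
// interleaved) expressed through the clFFT plan API. in_buf/out_buf are assumed to
// be existing GPU-resident cl_mem buffers of suitable size.
#include <clFFT.h>

void plan_and_run_r2c_3d(cl_context context, cl_command_queue queue,
                         cl_mem in_buf, cl_mem out_buf)
{
    clfftSetupData setup;
    clfftInitSetupData(&setup);
    clfftSetup(&setup);

    size_t lengths[3] = { 64, 64, 64 }; // anywhere in the ~24-192 range per dimension
    clfftPlanHandle plan;
    clfftCreateDefaultPlan(&plan, context, CLFFT_3D, lengths);
    clfftSetPlanPrecision(plan, CLFFT_SINGLE);
    clfftSetLayout(plan, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
    clfftSetResultLocation(plan, CLFFT_OUTOFPLACE);
    clfftBakePlan(plan, 1, &queue, nullptr, nullptr);

    // Forward R2C transform on GPU-resident data.
    clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue, 0, nullptr, nullptr,
                          &in_buf, &out_buf, nullptr);
    clFinish(queue);

    clfftDestroyPlan(&plan);
    clfftTeardown();
}
```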

pszi1ard commented 6 years ago

@bragadeesh Thanks for the correction -- I should have realized myself that a 3D FFT computation (typically) consists of more than a single kernel invocation, so it won't be as easy as loading a single CL kernel from a binary for the full 3D transform.

That said, depending on how much effort it takes and how much performance benefit it brings, it might be worth it for you to provide fused single-kernel small 3D transforms. In our experience on other platforms, the overheads involved in the multi-kernel 3D transforms optimized for large sizes are so high that moderately optimized fused 3D kernels (e.g. for factors 2 or 2/3) could end up being a lot faster.

Additionally, in the longer term we would definitely consider rolling our own 3D transforms based on the 1D FFT kernels, but I think these would need to be callable device-side for it to be worth it (considering kernel launch overheads, and that we could overlap our grid generation with the FFTs).

gstoner commented 6 years ago

Let’s build open Rav for what you want I have section in github for this. You write in markdown

pszi1ard commented 6 years ago

> Let’s build open Rav for what you want I have section in github for this. You write in markdown

Can you please clarify what you mean? Do you want me to write down something? What is "open Rav"?

gstoner commented 6 years ago

iOS changed RFQ to RAV. I'm asking that we build out an RFQ-SRS for what you need.

gstoner commented 6 years ago

Here is the RFC template and the place to manage them: https://github.com/RadeonOpenCompute/rfcs

pszi1ard commented 6 years ago

@gstoner I filed an RFC pull request and also noted some meta-stuff about the RFC format there.

Is there still a point in filing a separate bug report or RFC for the OpenCL runtime feature @whchung suggested?

pszi1ard commented 6 years ago

Ping. Quite some time has passed and I've yet to receive feedback here or on the RFC.

bragadeesh commented 6 years ago

@pszi1ard we don't have any update on the addition of CL interfaces for rocFFT; as I mentioned before, it requires a lot of plumbing in the stack, which we don't have consensus on yet. We did put resources into getting the clFFT compiler issues fixed, and that is progressing.

pszi1ard commented 6 years ago

Thanks for the update!

> We did put resources into getting the clFFT compiler issues fixed, and that is progressing.

Do you have a release ETA?

bragadeesh commented 6 years ago

Unfortunately I do not have a timeline, other than to say that clFFT validation on the ROCm platform is getting attention. What hardware do you plan to use with ROCm?

pszi1ard commented 6 years ago

We use RX 560s in CI and do development/testing on Vega (and Fiji).

I was, however, hoping to recommend ROCm to our users as the preferred platform for our next release (ETA ~end of 2018), but for that we'd need a stable, if not performant, FFT library. From that point of view, it would be great if all ROCm-supported hardware were at least validated/correct with clFFT.

psteinb commented 5 years ago

Hi all, I just stumbled upon this issue and was wondering if @pszi1ard could make the earlier statement

> clFFT is broken on Vega, and has questionable performance (from the benchmarks I've seen before).

more precise.

What we've seen so far is that clFFT under ROCm on a Vega 64 works decently (see the gearshifft-vega64-gv100 screenshot); I hope the screenshot contains most of the needed details on the benchmark @tdd11235813 ran.

pszi1ard commented 5 years ago

@psteinb Last time I checked (with ROCm 2.0) there were still failing regression tests, see https://github.com/clMathLibraries/clFFT/issues/218

In terms of performance, I'm doubtful it is competitive with the state of the art. It may be that clFFT is slower on the GV100 as well, but that's the wrong comparison IMHO; in this particular case the right comparison is cuFFT, which is a lot faster (up to 5x in the small 3D transform regime we care about).

pszi1ard commented 5 years ago

@bragadeesh any updates? Can we expect any changes to either clFFT or rocFFT in the foreseeable future? Performance with clFFT is still very poor, and in fact it seems to be regressing [1].

  1. https://github.com/RadeonOpenCompute/ROCm/issues/773

bragadeesh commented 5 years ago

@pszi1ard on the rocFFT side, supporting an OpenCL interface is not getting high priority at this time, and clFFT is not actively developed. Are you still locked to OpenCL? Is HIP an option? Let me explore what can be done.

OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before; if you can point me to the relevant sizes of interest, that would be helpful.

pszi1ard commented 5 years ago

> @pszi1ard on the rocFFT side, supporting an OpenCL interface is not getting high priority at this time,

@bragadeesh I'm quite unhappy to hear that.

> and clFFT is not actively developed.

That is what I had inferred from the level of activity. Is there no community interest either, as far as you know?

> Are you still locked to OpenCL? Is HIP an option? Let me explore what can be done.

Short answer: Yes / No(t really).

No, GROMACS is not "locked in"; on the contrary, we are choosing open, standards-based programming models, and given our limited resources, especially when it comes to hardware that has negligible use in our user base, we can't invest in proprietary stacks.

BTW, if there were easy OpenCL-HIP interop, we have quite modular code and could plug HIP-based FFTs into the application (this is all we need: https://github.com/gromacs/gromacs/blob/master/src/gromacs/ewald/pme_gpu_3dfft_ocl.cpp). However, realistically, if we are to get something better for AMD GPUs before the ~2021 timeframe, we need something soon (before mid-September, in time for our 2020 release freeze).

> OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before; if you can point me to the relevant sizes of interest, that would be helpful.

Sure, briefly, this is what we need: R2C/C2R, float, 3D transforms, with the data resident on the GPU (the grids are generated by a preceding kernel). Sizes are most commonly anywhere between 64-256 per dimension (not only powers of two), less commonly <32 or >256; we do filter out "nasty" factors and can tweak the grid size if there is a known heuristic to apply (also see the file linked above). A sketch of what that would look like through rocFFT's own API is below.
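
A hedged sketch only, assuming the data already lives on the device via HIP (which is precisely the interop gap discussed in this issue); the grid size is a placeholder, and work-buffer handling and error checking are omitted.

```cpp
// Sketch of the transform described above (float, 3D, R2C, GPU-resident data)
// through rocFFT's native API -- i.e. what we would call if the interop question
// were solved. Buffers are assumed to already live on the device via HIP.
#include <hip/hip_runtime.h>
#include <rocfft.h> // header location assumed; it may differ between ROCm versions

void run_forward_r2c(float* d_in, float2* d_out)
{
    rocfft_setup();

    const size_t lengths[3] = { 96, 96, 96 }; // placeholder grid, 64-256 per dim typical
    rocfft_plan plan = nullptr;
    rocfft_plan_create(&plan, rocfft_placement_notinplace,
                       rocfft_transform_type_real_forward,
                       rocfft_precision_single,
                       3,        // dimensions
                       lengths,
                       1,        // number of transforms
                       nullptr); // default plan description

    void* in_bufs[]  = { d_in };
    void* out_bufs[] = { d_out };
    rocfft_execute(plan, in_bufs, out_bufs, nullptr);
    hipDeviceSynchronize();

    rocfft_plan_destroy(plan);
    rocfft_cleanup();
}
```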

Let me know if you have thoughts on how to proceed.

bragadeesh commented 5 years ago

@feizheng10 @malcolmroberts please note the sizes of interest; we can discuss the OpenCL interface offline.

doctorcolinsmith commented 2 years ago

Closing due to no new activity.

pszi1ard commented 2 years ago

@doctorcolinsmith Can you please clarify what exactly you mean? Interop with OpenCL is a major shortcoming of the ROCm libraries, not just rocFFT.

"Closing due to no new activity" is quite unclear; should we interpret it as a "wontfix"? I.e., does this mean that AMD has no intention of supporting rocFFT (or ROCm libs in general) from OpenCL?