Closed pszi1ard closed 2 years ago
At present, there are no opencl bindings for rocFFT, because it requires some sort of interop functionality between HIP and OpenCL. We don't have that. There's no translation from HIP created memory buffer to a cl_mem object and vice-versa, for example.
The way I see it, there are 2 options.
@pszi1ard @bragadeesh I'd like to point out it's not entirely impossible, but does require some work in existing ROCm stack. To gain enough attention, perhaps the better place for the ticket shall be at: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime
No matter which programming language you use for GPU computing on AMD hardware, on the ROCm platform it is eventually compiled into an "HSA code object". The HIP runtime has no trouble loading kernels compiled by the OpenCL compiler, as long as the kernel arguments are prepared to match the API and ABI requirements of the HIP runtime. That is essentially how the ROCm ports of DNN frameworks such as Caffe / TensorFlow / PyTorch / Caffe2 / MXNet / CNTK are able to load and run the computation kernels in MIOpen, which are mostly written in OpenCL.
On the other hand, as @pszi1ard pointed out, we would probably need to extend the OpenCL runtime on ROCm to be able to load such kernels, prepare their arguments per the HIP application ABI, and invoke them.
Thanks for the quick feedback!
To the two options @bragadeesh suggested:
@whchung
To gain enough attention, perhaps the better place for the ticket shall be at: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime
What should the ticket state? I assumed filing an issue/RFE against rocFFT is the right thing, as it's rocFFT that should have the bindings to take cl_mem objects as arguments.
To @whchung's further points: what the platform compiles code to does not help us, because we want to develop the application, not the compilers/toolchain, and we want to use OpenCL. From that point of view (while it's just a nuance), the ideal would be for rocFFT to actually have an OpenCL API, rather than for the OpenCL runtime to gain special HIP-capable/compatible extensions.
@pszi1ard What I proposed was to retrieve the kernels within rocfft and load / launch them with clCreateProgramWithBinary; that should be possible with perhaps just a few tweaks in https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime, so I proposed raising the ticket there.
It seems your desired goal is to change the rocfft API so it takes cl_mem objects as arguments. That nullifies my proposal, and I'll leave it to @bragadeesh and @gstoner to decide the priority of that.
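For concreteness, the clCreateProgramWithBinary route described above would look roughly like the outline below on the host side. This is only a sketch, not runnable code: the file name, kernel name, and argument layout are made-up placeholders, and whether the ROCm OpenCL runtime would accept a rocFFT/HIP code object this way is precisely the open question in this thread.

```cpp
// Outline only; file name, kernel name, and argument layout are hypothetical.
// 1. Read a precompiled GPU code object from disk.
std::vector<unsigned char> bin = readFile("rocfft_kernel.co"); // hypothetical helper
const unsigned char* bins[] = { bin.data() };
size_t                lens[] = { bin.size() };

// 2. Hand the binary to the OpenCL runtime instead of OpenCL C source.
cl_int     status, err;
cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, lens, bins, &status, &err);
clBuildProgram(prog, 1, &device, "", nullptr, nullptr);

// 3. Launch it like any other OpenCL kernel; the hard part (per this thread)
//    is matching the HIP ABI's expected kernel-argument layout here.
cl_kernel k = clCreateKernel(prog, "fft_1d_pow2", &err); // kernel name is a guess
clSetKernelArg(k, 0, sizeof(cl_mem), &inputBuf);
clSetKernelArg(k, 1, sizeof(cl_mem), &outputBuf);
size_t gws = 1024, lws = 64;
clEnqueueNDRangeKernel(queue, k, 1, nullptr, &gws, &lws, 0, nullptr, nullptr);
```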
Can't/won't do; we want standards-based portable code, that's why we're working on OpenCL.
If you only care about AMD and NVIDIA GPUs for the "portable code" here (i.e., not Intel, FPGAs, etc.), you can hipify your code. Your HIP code will run on NVIDIA; you do not need to maintain a separate CUDA version, as the HIP code will automatically call CUDA for you. But you must have the source code.
https://gpuopen.com/hip-to-be-squared-an-introductory-hip-tutorial/
Like here, you only maintain the *.cpp HIP code. HIP runs it on an NVIDIA GeForce TITAN:

```
TITAN1:~/ben/hip/samples/square$ hipcc square.cpp -o square.hip.out
TITAN1:~/ben/hip/samples/square$ ./square.hip.out
info: running on device GeForce GTX TITAN X
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!
```

The identical *.cpp code recompiles and runs on an AMD Fiji as well:

```
Fiji1:~/hip/samples/square$ hipcc square.cpp -o square.hip.out
Fiji1:~/hip/samples/square$ ./square.hip.out
info: running on device Fiji
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!
```
@whchung Thanks for clarifying; actually, it was not entirely clear what you suggested, but now that I understand it better, I do think using clCreateProgramWithBinary
to load precompiled rocfft kernels could be a sensible route too -- especially if that allows earlier OpenCL support. Hence, I'll file an RFE.
As a side note: the main question is whether, in the short run, we should just hope that clFFT gets fixed (for Vega) and that its performance is not too bad, or whether there is a chance there'll be some form of rocFFT support that will also be competitive in performance. I know this is a broader question, but it is the original question that led me here. Note that we plan to release GROMACS code that relies on FFTs this fall (and we hope it will be more than just functional, but also competitive).
@tingxingdong Short answer: we don't want to hipify because, to be frank, until there is major traction around HIP, it's just technical debt that we'd be adding to our code base. Additionally, we want portability beyond just NVIDIA and AMD.
@pszi1ard just to be clear, what @whchung is suggesting is not directly usable by you. If we get such support, considerable rework would still be needed in the rocFFT library to support an OpenCL interface. For you to use @whchung's idea directly, you would have to take the single 1D kernels and do all the transposing and copying of data yourself, essentially writing about half of the FFT functionality.
We are discussing this internally on what is best way forward, we will let you know.
Can you give more info on the problems you are interested in? Is it all 3D FFT? Single precision? Real or complex? What factors for the sizes (powers of 2, 3, 5, etc.)?
Hi @bragadeesh and everyone, I'm working on the GROMACS OpenCL implementation together with @pszi1ard. What we are interested in is indeed 3D FFT, real to and from Hermitian interleaved, single precision. Each dimension can realistically be from 24 up to 192. We can scale the grid dimensions, so large prime factor support is not very important, though nice to have. Here is the issue I filed recently against clFFT, asking about the ROCm support status: https://github.com/clMathLibraries/clFFT/issues/218
@bragadeesh Thanks for the correction -- I should have realized myself that a 3D FFT computation (typically) consists of more than just a single kernel invocation, so it won't be as easy as loading a cl kernel from a binary for the full 3D transform.
That said, depending on how much effort it is and how much performance benefit it brings, it might be worth it for you to provide fused single-kernel small 3D transforms. In our experience from other platforms, the overheads involved in the large-transform-optimized multi-kernel 3D transforms seem so high that moderately optimized fused 3D kernels (e.g. for factors 2 or 2/3) could end up being a lot faster.
Additionally, in the longer term we would definitely consider rolling our own 3D transforms based on the 1D FFT kernels, but I think these would need to be device-side callable for it to be worth it (considering kernel launch overheads, and that we could overlap our grid generation with the FFTs).
Let’s build open Rav for what you want I have section in github for this. You write in markdown
Can you please clarify what you mean? Do you want me to write down something? What is "open Rav"?
iOS changed RFQ to RAV. I'm asking that we build out an RFQ/SRS for what you need.
Here is the RFC template and place to manage them. https://github.com/RadeonOpenCompute/rfcs
@gstoner I filed an RFC pull request, and also noted some meta-points about the RFC format there.
Is there still a point to file a separate bug report or RFC for the OpenCL runtime feature @whchung suggested?
Ping. Quite some time has passed and I've yet to receive feedback here or on the RFC.
@pszi1ard we don't have any update on the addition of CL interfaces for rocFFT; as I mentioned before, it requires a lot of plumbing in the stack, which we don't have consensus on yet. We did put resources to get the clFFT compiler issues fixed and it is progressing.
Thanks for the update!
we did put resources to get clFFT compiler issues fixed and it is progressing
Do you have a release ETA?
Unfortunately we do not have a timeline, other than to say that clFFT validation on the ROCm platform is getting attention. What hardware do you plan to use with ROCm?
We use RX 560s in CI and do development/testing on Vega (and Fiji).
I was, however, hoping to recommend ROCm to our users as the preferred platform for our next release (ETA ~end of 2018), but for that we'd need a stable, if not performant, FFT library. From that point of view, it would be great if all ROCm-supported hardware were at least validated / correct with clFFT.
Hi to all, just stumbled upon this issue and was wondering if @pszi1ard could make the earlier statement more precise:
clFFT is broken on Vega, and has questionable performance (from the benchmarks I've seen before).
What we've seen so far is that clFFT under ROCm on a Vega 64 works decently; I hope the screenshot contains most of the needed details on the benchmark @tdd11235813 did.
@psteinb Last time I checked (with ROCm 2.0) there were still failing regression tests, see https://github.com/clMathLibraries/clFFT/issues/218
In terms of performance, I'm doubtful it is competitive with the state of the art. It may be that clFFT is slower on the GV100, but that's the wrong comparison IMHO; in this particular case the right comparison is cuFFT, which is a lot faster (up to 5x in the small-3D-transform regime we care about).
@bragadeesh any updates? Can we expect any changes to either clFFT or rocFFT in the foreseeable future? Performance with clFFT is still very poor, and in fact it seems to be regressing [1].
@pszi1ard on the rocFFT side supporting opencl interface is not getting high priority at this time; and clFFT not actively developed; are you still locked to opencl? is HIP an option? Let me explore what can be done.
OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before, if you can point me to relevant sizes of interest, that would be helpful.
@pszi1ard on the rocFFT side supporting opencl interface is not getting high priority at this time;
@bragadeesh quite unhappy to hear.
and clFFT not actively developed;
That is what I'd inferred based on the level of activity. Is there no community interest either -- as far as you know?
are you still locked to opencl? is HIP an option? Let me explore what can be done.
Short answer: Yes / No(t really).
No, GROMACS is not "locked in"; on the contrary, we are choosing open, standards-based programming models. Given our limited resources, and especially when it comes to hardware that has negligible use in our user base, we can't invest in proprietary stacks.
BTW if there is easy OpenCL - HIP interop, we have quite modular code and could plug in HIP-based FFTs into the application (this is all we need: https://github.com/gromacs/gromacs/blob/master/src/gromacs/ewald/pme_gpu_3dfft_ocl.cpp). However, realistically, if we are to get something better for AMD GPUs before the ~2021 timeframe, we need something soon (before mid-September in time for our 2020 release freeze).
OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before, if you can point me to relevant sizes of interest, that would be helpful.
Sure, briefly this is what we need: R2C / C2R, float, 3D transforms, data resident on GPU (grids generated by preceding kernel). Sizes anywhere between 64-256 per dim (not only power of two) most commonly, less commonly <32 or >256; we do filter out "nasty" factors and can tweak grid size if there is a known heuristic to apply (also see the above linked file).
Let me know if you have thoughts on how to proceed.
@feizheng10 @malcolmroberts please note the sizes of interest, we can discuss offline on opencl interface
Closing due to no new activity.
@doctorcolinsmith Can you please clarify what exactly you mean? Interop with OpenCL is a major shortcoming of the ROCm libraries, not just rocFFT.
"Closing due to no activity" is quite unclear; should we interpret this as a "wontfix"? I.e., does this mean that AMD has no intention to support rocFFT (or ROCm libs in general) from OpenCL?
Note that this is a critical dependency for our work on bringing feature-parity with CUDA in the next GROMACS release.