CUFFT - Githubissues

timholy commented 10 years ago

I'm a newbie when it comes to CUDA, but a former member of my lab wrote a MEX file that uses CUDA to compute FFTs. I'm now trying to port this to Julia. I've begun a wrap of CUFFT, but I'm getting a segfault when I use it in conjunction with this repository. The test is test/test.jl in that repository, and it gives me the message

julia> include("test.jl")
CUDA Driver Initialized
terminate called after throwing an instance of 'cufftResult_t'
Aborted (core dumped)
[tim@cannon test]$

I was such a newbie at this when I began that I didn't even understand the distinction between the "driver" and "runtime" APIs, and initially I mistakenly thought (because of what turned up when I googled) that your CUDA.jl was targeting CUDA 3.2. So I began to do a wrap of what I thought was the "modern" API, but which I now understand is the Runtime API. It turns out that the segfault doesn't happen when I do a call to cudaSetDevice, which, if I understand correctly, initializes the context behind-the-scenes, so it should be roughly equivalent to the version that segfaults (?).

Now that I understand the landscape better, I see that the driver API is perfectly modern and apparently more flexible than the runtime API. (In particular, modules seem to be restricted to the driver API.) CUFFT seems to target the Runtime API. However, several things I've seen online suggest that you can mix the two APIs, so it seems like this should work. That's why I find the segfault surprising.

I'd rather consolidate our efforts, so frankly I'd prefer to contribute to your CUDA.jl rather than create a second wrapper. But I do need FFTs. Any ideas?

(If it's helpful to you, I can push my start at the runtime API wrapper to a public place, so you have access to it.)

lindahua commented 10 years ago

Thanks for the effort. We definitely should work out a way such that driver and runtime functions should work together. Will look into this later.

Do you have the codes that cause the segfault?

timholy commented 10 years ago

Just run test/test.jl in CUFFT.jl

moon6pence commented 10 years ago

Hi all, I'm a student applying JSOC 2014. I'm digging up some GPU issues.

This issue seem to be same problem with next issue, https://github.com/lindahua/CUDA.jl/issues/2 CUFFT also need context with cuCtxCreate_v2 with CUDA_API_VERSION >= 3020

There's a lot of "_v2" function defines with C macro in cuda.h in CUDA driver api.

#if defined(__CUDA_API_VERSION_INTERNAL) || __CUDA_API_VERSION >= 3020
    #define cuDeviceTotalMem                    cuDeviceTotalMem_v2
    #define cuCtxCreate                         cuCtxCreate_v2
    #define cuModuleGetGlobal                   cuModuleGetGlobal_v2
    ... (over 1 pages of functions)

There's also some more re-defines for CUDA_API_VERSION >= 4000. OTL

I think we have to select function name according to CUDA version, or simply we can just support later than CUDA 4.0 (it seems that there's no re-defines after 4.0)

lindahua commented 10 years ago

@moon6pence Thanks for this. I will take a look at those re-defines.

timholy commented 10 years ago

I should also say that in the meantime I've generated a wrapper for CUFFT and the runtime API. I haven't released it yet because it's missing some stuff, mostly documentation.

lindahua commented 10 years ago

I have made a major fix to the CUDA package, correcting the new function names (by looking at the defines in cuda.h) as @moon6pence suggested.

Is this now working?

timholy commented 10 years ago

Haven't had a chance to test, but I will.

In the meantime, even though nothing is documented I decided to push my runtime API repository and an updated, working CUFFT. That way you can at least see how the code has evolved (bottom line: quite a lot). I used Clang to do a low-level complete wrap of both libraries, and then put a Julian wrapper on top of that.

moon6pence commented 10 years ago

I tested that both this issue and https://github.com/lindahua/CUDA.jl/issues/2 works well with new patch. Changing functions to _v2 version was enough, not the problem between CUDA driver and runtime APIs.

And I have looked around CUDA runtime API wrapper @timholy I have some comments about CUDA driver API and runtime API:

We have no choice but to use CUDA driver API, we can support PTX module/function loading with ONLY driver API.
Driver API is low-level and covers all CUDA features, including newest features such as Unified Memory in CUDA 6.0
Runtime API look like thin-wrapper of driver API, to provide more user-friendly programming interface for C/C++ programmer.

I think it is enough to use driver API only to support all CUDA features. In addition, we can develop much better interface with julia, than what runtime API does to C/C++ programmer. However, CUDArt.jl has more features than CUDA.jl such as strided device array and it has to be integrated into one single GPU package well, I wish.

(I hope I can contribute a little to that :smiley:)

timholy commented 10 years ago

I'm loading PTX modules with the CUDArt package all the time. Indeed it already has a nice do style mechanism for closing down resources when it's done. Of course, it uses the driver API to achieve those functions.

But if one can wrap CUFFT and CUBLAS with the driver API, then it may be the better target. I just didn't figure out how to pull that off, whereas it seems that you have done so.

Off the top of my head, aside from the runtime/driver distinction, here are the main ways in which CUDArt.jl improves on CUDA.jl:

The Clang wrapper gives complete API coverage
As you say, many more array types are handled in CUDArt
Better control over memory cleanup
Better module cleanup
Several kernels that are important for operations like copy! to pitched arrays
Better support for streams and for multiple devices (I usually run on a machine with 4 NVIDIA cards in it)

One option is simply to say that CUDArt is the new-and-improved CUDA, and deprecate CUDA. After all, if you start from the runtime API, you can call any driver API function you want, whereas in my initial tests (without the fix you found) it wasn't true the other way around.

The alternative is to port these features from CUDArt to CUDA. Since I (1) quite desperately needed to get CUDA stuff running for my own coding needs, (2) put a lot of time into getting CUDArt working, and (3) am now using it all the time in my own work, I confess that with all the other things on my plate porting them back to CUDA.jl is not likely to rise to high spot on my priority list anytime soon. But if you're willing and think that's the best path, go for it.

moon6pence commented 10 years ago

Thanks for long comment. To be clearly, I intended to say "Which library is better to use with julia, CUDA driver or runtime API?", not to compare two julia to CUDA wrapper: CUDA.jl and CUDArt.jl.

(I thought that driver API is slightly better, because it looks more primitive. Anyway it is not import problem, we can mix driver and runtime API after CUDA 3.x and both API provide almost same functionality.)

And about julia wrappers. I am digging up CUDArt.jl package today. It seem to demonstrate how julia code have to be rather than simple API wrapper. Because I am C++ / CUDA programmer for long time and newbie in julia world, this code helps me a lot.

A little comment about testcase for multiple GPU in CUDArt.jl : It wasn't scheduled like testcase intended in my machine with 2 GTX 680. @async section is not run in turn, sometimes change order.

timholy commented 10 years ago

Oh, I was pretty sure you were mostly commenting on the different APIs. I just meant that if we decide the driver API is the way to go, there is a fair amount of porting work to be done :).

That's good to know about the order. I have early-period Teslas (CC 2.0), and their stream capabilities leave a lot to be desired. Does it fix it if you insert a sleep(sleeptime/10) just before the @async block?

timholy commented 10 years ago

OK, I've moved CUDArt and CUFFT over to JuliaGPU.

We need to resolve an important issue, however: what to do about CUDA vs CUDArt? I see three options:

Keep both packages
Deprecate one package
Merge the packages

I'm not sure the first one makes much sense, so I suspect we should do the second or third.

Points in favor of keeping CUDA:

CUDA is the "grandmother" of all of these; the only reason for the CUDArt split was the apparent difficulty of incorporating it with CUFFT, which @moon6pence apparently has solved (needs testing)
There may be some advantages in targeting the lowest-level API.
In addition to the work here, @moon6pence has done some nice work enhancing CUDA.jl (https://github.com/moon6pence/CUDA.jl)

Points in favor of just going with CUDArt:

CUDArt is quite a bit more full-featured at this point (even with @moon6pence's improvements), and it will take some effort to port this over to the driver API. In addition to all the Julia code, note that I've written some very useful kernels (functions needed to implement fill! for CUDA arrays, timer functions useful for debugging streams) and an infrastructure that loads them automatically.
CUFFT already works and has been built on CUDArt
The runtime API is probably more widely used by C/C++ programmers, and is better documented by NVIDIA and in online forums
For missing functionality, CUDArt dips into the driver API when needed (e.g., modules). I've not seen any problems whatsoever from that, and NVIDIA advertises that this works. However, this does force developers of CUDArt to know a bit about two APIs, which is a bit inconvenient.

I was reluctant to push CUDArt as "the" solution before, and about 2/3 of that was my reluctance to own yet another package :smile:. But with the move to JuliaGPU likely taking some of the maintenance burden off my shoulders, to be perfectly honest I think the balance of arguments are in favor of going with CUDArt. Nevertheless, if others want to go to the effort to backport the functionality in CUDArt to CUDA and then deprecate CUDArt, I will have no objections.

moon6pence commented 10 years ago

Hi, I'm back to this issue!

Congratulation to launch of brand new JuliaGPU! :fireworks: :fireworks: :fireworks: The issue is becoming serious concern of future of JuliaGPU :smiley:

I'm a poor student just following pioneer's works, but I'd better add small words to issue.

The runtime API is probably more widely used by C/C++ programmers, and is better documented by NVIDIA and in online forums

This may be important point. CUDA Driver API may looks like old and NVIDIA provide more document and examples for CUDA runtime APIs. I think we'd better check how it changed in CUDA 6 (for example, does new feature such as unified memory support both APIs?) I will test CUDA 6 and report about that soon.

Another point is, is there any active user for both package, currently? Of course @timholy is using CUDArt.jl actively for maybe his in-house code, but I'm afraid if no one is using CUDA.jl for their production or experiments over vecAdd example yet.

I don't care about recent my working to CUDA.jl. I made some codes but that was almost reviewing of two packages. I'm newbie in julia and it was good study to me. I just wanted to say "Hey, I can help you to improve julia for GPU computing", code will be better than my poor poor proposal T_T

Anyway, good luck to brand new JuliaGPU and cheer up to us!

lindahua commented 10 years ago

@timholy I think CUDArt.jl is a great package. It is definitely more full-featured that the simple wrapper that is CUDA.jl

In terms of the road ahead, I think we should probably build upon CUDArt.jl. We can add driver functions if needs do come up.

IMHO, what CUDArt needs at this point is some documentation, so that we can see more clearly who it offers.

timholy commented 10 years ago

IMHO, what CUDArt needs at this point is some documentation, so that we can see more clearly who it offers.

Agreed that this is a major omission. It will also help my lab, as well as the wider community. I've started the process, see https://github.com/JuliaGPU/CUDArt.jl.

mattcbro commented 8 years ago

The CUDA package saved my bacon this week. It is simple, easy to understand and gets the job done. It sure beats having to wrap the kernels in C code! Do you guys take donations?

Maybe you've already thought of this, but it's critical as you develop these capabilities that you have a mechanism for sharing a CUDA buffer between say a custom kernel using the CUDA package and one of the CUBLAS libraries, or the FFT library.

It's a common use case. I use a custom kernel to fill up a matrix and then run some standard linear algebra or FFT operations on it. What you don't want to do is to bring the data back into host memory until you are done.

When you have even better integration you can overload the cublas and fft libraries to operate directly on a CuArray. Much of the julia array semantics could be preserved. Oh one other way to improve things. You should be able to get rid of the free calls. I wrapped an entire cuda library in C++ and used the normal scoping memory management to delete the cuda buffers when objects go out of scope. I suspect your garbage collector might be able to finalize the cuda object when it cleans up. It depends on how it's implemented.

Having said all that I ran into some peculiar problems with CUDA wherein to_host() ground to a standstill, sometimes even timing out, while trying to transfer data back to main memory. I'm not sure if I can isolate that, or even if it's entirely a Julia issue.

I'm running julia 0.3.11.

timholy commented 8 years ago

CUDArt (https://github.com/JuliaGPU/CUDArt.jl) and CUFFT already integrate nicely, and CUDArt also implements finalizers and memory management.

mattcbro commented 8 years ago

Glad to hear that you are on the right track. As it is I want to extract my CuArray buffer from the CUDA package, and feed it into the cuda version of gemm, the matrix multiplier. Any way I can do that now? Maybe the CuArray wrapper is straightforward? I'll take a look shortly.

Having built libraries like this a bunch of times this is a pretty critical need to get the kind of 10-100 factors of speed we are looking for. One other thing is that you want to be able to take a raw blob of existing GPU memory and wrap a lightweight array struct around it. In fact you want to be able to create views into an array to create subarrays and use that in either a custom kernel or in a vendor supplied library like CUFFT.

I am probably just preaching to the choir here, since anyone who messes with numerical linear algebra runs into these kinds of performance issues. Plus you see it in action with libraries like numpy or vsipl and so forth.

timholy commented 8 years ago

Click at the top of this page on "JuliaGPU", then browse the list of packages. You'll see the CUBLAS.jl package.

JuliaAttic / CUDA.jl

CUFFT #1