AccelerateHS / accelerate

Embedded language for high-performance array computations
https://www.acceleratehs.org

Mutable arrays / lower level interface #86

Open js6i opened 11 years ago

js6i commented 11 years ago

Hi,

I figured this would be a good place to post this.

To allow more memory reuse, one could use a lower-level interface bridging between Accelerate and CUDA. Something along the lines of:

ptr1   <- getDevicePtr $ use initialData
ptr2   <- getDevicePtr newDeviceArray
krn    <- getKernel algorithm
result <- sequence . take steps . cycle $ [krn ptr1 ptr2, krn ptr2 ptr1]

where krn p1 p2 would take p1 as the input array and p2 as the output array.

In my case this is motivated by the fact that my kernel takes at most a few hundred ms to execute, while a whole step including allocation takes a few seconds. With hundreds of steps and enough samples for statistics this is unacceptable (and obviously I'd love to do my stuff in Haskell and not C++).

Would it be a difficult task to add such a feature? Or maybe there's another way of reusing memory? I tried keeping my loop in Acc, but that was no good either (according to the -ddump-gc output, arrays were reallocated anyway, and I couldn't do anything between steps).

Thanks for your consideration,
Jan Sikorski

mchakravarty commented 11 years ago

I can imagine that what you'd like to do is a common idiom. Hence, I wonder whether we can't find a better high-level way to express this idiom in Accelerate, such that it can generate efficient code for you.

Have you maybe got an example program (perhaps a cut-down version of your application) that exemplifies your problem? (I find it much easier to work from a concrete example.)

js6i commented 11 years ago

A slightly more general case would be when the next step depends on a set of N previous steps. Then you would cycle [[1..N] -> o, [2..N, o] -> 1, [3..N, o, 1] -> 2, etc.]. Although it appears that for stencils only N = 1, 2 is currently available. For non-stencil operations and N = 1 it would also allow modifying the array in place. So a somewhat higher-level function could take the inputs, and maybe an action to execute in between steps, and return the result.
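For what it's worth, the rotation schedule described above can be written down directly as a toy model in plain Haskell (just an illustration, assuming N+1 preallocated buffers numbered 0..N; the names are made up):

-- For N history buffers plus one output slot, step k reads buffers
-- k..k+N-1 (mod N+1) and writes buffer k+N (mod N+1), so the oldest buffer is
-- always the one being overwritten and nothing is ever reallocated.
schedule :: Int -> [([Int], Int)]    -- (input buffers, output buffer) per step
schedule n = [ (map slot [k .. k + n - 1], slot (k + n)) | k <- [0 ..] ]
  where slot i = i `mod` (n + 1)

-- e.g. take 4 (schedule 1) == [([0],1),([1],0),([0],1),([1],0)], the ping-pong case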

Anyway, given that with a C-like interface it is trivial to implement such schemes (or maybe I'm missing something), I'd suggest exporting appropriate building blocks. Especially since it will probably be rather difficult, if not impossible, to anticipate all possible schemes involving mutation and fit them into a pure, non-monadic interface.

As for a concrete example -- any iterated kernel would probably benefit from this approach. My code solves a 3+1D PDE for relativistic hydrodynamics and is rather crappy at the moment (thrown together in a couple of days), but if you wish I can send it to you.

JS

mchakravarty commented 11 years ago

Anyway, given that with a C-like interface it is trivial to implement such schemes (or maybe I'm missing something), I'd suggest exporting appropriate building blocks. Especially since it will probably be rather difficult, if not impossible, to anticipate all possible schemes involving mutation and fit them into a pure, non-monadic interface.

Possibly. However, this is what we do (and the whole point of Accelerate): to get out of the morass of imperative stateful interfaces and to find suitable high-level abstractions.

Having said that, we (to be precise, @robeverest) recently added support for a foreign interface to use native CUDA code within Accelerate. We might want to think about the converse, too: using Accelerate computations from native CUDA. @robeverest, what do you think?

Nevertheless, even if we can "foreign export" Accelerate computations to CUDA, I'd still like to find a high-level pattern for your use case because, I think, it is a common one, worth abstracting.

@tmcdonell What do you think? Would this situation fit into any of the thoughts we had about a combinator for iteration? In a sense, maybe, we'd like a kind of convergence operator that repeats a stencil (or similar) until a condition is satisfied.

js6i commented 11 years ago

Fair enough. I agree that such an abstraction is desirable. So the most pragmatic question: can I somehow achieve this effect quickly (even with an ugly hack) without major changes to Accelerate? My stuff is a university project and also the topic of my master's thesis, so I'd like to advertise Haskell and Accelerate a bit -- the Accelerate kernel is as fast as or even faster than our current C++ implementation, and thanks to JIT compilation there are nice runtime features as well.

JS

mchakravarty commented 11 years ago

@tmcdonell Have you got any idea for a quick hack to get @js6i going?

tmcdonell commented 11 years ago

Hi Jan, sorry for the slow reply...

To try and narrow down the problem a bit first:

It looks like for each step of your solver you need to iterate a bit until convergence, and it is during that convergence iteration that you would have done the pointer-swapping technique -- is that correct? I'm wondering how your program is set up at the moment, and specifically how many times you call run*. Note that run* will copy its result back to the host, so if this happens as part of the convergence iteration that is not going to be good. We could provide a version of run* that doesn't do the final copy, but it would be better if the Accelerate program fired off steps kernels for each call to run*.

If your program is the latter (a single call to run* attempts to launch many kernels), then the problem really is with device-side memory allocation. I noticed this recently, actually; it seems that sometimes the CUDA API will pause for a while trying to allocate memory, and running your program through the CUDA profiler will tell you if this is happening. I think it wouldn't be too hard to be a bit cleverer with memory de/allocation --- basically, don't free device memory straight away, but keep it in a "nursery" of sorts so it can be reused instead of allocating a fresh array every time.
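To make the idea concrete, here is a minimal sketch of such a nursery (this is not Accelerate's actual implementation; the pointer type and the allocation action are stand-ins):

import Data.IORef
import qualified Data.IntMap.Strict as IM

-- Released allocations are cached by byte size and handed back out before
-- asking the device for fresh memory.
data Nursery ptr = Nursery
  { cached   :: IORef (IM.IntMap [ptr])   -- released pointers, keyed by size
  , allocNew :: Int -> IO ptr             -- the real device allocation, e.g. a cuMemAlloc wrapper
  }

-- Take a cached allocation of the right size if one exists, otherwise allocate.
acquire :: Nursery ptr -> Int -> IO ptr
acquire nrs bytes = do
  hit <- atomicModifyIORef' (cached nrs) $ \im ->
           case IM.lookup bytes im of
             Just (p:ps) -> (IM.insert bytes ps im, Just p)
             _           -> (im, Nothing)
  maybe (allocNew nrs bytes) return hit

-- Instead of freeing straight away, stash the pointer for later reuse.
release :: Nursery ptr -> Int -> ptr -> IO ()
release nrs bytes p =
  atomicModifyIORef' (cached nrs) $ \im -> (IM.insertWith (++) bytes [p] im, ())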

If you could provide a (preferably trimmed down) example program, I think that would help a lot.

Additionally, using run1 in your program instead of run is a lot quicker if you are applying the same computation repeatedly (which it sounds like you are doing).
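For example, a time-stepping loop driven from the host might look like the following sketch (the update body is just a placeholder, and the CUDA backend is assumed):

import Prelude                    as P
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA ( run1 )

-- placeholder for the real update step
step :: Acc (Vector Float) -> Acc (Vector Float)
step = A.map (\x -> sin x + cos x)

-- run1 specialises the computation once; the host loop then just reapplies it.
-- Note that each application still copies its result back to the host.
simulate :: Int -> Vector Float -> Vector Float
simulate n xs0 = P.iterate (run1 step) xs0 P.!! n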

js6i commented 11 years ago

Hi,

OK so I made a little example. Here it is: http://www.fileswap.com/dl/EW0i7HOfty/

So it basically does 2 loops -- one inside run1, and one outside. The result is that the outside loop takes roughly 1.5 times longer to run than steps × kernel time, due mainly to memory copying and reallocation. On the other hand, the big kernel with the inside loop has a rather big gap before executing, and then proceeds as expected (faster than the outside loop, but still with memory reallocation). It's not compilation time, since the next execution of the same big kernel on different data has this gap again.

So in my program it's probably a combination of the two -- my kernel is quite big (and does some iteration inside, if that's relevant), and I iterate it outside of run1 (for the time propagation of the system).

Btw. sorry for the duplicate post on the mailing list -- I thought it got deleted and would not appear.

JS

tmcdonell commented 11 years ago

OK so I made a little example. Here it is: http://www.fileswap.com/dl/EW0i7HOfty/

Great, thanks!

So it basically does 2 loops -- one inside run1, and one outside. The result is that the outside loop takes roughly 1.5 times longer to run than steps × kernel time, due mainly to memory copying and reallocation. On the other hand, the big kernel with the inside loop has a rather big gap before executing, and then proceeds as expected (faster than the outside loop, but still with memory reallocation). It's not compilation time, since the next execution of the same big kernel on different data has this gap again.

I'm currently working on the second problem here: long startup times before anything gets executed. I've pretty much got that problem nailed down, but I need to update the CUDA backend to work with the changes. That should be done pretty soon though (famous last words), so you'll at least have one version that works quickly(-ish). I'll let you know when that's done.


tmcdonell commented 11 years ago

I pushed some patches that attempt to reduce the number of allocations. For your example it doesn't change things too much, but once you scale up the number of iterations good things happen. For the fluid simulation program in accelerate-examples, which (from memory) computes 120 stencil operations per frame, the runtime drops ~20%.

If you compile accelerate-cuda in debug mode (cabal install -fdebug) you'll see it in action. The first call to loop1 will spit out lots of lines containing "malloc/new", and subsequent calls should be all "malloc/nursery". The latter means it isn't calling out to the device for fresh memory; it's reusing an old pointer.

Let me know how it goes for your program!

js6i commented 11 years ago

Hi,

I just tested your changes (thanks for that, nice work!) and confirmed that it works nicely for the example program.

Sadly it did not fix the problem in my real program (though the gap may be smaller now). I tried cranking up the number of iteration steps in the example and noticed that all the bad stuff comes back -- the long gap before execution, and output-array reallocation. At 1000 steps I even get a CUDA out-of-memory error at some point. Do you have any idea where that may come from?

Another problem I noticed (not directly related to this one) is iteration deeper inside the kernel. In the example I made a little 10-step iteration of sin and cos. With 100 steps or so, nvcc already takes ridiculously long to compile. I believe adding an iteration primitive was discussed somewhere -- is that still scheduled to be done?

JS

tmcdonell commented 11 years ago

I see what you mean about cranking up the iterations. As a workaround, if you change the definition of loop1 to

loop1 steps = run1 $ foldl1 (>->) (P.replicate steps update)

then I don't get the out-of-memory problem anymore and the program goes through a bit quicker. The (>->) operator can be used to separate parts of the program. We have that (acc1 >-> acc2) is similar to (acc2 . acc1), except that acc1 and acc2 will not share any subcomputations. This gives my crappy simplifier smaller chunks to work with (which will be fixed at some point… hopefully), but it also helps the GC know when it can clear out old, unused arrays.
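As a self-contained sketch of the contrast (the update body here is only a placeholder, and the CUDA backend's run1 is assumed):

import Prelude                    as P
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA ( run1 )

-- hypothetical stand-in for the real update step
update :: Acc (Vector Float) -> Acc (Vector Float)
update = A.map (\x -> x * 0.5 + 1)

-- plain composition: one large expression handed to the simplifier in one piece
loopCompose :: Int -> Vector Float -> Vector Float
loopCompose steps = run1 $ P.foldr1 (\f g -> f . g) (P.replicate steps update)

-- pipelined with (>->): the stages stay separate, so intermediate arrays can be
-- released (or recycled via the nursery) once the next stage has consumed them
loopPipe :: Int -> Vector Float -> Vector Float
loopPipe steps = run1 $ P.foldl1 (>->) (P.replicate steps update)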

An iteration primitive is still on the cards, yes, but I haven't had a chance to work on that in a while, sorry.