The error simply indicates that you've run out of memory on your GPU.
At the moment, there isn't a whole lot you can do about this. We rely on GHC's garbage collector to tell us when we can free up arrays on the device, so if your code is keeping large arrays in memory for long periods of time, they will remain active on the GPU as well. We could provide our own GPU-aware garbage collector, as GPU memories are typically much smaller than the host CPU memory, but currently there is no such thing...
Sometimes setting the maximum heap size to match the size of your GPU memory can encourage GHC to deallocate stuff sooner. I forget what the RTS flag for that is though.
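For what it's worth, the RTS flag in question is -M<size>, which caps the maximum heap size. A rough example, assuming a program built with -rtsopts and a GPU with around 4 GB of memory (Main.hs and the 4G value are just placeholders):
$ ghc -O2 -rtsopts Main.hs
$ ./Main +RTS -M4G -RTS
The same option can usually be passed to ghci on its command line as well (ghci +RTS -M4G -RTS).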
If your GPU is shared with the OS (GUI, etc), sometimes closing other applications can free up a tiny bit of memory (or depending on your system, switch to the integrated GPU and back to discrete GPU will force the OS to clear out a bunch of memory).
Is there any ETA for the GPU-aware garbage collector? I'm processing images in an environment similar to ghci and each computation uses up more and more memory, which gets released only when the ghci session ends.
You can observe similar behaviour using nvidia-smi and running the following example in ghci:
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as C
let a = [1..1000000] :: [Double]
-- repeat this part (the memory usage on the GPU keeps on increasing)
let a' = use $ fromList (Z:.1000000) a :: Acc (Vector Double)
seq (C.run $ A.map (+1) a') ()
While in this simple case the problem doesn't seem that serious, when processing 4K images (especially convolution) the memory usage can increase by a few hundred megabytes per run. After just a few of those, even high-end GPUs might run out of memory.
Is there any chance for the Accelerate GPU GC routine to be implemented any time soon? If not, is it possible to add functionality allowing me to manually free all the memory used so far? This is crucial for me, since with things as they are the whole application is somewhat unusable and would require a major rewrite (if that's even possible).
I currently have no ETA for this issue, sorry.
If the source array on the Haskell side disappears, then doing a performGC should force the associated device array to be deallocated. But if those arrays are still around on the host, even if you know you won't need them anymore, then unfortunately there is currently no way to purge them from the GPU.
@zgredzik The distinction that @tmcdonell makes is very important (and it is not quite clear to me from your question which case you are looking at). Considering the arrays that lead to memory exhaustion in your application, are these arrays (1) truly garbage (i.e., they will never be used again) or are they (2) only temporarily unused (but they'll be processed on the GPU again later)?
If you are looking at Case (1), then performGC or maybe some other tricks should help. However, if you are looking at Case (2), then the situation is more complicated and changes to Accelerate would be required to solve your problem.
OK, but when we modify @zgredzik's example a bit, we can make sure that the arrays lose their scope and should be GC'd. Yet that doesn't happen, as the following example shows:
$ ghci
> -- Graphic card memory usage: 156MiB
> import Data.Array.Accelerate as A
> import Data.Array.Accelerate.CUDA as C
> import System.Mem
>
> seq (C.run $ A.map (+1) (use $ fromList (Z:.10000000) [1..10000000] :: Acc (Vector Double))) ()
()
> -- Graphic card memory usage: 335MiB
> performGC
> -- Graphic card memory usage: 335MiB (still)
> :show bindings
it :: () = ()
> -- no array bindings remain, but there are still objects stuck on the GPU
Is there any way to release this memory? This is REALLY crucial to us.
@mchakravarty they are truly garbage; however, the device memory is not released even after performGC.
The following example shows that when running performGC the host memory is released (you can observe a drastic drop in memory usage with the ekg monitor), but the device memory usage stays the same (checked with nvidia-smi -l 1).
:set -XOverloadedStrings
import System.Mem as M
import System.Remote.Monitoring as RM
forkServer "localhost" 8000
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as C
C.run $ A.maximum $ A.map (+1) $ A.use (fromList (Z:.10000000) [1..10000000] :: Vector Double)
-- here the memory usage increases
performGC
-- and here it drops only on the host
We are 100% certain that we will not be using the same data that was already uploaded to the device ever again after finishing the computations. Is adding a function that, when called, would free the device memory even an option?
I have a couple of spare cycles now so will try and look into it.
As a thought experiment for an explicit free, the memory manager just has to be told that when the finaliser later runs but the memory has already been (explicitly) removed, that is not an error.
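A minimal sketch of that explicit-free/finaliser interaction, using a made-up DeviceArray record rather than the real Accelerate memory manager (all names here are hypothetical):
import Data.IORef
import Control.Monad (unless)

-- Hypothetical stand-in for a device allocation; not the real Accelerate type.
data DeviceArray = DeviceArray
  { freed   :: IORef Bool   -- has this allocation already been released?
  , release :: IO ()        -- the actual deallocation action
  }

-- Explicit free: release the memory now and remember that we did so.
explicitFree :: DeviceArray -> IO ()
explicitFree arr = do
  already <- atomicModifyIORef' (freed arr) (\b -> (True, b))
  unless already (release arr)

-- Finaliser: if the array was already freed explicitly, this is a no-op
-- rather than an error; otherwise it releases the memory as usual.
finaliser :: DeviceArray -> IO ()
finaliser = explicitFree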
@zgredzik I find it quite strange that the GPU memory isn't freed even after the host arrays have been deallocated.
@tmcdonell Do you think it is the finalisers getting delayed? They are not guaranteed to run right away, but there shouldn't be a long delay either. Is there any easy way in which we can check how much time passes between host array deallocation and the execution of the finaliser?
If finalisers are really delayed, one possible hack would be to keep a set of weak pointers to all host arrays, together with the address of the corresponding device array. The weak-pointer API lets us check whether the referenced host array has become garbage, which would indicate that the device array needs to be released (if it hasn't been already).
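A minimal sketch of that weak-pointer bookkeeping, with made-up names (the Int handle and the freeDevice callback are stand-ins, not Accelerate internals):
import System.Mem.Weak (Weak, mkWeakPtr, deRefWeak)

-- Pair a weak pointer to the host array with a handle for its device copy.
data Tracked a = Tracked
  { hostRef   :: Weak a   -- weak pointer to the host-side array
  , deviceMem :: Int      -- stand-in for the device allocation handle
  }

-- Register a host array together with the handle of its device copy.
track :: a -> Int -> IO (Tracked a)
track arr handle = do
  w <- mkWeakPtr arr Nothing
  return (Tracked w handle)

-- Walk the tracked set; free the device copy of every host array that has
-- already become garbage, and keep tracking the ones that are still alive.
sweep :: (Int -> IO ()) -> [Tracked a] -> IO [Tracked a]
sweep freeDevice = go
  where
    go []     = return []
    go (t:ts) = do
      alive <- deRefWeak (hostRef t)
      case alive of
        Nothing -> freeDevice (deviceMem t) >> go ts
        Just _  -> fmap (t :) (go ts)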
@mchakravarty I am not sure if they are getting delayed, but certainly the strategy you mention would give an Accelerate GC some more control over what happens and is a good place to start.
Is there anything new on this topic? Is there ANY way we can release this memory?
Hi @remdezx @zgredzik. The free cycles I thought I had got taken up by getting sick and then a conference, but I still have this next on the TODO list.
Hello Trevor, Manuel, Rob! :)
I'm writing to you guys because I'm a little worried about Konrad and Piotrek (zgredzik and remdezx): they are fighting really hard with this problem, and we are just about to release the software and are facing deadlines here. I would love to ask you if there is any possibility to just set a date when this issue will be fixed? I know this is sometimes hard to estimate and sometimes even hard to predict, but I'm also scratching my head about what to do in this situation. You know, right now when we are processing images, after a few operations the card memory is completely filled and the software stops working :( Additionally, at the beginning of next week we've got a presentation at SIGGRAPH, and this issue is just a killer for us in this situation :(
I would be very, very thankful if you find a solution and help us with this issue. Thank you once again, Wojtek
We talked about having some very low-level unsafeFree operation, so that you can explicitly delete arrays from device memory. I'll try implementing that tomorrow.
I'm not sure why the finalisers aren't firing properly, so that might take longer to debug. Hopefully the first hack will get things moving for you though.
Trevor, that sounds great! Thank you very, very much for your help! I really appreciate it and owe you more than one beer. I hope in the near future there will be a chance to finally buy you this beer in person :)
All the best, Wojciech
@tmcdonell, sounds great! Thank you very much!
I'll just add something that I noticed so far, which is that the examples above are actually working as expected.
The reason is that Accelerate does not deallocate the memory immediately, but instead moves it into a holding area (the nursery) so that the chunk of memory can be reused. I found this to give significant performance improvements in practice, but if you need to interact with other programs that also want a piece of the GPU memory, then maybe this would be a problem.
ghci> :set args -ddump-gc
ghci> seq (CUDA.run $ A.map (+1) (use $ fromList (Z:.10000000) [1..10000000] :: Acc (Vector Double))) ()
38.73:gc: initialise default context
38.77:gc: initialise context #0x00007f9363040600
38.77:gc: push context: #0x00007f9363040600
78.69:gc: initialise CUDA state
78.69:gc: initialise memory table
78.69:gc: lookup/not found: Array #6
78.69:gc: malloc/new
78.69:gc: insert: Array #6
78.76:gc: useArrayAsync/malloc: 76 MB @ 1.133 GB/s, gpu: 65.785 ms, cpu: 65.469 ms
78.76:gc: lookup/not found: Array #5
78.76:gc: mallocArray: 76 MB
78.76:gc: malloc/new
78.76:gc: insert: Array #5
78.76:gc: lookup/found: Array #6
78.76:gc: lookup/found: Array #5
78.78:gc: lookup/found: Array #5
78.84:gc: peekArray: 76 MB @ 1.229 GB/s, gpu: 60.622 ms, cpu: 60.751 ms
78.84:gc: pop context: #0x00007f9363040600
()
it :: ()
80.28:gc: finalise/nursery: Array #5
80.28:gc: finalise/nursery: Array #6
When I run this, the last lines (finalise/nursery) appear just after the ghci prompt returns. At that point the two arrays (the input to use and the result of the map) are effectively deallocated, even though the total device memory in use stays the same. If you run the expression again, you will see malloc/nursery instead of malloc/new in the trace.
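For illustration only, here is a minimal sketch of such a nursery: a pool of released device allocations keyed by size, consulted before asking the CUDA allocator for fresh memory. The names and the Int handle are made up and are not the actual accelerate-cuda internals.
import qualified Data.IntMap.Strict as IM
import Data.IORef

type DeviceHandle = Int                         -- stand-in for a real device pointer
type Nursery      = IORef (IM.IntMap [DeviceHandle])

-- Instead of freeing a chunk, stash it in the nursery under its size in bytes.
stash :: Nursery -> Int -> DeviceHandle -> IO ()
stash nrs bytes ptr = modifyIORef' nrs (IM.insertWith (++) bytes [ptr])

-- When allocating, first try to reuse a stashed chunk of exactly that size.
reuse :: Nursery -> Int -> IO (Maybe DeviceHandle)
reuse nrs bytes =
  atomicModifyIORef' nrs $ \m ->
    case IM.lookup bytes m of
      Just (p:ps) -> (IM.insert bytes ps m, Just p)
      _           -> (m, Nothing)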
So, perhaps there is another use case here as well, and we will have unsafeFree as mentioned above, to deallocate arrays from the device even if they are still active on the host (or rather, move them to the nursery as is currently done).
The top-level CUDA module now exports two functions. unsafeFreeArrays will release the GPU memory for the given array(s), even if they are still active on the host. I tested this a little in ghci, and hopefully it works for you as well. This turned out to be less hassle than I anticipated.
The second function is performGC, which clears out the nursery and returns as much GPU memory to the system as possible. You shouldn't need to worry about this unless you are interfacing with other applications that need the GPU.
Oh, I should mention that at the moment this is only using the default context. If you are specifying your own execution contexts (runIn and friends) then we'll need to change that, but let's test first whether or not these changes work for you.
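A hedged usage sketch of the two new exports (spelled unsafeFreeArrays at this point in the thread); their exact types are assumed here, roughly unsafeFreeArrays :: Arrays a => a -> IO () and performGC :: IO (), so treat this as illustrative only:
import Data.Array.Accelerate       as A
import Data.Array.Accelerate.CUDA  as C

main :: IO ()
main = do
  let xs = fromList (Z :. 1000000) [1..1000000] :: Vector Double
      ys = C.run $ A.map (+1) (A.use xs)   -- runs on the GPU when ys is forced
  ys `seq` return ()       -- force the computation, as in the examples above
  C.unsafeFreeArrays ys    -- release ys's device memory even though the host copy is still live
  C.performGC              -- clear the nursery, handing memory back to the CUDA driver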
Thanks a lot! It gives us much more flexibility with memory management! Now we are ready to test if there are no other memory leaks in our application.
Thanks also for the ghci example explaining the nursery!
In the latest commit I renamed unsafeFreeArrays to simply unsafeFree, and added versions to specify the context. I forgot to tag this issue in the commit message though.
Trevor thank you so much for the help! :)
No problem! Sorry it took so long to get a patch to you guys, I hope you manage to get everything working in time (:
@robeverest eventually got the LRU memory manager online, so I think this is fixed.
How do I overcome the following error that my code using accelerate-cuda currently throws? Is there a standard protocol for resolving this, or is it a more serious issue to do with my setup?