The error simply indicates that you've run out of memory on your GPU.
At the moment, there isn't a whole lot you can do about this. We rely on GHC's garbage collector to tell us when we can free up arrays on the device, so if your code is keeping large arrays in memory for long periods of time, they will remain active on the GPU as well. We could provide our own GPU-aware garbage collector, as GPU memories are typically much smaller than the host CPU memory, but currently there is no such thing...
Sometimes setting the maximum heap size to match the size of your GPU memory can encourage GHC to deallocate stuff sooner. I forget what the RTS flag for that is though.
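For what it's worth, the RTS flag in question is -M<size>, which caps the maximum heap size. A rough example, assuming a program built with -rtsopts and a GPU with around 4 GB of memory (Main.hs and the 4G value are just placeholders):
$ ghc -O2 -rtsopts Main.hs
$ ./Main +RTS -M4G -RTS
The same option can usually be passed to ghci on its command line as well (ghci +RTS -M4G -RTS).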
If your GPU is shared with the OS (GUI, etc), sometimes closing other applications can free up a tiny bit of memory (or depending on your system, switch to the integrated GPU and back to discrete GPU will force the OS to clear out a bunch of memory).
Is there any ETA for the GPU-aware garbage collector? I'm processing images in an environment similar to ghci and each computation uses up more and more memory, which gets released only when the ghci session ends.
You can observe similar behaviour using nvidia-smi and running the following example in ghci:
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as C
let a = [1..1000000] :: [Double]
-- repeat this part (the memory usage on the GPU keeps on increasing)
let a' = use $ fromList (Z:.1000000) a :: Acc (Vector Double)
seq (C.run $ A.map (+1) a') ()
While in this simple case the problem doesn't seem that serious, when processing 4K images (especially convolution) the memory usage can increase by a few hundred megabytes per run. After just a few of those, even high-end GPUs might run out of memory.
Is there any chance for the Accelerate GPU GC routine to be implemented any time soon? If not, is it possible to add functionality allowing me to manually free all the memory used so far? This is crucial for me, since with things as they are the whole application is somewhat unusable and would require a major rewrite (if that's even possible).
I currently have no ETA for this issue, sorry.
If the source array on the Haskell side disappears, then doing a performGC should force the associated device array to be deallocated. But if those arrays are still around on the host, even if you know you won't need them anymore, then unfortunately there is currently no way to purge them from the GPU.
@zgredzik The distinction that @tmcdonell makes is very important (and it is not quite clear to me from your question which case you are looking at). Considering the arrays that lead to memory exhaustion in your application, are these arrays (1) truly garbage (i.e., they will never be used again) or are they (2) only temporarily unused (but they'll be processed on the GPU again later)?
If you are looking at Case (1), then performGC or maybe some other tricks should help. However, if you are looking at Case (2), then the situation is more complicated and changes to Accelerate would be required to solve your problem.
OK, but when we modify @zgredzik's example a bit, we can make sure that the arrays lose their scope and should be GC'd. Yet that doesn't happen, as the following example shows:
$ ghci
> -- Graphic card memory usage: 156MiB
> import Data.Array.Accelerate as A
> import Data.Array.Accelerate.CUDA as C
> import System.Mem
>
> seq (C.run $ A.map (+1) (use $ fromList (Z:.10000000) [1..10000000] :: Acc (Vector Double))) ()
()
> -- Graphic card memory usage: 335MiB
> performGC
> -- Graphic card memory usage: 335MiB (still)
> :show bindings
it :: () = ()
> -- no array bindings remain, but there are still objects stuck on the GPU
Is there any way to release this memory? This is REALLY crucial to us.
@mchakravarty they are truly garbage; however, the device memory is not released even after performGC.
The following example shows that when running performGC the host memory is released (you can observe a drastic drop in memory usage with the ekg monitor), but the device memory usage stays the same (checked with nvidia-smi -l 1).
:set -XOverloadedStrings
import System.Mem as M
import System.Remote.Monitoring as RM
forkServer "localhost" 8000
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as C
C.run $ A.maximum $ A.map (+1) $ A.use (fromList (Z:.10000000) [1..10000000] :: Vector Double)
-- here the memory usage increases
performGC
-- and here it drops only on the host
We are 100% certain that we will not be using the same data that was already uploaded to the device ever again after finishing the computations. Is adding a function that, when called, would free the device memory even an option?
I have a couple of spare cycles now so will try and look into it.
As a thought experiment for an explicit free, the memory manager just has to be told that when the finaliser later runs but the memory has already been (explicitly) removed, that is not an error.
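A minimal sketch of that explicit-free/finaliser interaction, using a made-up DeviceArray record rather than the real Accelerate memory manager (all names here are hypothetical):
import Data.IORef
import Control.Monad (unless)

-- Hypothetical stand-in for a device allocation; not the real Accelerate type.
data DeviceArray = DeviceArray
  { freed   :: IORef Bool   -- has this allocation already been released?
  , release :: IO ()        -- the actual deallocation action
  }

-- Explicit free: release the memory now and remember that we did so.
explicitFree :: DeviceArray -> IO ()
explicitFree arr = do
  already <- atomicModifyIORef' (freed arr) (\b -> (True, b))
  unless already (release arr)

-- Finaliser: if the array was already freed explicitly, this is a no-op
-- rather than an error; otherwise it releases the memory as usual.
finaliser :: DeviceArray -> IO ()
finaliser = explicitFree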
@zgredzik I find it quite strange that the GPU memory isn't freed even after the host arrays have been deallocated.
@tmcdonell Do you think it is the finalisers getting delayed? They are not guaranteed to run right away, but there shouldn't be a long delay either. Is there any easy way in which we can check how much time passes between host array deallocation and the execution of the finaliser?
If finalisers are really delayed, one possible hack would be to keep a set of weak pointers to all host arrays, together with the address of the corresponding device array. The weak-pointer API lets us check whether the referenced host array has become garbage, which would indicate that the device array needs to be released (if it hasn't been already).
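A minimal sketch of that weak-pointer bookkeeping, with made-up names (the Int handle and the freeDevice callback are stand-ins, not Accelerate internals):
import System.Mem.Weak (Weak, mkWeakPtr, deRefWeak)

-- Pair a weak pointer to the host array with a handle for its device copy.
data Tracked a = Tracked
  { hostRef   :: Weak a   -- weak pointer to the host-side array
  , deviceMem :: Int      -- stand-in for the device allocation handle
  }

-- Register a host array together with the handle of its device copy.
track :: a -> Int -> IO (Tracked a)
track arr handle = do
  w <- mkWeakPtr arr Nothing
  return (Tracked w handle)

-- Walk the tracked set; free the device copy of every host array that has
-- already become garbage, and keep tracking the ones that are still alive.
sweep :: (Int -> IO ()) -> [Tracked a] -> IO [Tracked a]
sweep freeDevice = go
  where
    go []     = return []
    go (t:ts) = do
      alive <- deRefWeak (hostRef t)
      case alive of
        Nothing -> freeDevice (deviceMem t) >> go ts
        Just _  -> fmap (t :) (go ts)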
@mchakravarty I am not sure if they are getting delayed, but certainly the strategy you mention would give an Accelerate GC some more control over what happens and is a good place to start.
Is there anything new on this topic? Is there ANY way we can release this memory?
Hi @remdezx @zgredzik. The free cycles I thought I had got taken up by getting sick and then a conference, but I still have this next on the TODO list.
Hello Trevor, Manuel, Rob! :)
I'm writing to you guys because I'm a little worried about Konrad and Piotrek (zgredzik and remdezx): they are fighting really hard with this problem, and we are just about to release the software and are facing deadlines here. I would love to ask you if there is any possibility to just set a date when this issue will be fixed? I know this is sometimes hard to estimate and sometimes even hard to predict, but I'm also scratching my head about what to do in this situation. You know, right now when we are processing images, after a few operations the card memory is completely filled and the software stops working :( Additionally, at the beginning of next week we've got a presentation at SIGGRAPH, and this issue is just a killer for us in this situation :(
I would be very, very thankful if you find a solution and help us with this issue. Thank you once again, Wojtek
We talked about having some very low-level unsafeFree operation, so that you can explicitly delete arrays from device memory. I'll try implementing that tomorrow.
I'm not sure why the finalisers aren't firing properly, so that might take longer to debug. Hopefully the first hack will get things moving for you though.
Trevor, that sounds great! Thank you very, very much for your help! I really appreciate it and owe you more than one beer. I hope in the near future there will be a chance to finally buy you this beer in person :)
All the best, Wojciech
@tmcdonell, sounds great! Thank you very much!
I'll just add something that I noticed so far, which is that the examples above are actually working as expected.
The reason is that Accelerate does not deallocate the memory immediately, but instead moves it into a holding area (the nursery) so that the chunk of memory can be reused. I found this to give significant performance improvements in practice, but if you need to interact with other programs that also want a piece of the GPU memory, then maybe this would be a problem.
ghci> :set args -ddump-gc
ghci> seq (CUDA.run $ A.map (+1) (use $ fromList (Z:.10000000) [1..10000000] :: Acc (Vector Double))) ()
38.73:gc: initialise default context
38.77:gc: initialise context #0x00007f9363040600
38.77:gc: push context: #0x00007f9363040600
78.69:gc: initialise CUDA state
78.69:gc: initialise memory table
78.69:gc: lookup/not found: Array #6
78.69:gc: malloc/new
78.69:gc: insert: Array #6
78.76:gc: useArrayAsync/malloc: 76 MB @ 1.133 GB/s, gpu: 65.785 ms, cpu: 65.469 ms
78.76:gc: lookup/not found: Array #5
78.76:gc: mallocArray: 76 MB
78.76:gc: malloc/new
78.76:gc: insert: Array #5
78.76:gc: lookup/found: Array #6
78.76:gc: lookup/found: Array #5
78.78:gc: lookup/found: Array #5
78.84:gc: peekArray: 76 MB @ 1.229 GB/s, gpu: 60.622 ms, cpu: 60.751 ms
78.84:gc: pop context: #0x00007f9363040600
()
it :: ()
80.28:gc: finalise/nursery: Array #5
80.28:gc: finalise/nursery: Array #6
When I run this, the last lines (finalise/nursery) appear just after the ghci prompt returns. At that point the two arrays (the input to use and the result of the map) are effectively deallocated, even though the total device memory in use stays the same. If you run the expression again, you will see malloc/nursery instead of malloc/new in the trace.
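For illustration only, here is a minimal sketch of such a nursery: a pool of released device allocations keyed by size, consulted before asking the CUDA allocator for fresh memory. The names and the Int handle are made up and are not the actual accelerate-cuda internals.
import qualified Data.IntMap.Strict as IM
import Data.IORef

type DeviceHandle = Int                         -- stand-in for a real device pointer
type Nursery      = IORef (IM.IntMap [DeviceHandle])

-- Instead of freeing a chunk, stash it in the nursery under its size in bytes.
stash :: Nursery -> Int -> DeviceHandle -> IO ()
stash nrs bytes ptr = modifyIORef' nrs (IM.insertWith (++) bytes [ptr])

-- When allocating, first try to reuse a stashed chunk of exactly that size.
reuse :: Nursery -> Int -> IO (Maybe DeviceHandle)
reuse nrs bytes =
  atomicModifyIORef' nrs $ \m ->
    case IM.lookup bytes m of
      Just (p:ps) -> (IM.insert bytes ps m, Just p)
      _           -> (m, Nothing)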
So, perhaps there is another use case here as well, and we will have unsafeFree as mentioned above, to deallocate arrays from the device even if they are still active on the host (or rather, move them to the nursery as is currently done).
The top-level CUDA module now exports two functions. unsafeFreeArrays will release the GPU memory for the given array(s), even if they are still active on the host. I tested this a little in ghci, and hopefully it works for you as well. This turned out to be less hassle than I anticipated.
The second function is performGC, which clears out the nursery and returns as much GPU memory to the system as possible. You shouldn't need to worry about this unless you are interfacing with other applications that need the GPU.
Oh, I should mention that at the moment this is only using the default context. If you are specifying your own execution contexts (runIn and friends) then we'll need to change that, but let's test first whether or not these changes work for you.
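A hedged usage sketch of the two new exports (spelled unsafeFreeArrays at this point in the thread); their exact types are assumed here, roughly unsafeFreeArrays :: Arrays a => a -> IO () and performGC :: IO (), so treat this as illustrative only:
import Data.Array.Accelerate       as A
import Data.Array.Accelerate.CUDA  as C

main :: IO ()
main = do
  let xs = fromList (Z :. 1000000) [1..1000000] :: Vector Double
      ys = C.run $ A.map (+1) (A.use xs)   -- runs on the GPU when ys is forced
  ys `seq` return ()       -- force the computation, as in the examples above
  C.unsafeFreeArrays ys    -- release ys's device memory even though the host copy is still live
  C.performGC              -- clear the nursery, handing memory back to the CUDA driver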
Thanks a lot! It gives us much more flexibility with memory management! Now we are ready to test if there are no other memory leaks in our application.
Thanks also for the ghci example explaining the nursery!
In the latest commit I renamed unsafeFreeArrays to simply unsafeFree, and added versions to specify the context. I forgot to tag this issue in the commit message though.
Trevor thank you so much for the help! :)
No problem! Sorry it took so long to get a patch to you guys, I hope you manage to get everything working in time (:
@robeverest eventually got the LRU memory manager online, so I think this is fixed.
How do I overcome the following error that my code using accelerate-cuda currently throws? Is there a standard protocol for resolving this, or is it a more serious issue to do with my setup?