Closed by dan-zheng 6 years ago
I wonder if other LMS projects have solutions for managing memory/freeing? @TiarkRompf @GSAir
I'd prefer that we get the arena model working on GPU (no need to walk free lists etc., just free everything per epoch). As @GSAir indicated, the issue we need to figure out is probably alignment. Is there any documentation about how cudaMalloc works internally, or how other projects solve this?
A simple way to investigate this would be to just print out all the addresses returned from cudaMalloc, along with the sizes requested.
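A minimal sketch of that experiment, assuming plain CUDA runtime calls (the wrapper name `loggingCudaMalloc` is hypothetical, not part of our code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: log the pointer returned and the size requested for
// each allocation, to see what alignment cudaMalloc gives us in practice.
cudaError_t loggingCudaMalloc(void** ptr, size_t nbytes) {
  cudaError_t err = cudaMalloc(ptr, nbytes);
  if (err == cudaSuccess) {
    printf("cudaMalloc(%zu) -> %p\n", nbytes, *ptr);
  }
  return err;
}

int main() {
  size_t sizes[] = {1, 7, 100, 4096};
  for (size_t n : sizes) {
    void* p = nullptr;
    loggingCudaMalloc(&p, n);  // intentionally never freed; we only probe addresses
  }
  return 0;
}
```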
I don't believe an arena model will solve the out-of-memory error though? We'll still need some freeing mechanism.
EDIT: I missed the "free everything per epoch" part in your reply, sorry. To be precise, we need to free everything except the model parameters (weights and biases); our freeing mechanism must be able to handle that.
The whole arena is dumped after each epoch, so as long as we can sustain one epoch we won't have OOMs.
(I'm not even sure right now if storage is reclaimed per epoch or per minibatch - per minibatch seems smarter)
@feiwang3311 shared his thoughts on the arena model, which cleared up my confusion:
> so here is how the memory arena works.
> we allocate a big chunk of memory as the arena.
> we allocate some of it for parameters (these need to be persistent), then we mark the current bound of used memory.
> then we go into the training loop, which allocates more memory for intermediate values, workspaces, and whatnot. But these are not persistent, so at the end of each loop we reset the bound to our mark and memset the memory in between (the memory used for this loop) to 0.
> that manages everything allocated by myGpuMalloc, but not anything we allocate with cudaMalloc (implicitly or otherwise).
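A rough sketch of that scheme, assuming one big cudaMalloc-backed chunk and a bump pointer; names like `arenaInit`, `arenaSetMark`, and `arenaReset` are illustrative, not the actual generated code:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative bump-pointer arena, not the real implementation.
static char*  arenaBase   = nullptr;  // start of one big cudaMalloc'd chunk
static size_t arenaSize   = 0;
static size_t arenaOffset = 0;        // bump pointer (bytes used so far)
static size_t arenaMark   = 0;        // bound recorded after parameter allocation

void arenaInit(size_t nbytes) {
  cudaMalloc((void**)&arenaBase, nbytes);
  arenaSize = nbytes;
}

void* myGpuMalloc(size_t nbytes) {
  nbytes = (nbytes + 15) & ~(size_t)15;  // round up to a 16-byte boundary
  void* p = arenaBase + arenaOffset;
  arenaOffset += nbytes;                 // a real version must check arenaOffset <= arenaSize
  return p;
}

// Call once after allocating parameters (weights/biases): everything below
// this mark is persistent; everything above it is per-iteration scratch.
void arenaSetMark() { arenaMark = arenaOffset; }

// Call at the end of each epoch (or minibatch): drop the scratch region and
// zero it so stale values cannot leak into the next iteration.
void arenaReset() {
  cudaMemset(arenaBase + arenaMark, 0, arenaOffset - arenaMark);
  arenaOffset = arenaMark;
}
```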
The "memory bound" separating persistent/non-persistent memory is simple and seems robust enough for all our models. I think it's the right direction. We'll need to fix alignment issues and eliminate direct calls to cudaMalloc
.
To align on a power-of-two boundary:

```cpp
constexpr int N = 4; // align to 1 << N = 16 bytes
void* allocate(size_t nbytes) {
  nbytes = ((nbytes + (1 << N) - 1) >> N) << N; // size_t is unsigned, so >> is safe
  ...
}
```
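For example, with N = 4 a request of 5 bytes is rounded up to 16, and one of 17 bytes up to 32. If cuDNN workspaces turn out to need a coarser boundary, we can simply raise N; cudaMalloc itself is documented to return allocations aligned to at least 256 bytes (N = 8), so padding to that would only waste a little space.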
Done in #38.
Currently, tensors are not freed. This is especially problematic on GPU, where explicit calls to cudaMalloc leak memory. The MNIST CNN model crashes after two epochs due to `CUDA error occurred: out of memory`. We need some mechanism to free tensors. Ideas:
- Use explicit scopes, e.g. `{ cudnnTensorDescriptor_t x_desc; ... }`. At the end of these scopes, it is valid to free all descriptors/tensors initialized within the scope.
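One way to realize that idea is a small RAII guard that records everything allocated inside a scope and frees it on exit. This is only a sketch (the `GpuScope` class is hypothetical, and cuDNN descriptors would need an analogous create/destroy pairing):

```cpp
#include <vector>
#include <cuda_runtime.h>

// Hypothetical scope guard: every allocation routed through it is freed
// when the guard object goes out of scope.
class GpuScope {
  std::vector<void*> ptrs_;
public:
  void* alloc(size_t nbytes) {
    void* p = nullptr;
    cudaMalloc(&p, nbytes);
    ptrs_.push_back(p);
    return p;
  }
  ~GpuScope() {
    for (void* p : ptrs_) cudaFree(p);
  }
};

// Usage: all tensors allocated for one minibatch die at the closing brace.
void trainStep() {
  GpuScope scope;
  float* activations = (float*)scope.alloc(1024 * sizeof(float));
  // ... forward/backward pass using `activations` ...
}  // activations freed here
```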