feiwang3311 / Lantern

BSD 3-Clause "New" or "Revised" License

Fix GPU out-of-memory errors. #36

Closed · dan-zheng closed this issue 6 years ago

dan-zheng commented 6 years ago

Currently, tensors are not freed.

This is especially problematic on GPU, where memory obtained through explicit calls to cudaMalloc is never released. The MNIST CNN model crashes after two epochs with "CUDA error occurred: out of memory".

We need some mechanism to free tensors. Ideas:

dan-zheng commented 6 years ago

I wonder if other LMS projects have solutions for managing memory/freeing? @TiarkRompf @GSAir

TiarkRompf commented 6 years ago

I'd prefer that we get the arena model working on GPU (no need to walk free lists, etc.; just free everything per epoch). As @GSAir indicated, the issue we need to figure out is probably alignment. Is there any documentation about how cudaMalloc works internally, or about how other projects solve this?

TiarkRompf commented 6 years ago

A simple way to investigate this would be to print out all the addresses returned from cudaMalloc, along with the sizes requested.
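
A minimal sketch of such a logging shim, in the spirit of that suggestion (loggedCudaMalloc is an illustrative name, not an existing Lantern function):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical debugging wrapper: behaves like cudaMalloc but also prints the
// requested size and the address it returns, so allocation granularity and
// alignment can be inspected across a run.
cudaError_t loggedCudaMalloc(void** ptr, size_t nbytes) {
  cudaError_t err = cudaMalloc(ptr, nbytes);
  if (err == cudaSuccess) {
    printf("cudaMalloc: %zu bytes -> %p\n", nbytes, *ptr);
  }
  return err;
}

Routing the generated code's allocations through a shim like this (or a macro over cudaMalloc) would show, for example, whether all returned addresses share the same alignment boundary.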

dan-zheng commented 6 years ago

I don't believe an arena model alone will solve the out-of-memory error, though? We'll still need some freeing mechanism.

EDIT: I missed the "free everything per epoch" part in your reply, sorry. To be precise, we should free everything except the model parameters (weights and biases), and our freeing mechanism must be able to handle that.

TiarkRompf commented 6 years ago

The whole arena is dumped after each epoch, so as long as we can sustain one epoch we won't have OOMs.

TiarkRompf commented 6 years ago

(I'm not even sure right now whether storage is reclaimed per epoch or per minibatch; per minibatch seems smarter.)

dan-zheng commented 6 years ago

@feiwang3311 shared his thoughts on the arena model, which cleared up my confusion:

so here is how memory arena works.

we allocate a big chunk of memory as arena.

we allocate some for parameters (these need to be persistent), then we mark the current bound of used memory.

then we go into the training loop, which allocates more memory for intermediate values, workspaces and whatnot. But these are not persistent. So at the end of each loop, we reset the bound to our mark, and memset the memory in between (memory used for this loop) to 0.

that manages everything allocated by myGpuMalloc, but not if we used cudaMalloc (implicitly or otherwise)

The "memory bound" separating persistent/non-persistent memory is simple and seems robust enough for all our models. I think it's the right direction. We'll need to fix alignment issues and eliminate direct calls to cudaMalloc.

GSAir commented 6 years ago

To align on a power-of-two boundary:

constexpr int N = 4; // align to 1 << N = 16 bytes
void* allocate(size_t nbytes) {
  // Round nbytes up to the next multiple of 16; size_t is unsigned, so >> is safe.
  nbytes = ((nbytes + (1 << N) - 1) >> N) << N;
  ...
}
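
As a quick check of the rounding above: with N = 4, a request of 10 bytes rounds up to 16, 16 stays at 16, and 17 becomes 32. An equivalent formulation is nbytes = (nbytes + 15) & ~(size_t)15.
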
dan-zheng commented 6 years ago

Done in #38.