How to have a stable GPU memory while being performant?

ByzanTine commented 5 years ago

I have just started training neural networks for a few days on Julia, so bear with me for trivial mistakes.

I am trying to train a ResNet18 on cifar10 with Flux/Julia. Here is a snippet of my code:

x = rand(32, 32, 3, 128) |> gpu
y = rand(10, 128) |> gpu

model = resnet18()
loss(x, y) = crossentropy(model(x), y)

for _ in 1:500
  loss_item = loss(x, y) 
  back!(loss_item)
  # Whatever update
  # GC.gc()
end

This basically creates random x, y and compute forward and backward passes (I didn't use train!() because I do need to record training loss).

The runnable code that didn't call manual GC is res_auto_gc.jl.

And the code that calls GC.gc() manually is manual_gc.jl.

(I used 500 as iteration number because 500 * 128 is about the dataset size of cifar10)

The one that has GC.gc() commented out have the time footprint like Auto GC 92.990018 seconds (77.52 M allocations: 3.858 GiB, 2.80% GC time) The maximum GPU memory usage is as high as 12G in a K80.

The one that turned on GC.gc() has the footprint like Manual GC 125.875518 seconds (77.12 M allocations: 3.853 GiB, 59.20% GC time) The maximum GPU memory usage is about 1200M in a K80.

Comparing our Pytorch benchmark, the time is about 60s (time is not directly comparable as there is probably bootstrap overhead) and 1700M in a K80 (one epoch).

Is there anything I can improve on my code so that the memory consumption can be stable while I can have relatively similar speed behavior as common frameworks?

(I assume this is an inherent design choice that the underlying auto diff package didn't do referencing counting as Pytorch did?)

MikeInnes commented 5 years ago

Is the memory overhead actually causing a problem, or is it just concerning? Basically our current allocator is quite greedy and like other systems will reserve a large block of memory even if it's not using all of it at once. So it doesn't necessarily impact the batch sizes you can achieve, just shows up on nvidia-smi and such.

ByzanTine commented 5 years ago

Batch size is not quite our concern in this case. The thing is, say when we use ResNet18 + cifar10, we have the expectation that one training run should take 2000M of the GPU, so on a typical GPU with 10G mem, we can run 5 jobs. Now, I assume we couldn't.

(A potentially helpful context, we are benchmarking NeuralODE vs ResNet. The major selling point of NeuralODE is memory consumption, so if the memory consumption is unstable, it's very hard for us to give fair comparisons.)

Still, I am trying to understand what's the problem here. By looking at the GPU usage pattern, it seems to me after each batch of training (forward + backward) passes, flux/flux's autodiff doesn't release any memory? I am trying to understand when does it release memory.

Based on what I read from pytorch autodiff paper, I thought they intentionally free intermedia results while executing. Has Flux/Flux's autodiff library considered such feature when being designed?

MikeInnes commented 5 years ago

Yeah, Flux does release memory during the backwards pass, but unfortunately as Julia isn't refcounted it's not guaranteed that the memory will be immediately made available for reuse. We are hoping to have more tools for this kind of memory debugging though.

ByzanTine commented 5 years ago

I guess then my question would be: Is it worthwhile to reinvent some aspects of Flux to gain ref-counting? Or it's probably more about an auto diff library?

ToucheSir commented 3 years ago

Copying my comment from https://github.com/FluxML/Flux.jl/issues/828, this would benefit from testing with the latest version of CUDA. Otherwise I don't think there's much actionable stuff to be had on the Flux side of things.

CarloLucibello commented 6 months ago

closing as very old

FluxML / Flux.jl

How to have a stable GPU memory while being performant? #780