Open nyngwang opened 2 years ago
PyTorch caches CUDA memory to avoid the cost of repeated memory allocation; you can find more information here:
https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
In your case, the reserved bytes should be the peak memory usage before checkpointing, while the active bytes should be the current memory usage after checkpointing.
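For example, a quick sketch (not from this issue) of how the two numbers diverge: freeing a tensor lowers the allocated/active count immediately, while the reserved count stays put because the allocator keeps the block cached for reuse:

```python
import torch

x = torch.randn(1024, 1024, device='cuda')   # ~4 MB held by a live tensor
y = x @ x                                     # another ~4 MB

print('allocated:', torch.cuda.memory_allocated())  # bytes held by live tensors
print('reserved: ', torch.cuda.memory_reserved())   # bytes held by the caching allocator

del y
print('allocated after del:', torch.cuda.memory_allocated())  # drops by ~4 MB
print('reserved after del: ', torch.cuda.memory_reserved())   # usually unchanged
```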
```
## VGG.forward

active_bytes reserved_bytes line  code
         all            all
        peak           peak
       5.71G         10.80G   50  @profile
                              51  def forward(self, x):
       3.86G          8.77G   52      out = self.features(x)
       2.19G          8.77G   53      out = self.classifier(out)
       2.19G          8.77G   54      return out
```
@Stonesjtu Could you help me re-check the code above: I checkpointed the `self.features` internally (which is itself an `nn.Module` with an `nn.Sequential` inside) but added the `@profile` decorator on the `forward` method (as above) of the outer class that uses the features (conv2d layers).
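A minimal sketch of the kind of setup meant here (the class names and `num_segments` are illustrative, not the actual code): `self.features` checkpoints its `nn.Sequential` internally via `checkpoint_sequential`, while `@profile` sits on the outer `forward`:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential
from pytorch_memlab import profile


class CheckpointedFeatures(nn.Module):
    """Wraps an nn.Sequential and checkpoints it internally."""

    def __init__(self, layers: nn.Sequential, num_segments: int):
        super().__init__()
        self.layers = layers
        self.num_segments = num_segments  # how the features are partitioned

    def forward(self, x):
        # Activations inside each segment are recomputed during backward,
        # so only the segment boundaries stay alive after the forward pass.
        return checkpoint_sequential(self.layers, self.num_segments, x)


class Net(nn.Module):
    def __init__(self, features: nn.Module, classifier: nn.Module):
        super().__init__()
        self.features = features
        self.classifier = classifier

    @profile  # pytorch_memlab prints a line-by-line report for this method
    def forward(self, x):
        out = self.features(x)
        out = self.classifier(out)
        return out
```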
Q1: Do you know how to explain this: If I keep the same batch-size, but change how I partition the `self.features` internally (into checkpointed segments), the `active_bytes` of the next non-checkpointed line `self.classifier(out)` also changed.
I also have two additional lines printed before the stats above:

```
Max CUDA memory allocated on forward: 1.22G
Max CUDA memory allocated on backward: 5.71G
```

which are generated by the code appended below.
Q2: So how to explain the `reserved_bytes`, i.e. 10.80G, 8.77G, in the stats generated by pytorch_memlab above? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?
```python
# compute output
if i < 1:
    torch.cuda.reset_peak_memory_stats()
output = model(images)
loss = criterion(output, target)
if i < 1:
    print('Max CUDA memory allocated on forward: ',
          utils.readable_size(torch.cuda.max_memory_allocated()))

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.detach().item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
if i < 1:
    torch.cuda.reset_peak_memory_stats()
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i < 1:
    print('Max CUDA memory allocated on backward: ',
          utils.readable_size(torch.cuda.max_memory_allocated()))
```
> Q1: Do you know how to explain this: If I keep the same batch-size, but change how I partition the `self.features` internally (into checkpointed segments), the `active_bytes` of the next non-checkpointed line `self.classifier(out)` also changed.
The column (or metric) `active_bytes.all.peak` is actually the peak active bytes during the execution of this line; it's an accumulated value which depends on the active bytes before the execution of this line.
e.g. if you have 4 `Linear` layers in an `nn.Sequential`, checkpointing after a later layer (e.g. `layers[3]`) would consume less active bytes than checkpointing after `layers[0]`.
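A rough sketch of this effect (layer sizes and segment counts below are made up): varying how the same `nn.Sequential` is partitioned into checkpointed segments changes the peak memory that the rest of the iteration inherits:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential


def peak_bytes(num_segments: int) -> int:
    """Peak allocated bytes for one forward/backward with a given partitioning."""
    layers = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).cuda()
    x = torch.randn(256, 4096, device='cuda', requires_grad=True)
    torch.cuda.reset_peak_memory_stats()
    out = checkpoint_sequential(layers, num_segments, x)
    out.sum().backward()
    return torch.cuda.max_memory_allocated()


for segments in (1, 2, 4):
    print(f'{segments} segment(s): peak {peak_bytes(segments)} bytes')
```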
> Q2: So how to explain the `reserved_bytes`, i.e. 10.80G, 8.77G, in the stats generated by pytorch_memlab above? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?
According to the PyTorch documentation:

> PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in `nvidia-smi`.
Actually it does need the cached memory at a certain point of execution, but at the time of your `torch.cuda.max_memory_allocated` call it doesn't need so much memory space. You can try `torch.cuda.empty_cache()` before getting `torch.cuda.max_memory_allocated`.
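A small sketch of that behaviour (tensor size picked arbitrarily): after a tensor is freed its block stays in the cache, so `memory_reserved()` remains high until `torch.cuda.empty_cache()` hands the cached blocks back to the driver:

```python
import torch

big = torch.randn(4096, 4096, device='cuda')   # ~64 MB tensor
del big                                         # freed, but the block stays cached

print('reserved before empty_cache:', torch.cuda.memory_reserved())
torch.cuda.empty_cache()                        # release cached blocks to the driver
print('reserved after empty_cache: ', torch.cuda.memory_reserved())
```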
I need to show that a technique called gradient checkpointing can really save GPU memory during backward propagation. In the result there are two columns on the left showing `active_bytes` and `reserved_bytes`. In my testing, while the active bytes read 3.83G, the reserved bytes read 9.35G. So why does PyTorch still reserve that much GPU memory?