Lightning-AI / pytorch-lightning


Increase in GPU memory usage with Pytorch-Lightning #1376

Closed · VitorGuizilini closed this issue 4 years ago

VitorGuizilini commented 4 years ago

Over the last week I have been porting my monocular depth estimation code to PyTorch Lightning, and everything is working perfectly. However, my models seem to require more GPU memory than before, to the point where I need to significantly decrease the batch size at training time. These are the Trainer parameters I am using, along with the relevant versions:

FROM nvidia/cuda:10.1-devel-ubuntu18.04
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV PYTORCH_LIGHTNING_VERSION=0.7.1
cfg.arch.gpus = 8
cfg.arch.num_nodes = 1
cfg.arch.num_workers = 8
cfg.arch.distributed_backend = 'ddp'
cfg.arch.amp_level = 'O0'
cfg.arch.precision = 32
cfg.arch.benchmark = True 
cfg.arch.min_epochs = 1
cfg.arch.max_epochs = 50
cfg.arch.checkpoint_callback = False
cfg.arch.callbacks = []
cfg.arch.gradient_clip_val = 0.0
cfg.arch.accumulate_grad_batches = 1
cfg.arch.val_check_interval = 1.0
cfg.arch.check_val_every_n_epoch = 1
cfg.arch.num_sanity_val_steps = 0
cfg.arch.progress_bar_refresh_rate = 1
cfg.arch.fast_dev_run = False
cfg.arch.overfit_pct = 0.0
cfg.arch.train_percent_check = 1.0
cfg.arch.val_percent_check = 1.0
cfg.arch.test_percent_check = 1.0
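
For context, a minimal sketch of how settings like these are typically forwarded to the 0.7.1 Trainer (assuming cfg.arch behaves like a plain namespace; model stands in for the LightningModule, which is defined elsewhere):

import pytorch_lightning as pl

# Sketch: forwarding the cfg.arch values above to the 0.7.x Trainer API.
# 'model' is assumed to be a LightningModule instance defined elsewhere.
trainer = pl.Trainer(
    gpus=cfg.arch.gpus,
    num_nodes=cfg.arch.num_nodes,
    distributed_backend=cfg.arch.distributed_backend,
    amp_level=cfg.arch.amp_level,
    precision=cfg.arch.precision,
    benchmark=cfg.arch.benchmark,
    min_epochs=cfg.arch.min_epochs,
    max_epochs=cfg.arch.max_epochs,
    checkpoint_callback=cfg.arch.checkpoint_callback,
    callbacks=cfg.arch.callbacks,
    gradient_clip_val=cfg.arch.gradient_clip_val,
    accumulate_grad_batches=cfg.arch.accumulate_grad_batches,
    val_check_interval=cfg.arch.val_check_interval,
    check_val_every_n_epoch=cfg.arch.check_val_every_n_epoch,
    num_sanity_val_steps=cfg.arch.num_sanity_val_steps,
    progress_bar_refresh_rate=cfg.arch.progress_bar_refresh_rate,
    fast_dev_run=cfg.arch.fast_dev_run,
    overfit_pct=cfg.arch.overfit_pct,
    train_percent_check=cfg.arch.train_percent_check,
    val_percent_check=cfg.arch.val_percent_check,
    test_percent_check=cfg.arch.test_percent_check,
)
trainer.fit(model)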

Probably because of that, I am having trouble replicating my results. Could you please advise on possible solutions? I will open-source the code as soon as I manage to replicate my current results.

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution! Great first issue!

Borda commented 4 years ago

Hi @vguizilini, could you be more specific about how much more memory is required?

williamFalcon commented 4 years ago

@jeremyjordan can we get that memory profiler? @vguizilini mind trying again from master?

jeremyjordan commented 4 years ago

I thought we already log GPU memory usage?

https://pytorch-lightning.readthedocs.io/en/0.7.1/debugging.html#log-gpu-usage

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/logging.py#L55
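
For example, enabling it looks roughly like this (just a sketch; in 0.7.x the log_gpu_memory flag accepts 'min_max' or 'all'):

import pytorch_lightning as pl

# Sketch: turn on the built-in GPU memory logging referenced above.
# 'min_max' logs only the min/max used memory; 'all' logs every GPU.
trainer = pl.Trainer(
    gpus=8,
    distributed_backend='ddp',
    log_gpu_memory='all',
)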

VitorGuizilini commented 4 years ago

Memory usage for my original implementation (Horovod for distributed training):

[screenshot: GPU memory usage, Horovod run]

Memory usage for my PyTorch Lightning implementation (ddp):

[screenshot: GPU memory usage, PyTorch Lightning ddp run]

I'm loading the same configuration and the same networks in both. I'm still learning to use PyTorch Lightning; what should I profile next?
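
A simple next step could be to print the CUDA allocator statistics around the forward pass in both implementations and compare (a sketch with a hypothetical log_cuda_memory helper, not something already in my code):

import torch

def log_cuda_memory(tag):
    # Hypothetical helper: memory currently held by tensors vs. the peak
    # observed so far on the current device, reported in MiB.
    allocated = torch.cuda.memory_allocated() / 2 ** 20
    peak = torch.cuda.max_memory_allocated() / 2 ** 20
    print(f'{tag}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB')

# e.g. inside LightningModule.training_step:
#     log_cuda_memory('before forward')
#     loss = self.forward(batch)   # or however the step computes the loss
#     log_cuda_memory('after forward')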

jeremyjordan commented 4 years ago

@neggert or @williamFalcon, any ideas why GPU memory usage isn't consistent across the nodes?

VitorGuizilini commented 4 years ago

Following up on this issue, is there anything else I should provide to facilitate debugging?