🐛 Bug
Hello all,
I'm implementing CycleGAN with Lightning, and I use PSNR and SSIM from torchmetrics for evaluation. During training, GPU memory grows continuously until it overflows and the whole training shuts down. This might be similar to https://github.com/Lightning-AI/torchmetrics/issues/2481
To Reproduce
Add this to the `__init__` method of the model class:
In training_step method:
train_metrics = self.train_metrics(fake, real)
In validation_step method:
valid_metrics = self.valid_metrics(fake, real)
Environment
Easy fix proposal
I tried to debug the code. When verifying train_metrics, I get this:
which is weird, because metrics aren't supposed to be attached to the computational graph. When verifying valid_metrics, I don't see a grad_fn. Guessing that's the issue, I tried calling fake.detach() when computing train_metrics. Now training is stable and GPU memory no longer grows without bound.
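A minimal sketch of that fix, assuming `fake` is a generator output that still carries a grad_fn (the standalone function form here is hypothetical; in the real code this happens inside `training_step`):

```python
import torch

def training_step_metrics(train_metrics, fake, real):
    # Without .detach(), the metric's internal state can keep references into
    # the autograd graph of every batch, so GPU memory grows each step.
    # Detaching breaks that link before the metric update.
    return train_metrics(fake.detach(), real)
```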