Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

GPU RAM usage increases until overflow when using PSNR and SSIM #2597

Open ouioui199 opened 3 months ago

ouioui199 commented 3 months ago

🐛 Bug

Hello all,

I'm implementing CycleGAN with Lightning. I use PSNR and SSIM from torchmetrics for evaluation. During training, my GPU RAM usage increases non-stop until it overflows and the whole training shuts down. This might be similar to https://github.com/Lightning-AI/torchmetrics/issues/2481

To Reproduce

Add this to the `__init__` method of the model class:

```python
self.train_metrics = MetricCollection({
    "PSNR": PeakSignalNoiseRatio(),
    "SSIM": StructuralSimilarityIndexMeasure(),
})
self.valid_metrics = self.train_metrics.clone(prefix='val_')
```

In the `training_step` method: `train_metrics = self.train_metrics(fake, real)`

In the `validation_step` method: `valid_metrics = self.valid_metrics(fake, real)`

Environment

Proposed fix

I tried to debug the code. When inspecting `train_metrics`, I get this:

```
{'PSNR': tensor(10.5713, device='cuda:0', grad_fn=<SqueezeBackward0>), 'SSIM': tensor(0.0373, device='cuda:0', grad_fn=<SqueezeBackward0>)}
```

which is weird because metrics aren't supposed to be attached to the computational graph. When inspecting `valid_metrics`, I don't see any `grad_fn` (validation runs under `no_grad`). Guessing that's the issue, I called `fake.detach()` when computing `train_metrics`. Now training is stable and the GPU memory no longer grows without bound.

github-actions[bot] commented 3 months ago

Hi! Thanks for your contribution, great first issue!

Borda commented 3 weeks ago

@ouioui199 Looking at your example (could you please share the full sample code?), I'm wondering: do you also call `compute` in the epoch-end hook?