dingo-gw / dingo

Dingo: Deep inference for gravitational-wave observations

Memory leaks during training #60

Open stephengreen opened 2 years ago

stephengreen commented 2 years ago

Using nvidia-smi -l 1 to monitor the GPU during training, I notice that the GPU RAM usage slowly increases during each epoch (at least during the first training run). This adds up to ~ 100 MB / epoch.

Also, when loading for the first time from a checkpoint, the GPU RAM usage is higher (by ~ 500 MB) than before.
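
For concreteness, a minimal sketch (a hypothetical helper, not part of dingo) of how this could be logged from inside the training loop instead of polling nvidia-smi:

```python
# Hypothetical per-epoch memory logging helper; not part of dingo.
import psutil
import torch


def log_memory(epoch: int) -> None:
    """Print the CPU RSS of this process and the GPU memory held by PyTorch."""
    cpu_rss_mb = psutil.Process().memory_info().rss / 2**20
    msg = f"epoch {epoch}: cpu_rss={cpu_rss_mb:.0f} MB"
    if torch.cuda.is_available():
        gpu_alloc_mb = torch.cuda.memory_allocated() / 2**20
        gpu_reserved_mb = torch.cuda.memory_reserved() / 2**20
        msg += f", gpu_allocated={gpu_alloc_mb:.0f} MB, gpu_reserved={gpu_reserved_mb:.0f} MB"
    print(msg)
```

Calling this once per epoch would show whether the growth is in memory PyTorch actually allocates or only in what the caching allocator reserves.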

stephengreen commented 2 years ago

Update: The GPU RAM usage does not seem to increase progressively after loading from a checkpoint.

stephengreen commented 2 years ago

After running for longer, the GPU RAM usage keeps growing, as does the CPU RAM usage of the main process. Are you seeing the same @max-dax @jonaswildberger @kauii8school?

stephengreen commented 2 years ago

I wonder if it could be this... https://github.com/pytorch/pytorch/issues/13246
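
That issue is about datasets holding many small Python objects (lists/dicts of samples): with num_workers > 0, refcount updates in the workers trigger copy-on-write of the pages shared with the parent process, so resident memory grows even though nothing is truly leaked. Roughly, the pattern looks like this (a sketch, not dingo's actual dataset):

```python
from torch.utils.data import Dataset


class ListBackedDataset(Dataset):
    """Pattern that can look like a leak: a large Python list of per-sample objects.

    With num_workers > 0, every access from a worker touches object refcounts,
    so copy-on-write pages shared with the parent are gradually duplicated and
    the workers' resident memory keeps growing.
    """

    def __init__(self, samples):
        self.samples = samples  # e.g. a list of per-sample dicts

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```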

max-dax commented 2 years ago

Yeah, it could be. I remember coming across this when I was looking for the memory leak in the very beginning (which is why I was pushing against dicts and lists in the dingo dataloader). There may still be an old e-mail thread about this.

Which lists/dicts could it be?
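
If that is the cause, the usual mitigation is to back the dataset with a few large numpy arrays rather than many per-sample Python objects, so there is almost nothing for the workers' refcounting to copy-on-write. A sketch under that assumption (class and field names are illustrative, not dingo's):

```python
import numpy as np
from torch.utils.data import Dataset


class ArrayBackedDataset(Dataset):
    """Same data, but held in one contiguous numpy array per field."""

    def __init__(self, waveforms: np.ndarray, parameters: np.ndarray):
        # Two large arrays instead of a list of per-sample dicts; indexing
        # them does not touch many Python object refcounts.
        self.waveforms = waveforms
        self.parameters = parameters

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        return self.waveforms[idx], self.parameters[idx]
```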

jonaswildberger commented 2 years ago

Based on my WandB runs, I did not encounter this problem; there, the GPU memory usage seems to remain constant.

stephengreen commented 2 years ago

Hm, okay. I will check again next time I train.

stephengreen commented 4 months ago

I think this is still an issue, but with CPU RAM. Will look into it again.
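
If it is CPU-side, one way to narrow it down would be to take tracemalloc snapshots a few epochs apart and diff them (a sketch; num_epochs and train_epoch are placeholders, and tracemalloc only sees Python-level allocations, not leaks inside C/CUDA extensions):

```python
import tracemalloc

tracemalloc.start()
snapshot_start = None

for epoch in range(num_epochs):  # placeholder training loop
    train_epoch(epoch)
    if epoch == 5:
        snapshot_start = tracemalloc.take_snapshot()
    if epoch == 15:
        snapshot_end = tracemalloc.take_snapshot()
        # Allocation sites whose total size grew the most between epochs 5 and 15.
        for stat in snapshot_end.compare_to(snapshot_start, "lineno")[:10]:
            print(stat)
```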