Open: stephengreen opened this issue 2 years ago
Update: The GPU RAM usage does not seem to increase progressively after loading from a checkpoint.
After running longer, the GPU RAM keeps growing, as does the CPU RAM on the main process. Are you seeing the same thing, @max-dax @jonaswildberger @kauii8school?
I wonder if it could be this... https://github.com/pytorch/pytorch/issues/13246
Yeah, could be. I remember coming across this when I was looking for the memory leak in the very beginning (which is why I was pushing against using dicts and lists in the dingo dataloader). There may still be an old e-mail thread about this.
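For context, if I understand the linked PyTorch issue correctly, the growth comes from copy-on-write with forked DataLoader workers: reading any element of a plain Python list or dict bumps its refcount, which dirties the shared pages and makes each process's RSS creep up over an epoch. A minimal sketch of the problematic pattern and the usual workaround (illustrative only, not the actual dingo dataset code; class names and sizes are made up):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Pattern that can trigger the growth described in pytorch#13246: the dataset
# holds many small Python objects, so every worker access touches refcounts
# and gradually un-shares the copy-on-write pages inherited from the parent.
class ListBackedDataset(Dataset):
    def __init__(self, n=100_000):
        self.samples = [{"x": np.random.randn(8), "y": i} for i in range(n)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        return torch.as_tensor(item["x"]), item["y"]


# Workaround: keep the data in a few large numpy arrays (or tensors) instead
# of many small Python objects, so worker reads do not dirty the shared pages.
class ArrayBackedDataset(Dataset):
    def __init__(self, n=100_000):
        self.x = np.random.randn(n, 8)
        self.y = np.arange(n)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.as_tensor(self.x[idx]), int(self.y[idx])
```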
Which lists/dicts could it be?
So based on my WandB runs, I did not encounter this problem; there, the GPU memory usage seems to remain constant.
Hm, okay. I will check again next time I train.
I think this is still an issue, but with CPU RAM. Will look into it again.
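For anyone who wants to track this, one quick way to log the main process's resident memory at epoch boundaries (just a sketch; `psutil` is an extra dependency and not necessarily part of the dingo setup):

```python
import os
import psutil

def log_cpu_memory(tag=""):
    # Resident set size (RSS) of the main training process, in MB.
    # DataLoader worker processes have their own RSS and are not included here.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f"[{tag}] main-process RSS: {rss_mb:.1f} MB")

# e.g. call log_cpu_memory(tag=f"epoch {epoch}") at the end of each epoch
```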
Using `nvidia-smi -l 1` to monitor the GPU during training, I notice that the GPU RAM usage slowly increases during each epoch (at least during the first training run). This adds up to ~100 MB per epoch. Also, when loading for the first time from a checkpoint, the GPU RAM usage is higher (by ~500 MB) than before.
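In case it helps with debugging, here is a minimal way to log GPU memory from inside the training script and to keep the checkpoint load off the GPU. This is only a sketch under assumed names: the tiny `nn.Linear` model, the `checkpoint.pt` file name, and the state-dict keys are placeholders, not the actual dingo training code.

```python
import torch
import torch.nn as nn

def log_gpu_memory(tag=""):
    if not torch.cuda.is_available():
        return
    # memory_allocated: memory held by live tensors; memory_reserved: memory
    # grabbed by the caching allocator (closer to what nvidia-smi reports).
    alloc_mb = torch.cuda.memory_allocated() / 2**20
    reserved_mb = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated: {alloc_mb:.1f} MB  reserved: {reserved_mb:.1f} MB")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 128).to(device)          # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters())

# Write a checkpoint just so the sketch is self-contained; in practice the
# file would come from a previous run.
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()}, "checkpoint.pt")

# Loading onto the CPU first avoids briefly holding a second full copy of
# every tensor on the GPU while the state dicts are copied into the model.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
log_gpu_memory(tag="after checkpoint load")

for epoch in range(3):
    # ... one epoch of training would run here ...
    log_gpu_memory(tag=f"epoch {epoch}")
```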