When you run train.py, how much memory does it use?

S-aiueo32 / srntt-pytorch

A PyTorch implementation of SRNTT, which is a novel RefSR method.

Apache License 2.0

When you run train.py, how much memory does it use? #15

Open 6ABCD0 opened 3 years ago

6ABCD0 commented 3 years ago

I trained it on a TITAN Xp (12 GB), but it failed with an "out of memory" error. I found that memory usage grows as the number of forward passes increases. In other words, when I run "python3 train.py --use_weights --netG_pre ./pretrain_model/netG_100.pth --netD_pre ./pretrain_model/netD_100.pth", memory usage is 5 GB at the start, climbs to 12 GB as the code keeps running, and finally exceeds 12 GB. @S-aiueo32 (screenshot of the error)
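To confirm the kind of growth described above, it can help to record one memory reading per training iteration and check whether the readings keep climbing. The following is only a rough sketch: `watch_memory` and `grows_steadily` are hypothetical helper names, not part of srntt-pytorch, and the demo uses a growing Python list as a stand-in for a leak so that it runs without a GPU.

```python
from typing import Callable, List


def watch_memory(
    read_bytes: Callable[[], int],
    steps: int,
    step_fn: Callable[[int], None],
) -> List[int]:
    """Run step_fn for `steps` iterations and record read_bytes() after each one."""
    readings = []
    for i in range(steps):
        step_fn(i)                      # one training iteration (hypothetical callback)
        readings.append(read_bytes())   # memory reading taken after the iteration
    return readings


def grows_steadily(readings: List[int], min_steps: int = 5) -> bool:
    """Return True if the last `min_steps` readings are strictly increasing."""
    tail = readings[-min_steps:]
    if len(tail) < min_steps:
        return False
    return all(a < b for a, b in zip(tail, tail[1:]))


if __name__ == "__main__":
    # Stand-in for a leak: every step keeps one more object alive.
    leaked = []
    readings = watch_memory(
        read_bytes=lambda: len(leaked),
        steps=10,
        step_fn=lambda i: leaked.append(i),
    )
    print(grows_steadily(readings))
```

With PyTorch on a GPU, `torch.cuda.memory_allocated` could be passed as `read_bytes`, and `step_fn` would wrap one iteration of the training loop. A steady increase over many iterations, rather than a plateau after warm-up, would point to a leak and not to a batch that is simply too large.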

S-aiueo32 commented 3 years ago

Hmm... Which version of PyTorch are you using? Depending on the version, there may be a memory leak.

6ABCD0 commented 3 years ago

My PyTorch version is 1.7.

S-aiueo32 commented 3 years ago

I used PyTorch 1.3 when I worked on this project. If you can, try running it with pipenv, which will reproduce my environment. I never ran into this issue in more than 100 epochs of training.

6ABCD0 commented 3 years ago

Thank you! I will try it again.

tsogkas commented 3 years ago

@WangduoXie I had the same issue and this fixed it for me: https://discuss.pytorch.org/t/memory-leak-with-wgan-gp-loss/112117

6ABCD0 commented 3 years ago

Thanks for letting me know. Best, Wangduo

HITRainer commented 2 years ago

Just delete "torch.autograd.set_detect_anomaly(True)" in train.py and then it works.
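Instead of deleting the line outright, one option is to keep anomaly detection available for debugging but leave it off by default. This is only a sketch, assuming train.py parses its options with argparse: the `--detect_anomaly` flag and both function names are hypothetical and do not exist in the repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical opt-in switch for autograd anomaly detection."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--detect_anomaly",
        action="store_true",
        help="enable torch.autograd anomaly detection (debugging only)",
    )
    return parser


def configure_anomaly_detection(enabled: bool) -> None:
    """Turn anomaly detection on or off for the whole process."""
    # Imported here so the parser above can be used without torch installed.
    import torch

    torch.autograd.set_detect_anomaly(enabled)


if __name__ == "__main__":
    args = build_parser().parse_args([])   # no flag given, so detection stays off
    print(args.detect_anomaly)
```

In train.py, `configure_anomaly_detection(args.detect_anomaly)` would replace the unconditional `torch.autograd.set_detect_anomaly(True)` call, so normal training runs without the extra bookkeeping that anomaly detection adds.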