VITA-Group / EnlightenGAN

[IEEE TIP] "EnlightenGAN: Deep Light Enhancement without Paired Supervision" by Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, Zhangyang Wang

High CPU RAM consumption for training and testing #61

Closed JustinLiu97 closed 3 years ago

JustinLiu97 commented 3 years ago

Hi TAMU-VITA:

Thank you so much for your impressive work! The results are just amazing.

While training and testing, I found that even when testing a single image, the Python process consumes almost 300 GB of CPU RAM. In addition, RAM usage keeps growing as training epochs progress. The growth can be avoided by setting num_workers (nThreads) to 0, but the process still takes almost 300 GB of RAM.
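A minimal way to watch the process's resident memory while training or testing (a sketch only; psutil is not part of this repo and `log_rss` is a hypothetical helper name):

```python
# Hypothetical helper, not part of EnlightenGAN: print the resident set size
# of the current process so RAM growth can be correlated with iterations.
import os
import psutil

def log_rss(tag=""):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print("[%s] resident memory: %.1f GB" % (tag, rss_gb))

# e.g. call log_rss("iter %d" % i) inside the training/testing loop
```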

I am new to PyTorch, so is there any information on what might cause this? The model weights themselves do not seem that large.

Thanks!

yifanjiang19 commented 3 years ago

Hi Justin,

Thank you for your interest in our work. I've seen several issues reporting a similar problem, but I haven't encountered it on my own server. It may be due to the Python or PyTorch version. Sorry, I cannot help much with that.

JustinLiu97 commented 3 years ago

> Hi Justin,
>
> Thank you for your interest in our work. I've seen several issues reporting a similar problem, but I haven't encountered it on my own server. It may be due to the Python or PyTorch version. Sorry, I cannot help much with that.

Thanks for your reply! Btw, would it be possible for you to share your environment settings? I am using Python 3.5.2 and the same PyTorch version (0.3.1) as in requirement.txt, but I am running CUDA 10.0 with the corresponding cuDNN 7.6.5 (on an RTX 2080 Max-Q). Maybe I can try your CUDA settings and see if the problem still appears. Thanks!

yifanjiang19 commented 3 years ago

I don't think this is caused by the CUDA version. Could you please try changing --pool_size? That option is most closely related to RAM cost.
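For reference, a minimal way to try this (a sketch only: the flag spellings follow the pix2pix-style options this codebase inherits, and the dataset path is a placeholder; keep the rest of your usual training arguments):

```bash
# --pool_size 0 keeps no history of generated images in RAM
# (the pool is the buffer of past fake images the discriminator samples from).
python train.py \
    --dataroot /path/to/your/dataset \
    --name enlightening \
    --pool_size 0 \
    --nThreads 0   # also keep DataLoader workers at 0, as noted above
```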

JustinLiu97 commented 3 years ago

> I don't think this is caused by the CUDA version. Could you please try changing --pool_size? That option is most closely related to RAM cost.

I changed --pool_size for training and it still takes 267 GB of RAM (a decrease of almost 30 GB). I also noticed that training and testing take more than 10 minutes to start. During testing, RAM usage stays at a low level for a while, then jumps to almost 150 GB and plateaus again, and then gradually climbs to almost 290 GB by the end of testing.
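One generic PyTorch 0.3.x check that might be worth ruling out, independent of this repo (a sketch; not confirmed to be the cause here): tensors that stay attached to the autograd graph, e.g. losses accumulated as Variables or test-time inputs created without `volatile=True`, keep their whole history alive and make memory grow over iterations.

```python
# Generic PyTorch 0.3.x patterns, not specific to EnlightenGAN.
import torch
from torch.autograd import Variable

# Test time: volatile=True tells 0.3.x not to build the autograd graph,
# so intermediate activations can be freed after each forward pass.
input_var = Variable(torch.randn(1, 3, 256, 256), volatile=True)

# Training time: accumulate the Python float, not the Variable itself,
# otherwise every iteration's graph stays alive in RAM:
# running_loss += loss.data[0]   # instead of: running_loss += loss
```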

yifanjiang19 commented 3 years ago

@JustinLiu97 https://github.com/TAMU-VITA/EnlightenGAN/blob/982e7a9b62599084ab75fb0a5c1e291d04f88fc3/predict.py#L32 Could you please check this line? I think the webpage keeps all images in a buffer until every iteration finishes.
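For context, the loop around that line follows the pix2pix-style pattern below (a sketch, names approximate, not a verbatim copy of predict.py); the idea is to drop the webpage bookkeeping and write each result straight to disk:

```python
# Sketch of the pix2pix-style test loop this repo builds on (names approximate).
# `dataset`, `model`, and `opt` come from the usual predict.py setup above the loop.
import os
from util import util  # util.save_image comes from this codebase family

for i, data in enumerate(dataset):
    model.set_input(data)
    model.test()
    visuals = model.get_current_visuals()   # dict: label -> numpy image
    img_path = model.get_image_paths()

    # The line under discussion: it records every result on the in-memory
    # HTML page object, which is only written out after the loop finishes.
    # visualizer.save_images(webpage, visuals, img_path)

    # Alternative: write each image straight to disk and keep nothing around.
    short_name = os.path.splitext(os.path.basename(img_path[0]))[0]
    for label, image_numpy in visuals.items():
        util.save_image(image_numpy,
                        os.path.join(opt.results_dir,
                                     '%s_%s.png' % (short_name, label)))
```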

JustinLiu97 commented 3 years ago

> @JustinLiu97
> https://github.com/TAMU-VITA/EnlightenGAN/blob/982e7a9b62599084ab75fb0a5c1e291d04f88fc3/predict.py#L32
> Could you please check this line? I think the webpage keeps all images in a buffer until every iteration finishes.

I commented out this line and still see 290 GB+ RAM usage during testing.

yifanjiang19 commented 3 years ago

Actually, you need to delete all of the webpage-related lines.

JustinLiu97 commented 3 years ago

> Actually, you need to delete all of the webpage-related lines.

I have commented out every line related to the visualizer but still get the same outcome, so it may not be caused by the visualizer (I am testing on a single image).
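Since the footprint persists with the visualizer removed, it may help to narrow down which step allocates the memory. A sketch (assuming the pix2pix-style `TestOptions`/`create_model` plumbing that predict.py appears to use; module paths may need adjusting):

```python
# Hypothetical narrowing-down script; module paths follow the pix2pix-style
# layout this repo is built on and may need adjusting.
import os
import psutil

from options.test_options import TestOptions
from models.models import create_model

def log_rss(tag):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print("[%s] resident memory: %.1f GB" % (tag, rss_gb))

opt = TestOptions().parse()
log_rss("after parsing options")

model = create_model(opt)          # is the ~300 GB already allocated here,
log_rss("after create_model")      # or only once images start flowing through?
```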

yifanjiang19 commented 3 years ago

Thanks for your feedback. Let me know if you find any way to solve this problem.

JustinLiu97 commented 3 years ago

> Thanks for your feedback. Let me know if you find any way to solve this problem.

Sure. Thank you so much for your advice. I will close this for now and update here if I find a solution.