dorarad / gansformer

Generative Adversarial Transformers
MIT License

Memory issue when training 1024 resolution #33

Closed by BlueberryGin 2 years ago

BlueberryGin commented 2 years ago

I'm trying to train on a 1024x1024 dataset on a V100 GPU. I tried both the TensorFlow version and the PyTorch version. Despite setting batch-gpu to 1, the TensorFlow version always runs out of system RAM (after the first tick; the system has 51 GB of RAM in total), and the PyTorch version always runs out of CUDA memory (before the first tick).

Here are my training settings:

python run_network.py --train --metrics 'none' --gpus 0 --batch-gpu 1 --resolution 1024 \
 --ganformer-default --expname art1 --dataset 1024art

Also, I always encounter the warning: tcmalloc: large alloc

BlueberryGin commented 2 years ago

Did some investigation on the TensorFlow model, and it turns out the problem occurs when saving snapshot images. There is probably some kind of memory leak when saving large images in visualize. Training runs fine when images are not saved.

Will investigate further later
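One way to check whether a particular step (such as saving snapshot images) is what drives memory up is to wrap it with Python's standard `tracemalloc` module. The following is a minimal sketch, not code from the repository: `measure_peak` and `fake_snapshot` are hypothetical helper names, and `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated natively by a framework may not show up.

```python
import tracemalloc


def measure_peak(fn, *args, **kwargs):
    """Run fn and report (result, current_bytes, peak_bytes) as seen by tracemalloc.

    Hypothetical helper for illustration; it only tracks allocations that go
    through Python's allocators, not every native allocation.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


def fake_snapshot(num_bytes):
    """Stand-in for a snapshot-saving step: allocates a temporary buffer."""
    buffer = bytearray(num_bytes)
    return len(buffer)


if __name__ == "__main__":
    _, current, peak = measure_peak(fake_snapshot, 10_000_000)
    print(f"still allocated: {current} bytes, peak: {peak} bytes")
```

If the "still allocated" figure keeps growing across repeated calls to the same step, that points to memory being retained rather than just a high transient peak.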

dorarad commented 2 years ago

Hi, thanks for reaching out! I did notice that the visualization takes a lot of RAM, but I haven't tracked the issue down yet: it's not a stateful module, so I'm not sure where specifically it could leak memory. However, I think the issue is that when it builds a visualization it holds 28 model outputs in memory at the same time (including stacks of attention maps), so reducing the grid size of the saved images here: https://github.com/dorarad/gansformer/blob/main/training/misc.py#L306 could mitigate the issue.
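To get a feel for why holding 28 outputs at once is expensive at 1024 resolution, here is a rough back-of-the-envelope sketch. It is not the repository's code, and it rests on assumptions: values are stored as float32, attention maps are kept at full image resolution, and `maps_per_sample` is a hypothetical count of attention maps per output (the real number depends on the model configuration).

```python
BYTES_PER_FLOAT32 = 4  # assumption: outputs are held as float32


def grid_memory_bytes(num_samples, resolution, channels=3, maps_per_sample=0):
    """Estimate the bytes needed to hold a grid of model outputs at once.

    maps_per_sample is a hypothetical number of single-channel attention maps
    kept per output, assumed to be at full image resolution.
    """
    image_bytes = resolution * resolution * channels * BYTES_PER_FLOAT32
    map_bytes = maps_per_sample * resolution * resolution * BYTES_PER_FLOAT32
    return num_samples * (image_bytes + map_bytes)


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Images only: 28 RGB float32 images at 1024x1024 come to 336 MiB.
    print(to_mib(grid_memory_bytes(28, 1024)))
    # Adding attention maps multiplies the per-sample cost (count is illustrative).
    print(to_mib(grid_memory_bytes(28, 1024, maps_per_sample=64)))
    # A smaller grid scales the total down linearly.
    print(to_mib(grid_memory_bytes(4, 1024, maps_per_sample=64)))
```

Under these assumptions the images alone are modest, and most of the footprint comes from the attention maps, which is consistent with shrinking the grid or skipping the maps as mitigations.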

I'll be making a couple of changes so that memory consumption is reduced by default, and I look forward to hearing whether you find anything further!

BlueberryGin commented 2 years ago

Cool, I will try that tomorrow and keep investigating:) Thanks!

BlueberryGin commented 2 years ago

Btw, training is currently running fine when I save only the output images and not the attention maps.

dorarad commented 2 years ago

I'll update the default options accordingly so that people won't run into memory issues. Thank you for opening this issue!