ashkamath / mdetr

Apache License 2.0

GPU memory usage keeps increasing as MDETR training progresses; after a few epochs of training it runs out of memory (OOM). #93

Closed linhuixiao closed 1 year ago

linhuixiao commented 1 year ago

GPU memory usage keeps increasing as MDETR training progresses; after a few epochs of training it runs out of memory (OOM). The issue list already has several similar reports of this problem.

Training setup: dataset refcoco, batch_size = 4 per GPU, 6× RTX 3090 (24 GB) GPUs

start: (epoch 0)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   13280MiB |
|    2   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   12380MiB |
|    3   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   13680MiB |
|    4   N/A  N/A    3740728      C   ...envs/mdetr_env/bin/python   17070MiB |
|    5   N/A  N/A    3740729      C   ...envs/mdetr_env/bin/python   11802MiB |
|    6   N/A  N/A    3740730      C   ...envs/mdetr_env/bin/python   13762MiB |
+-----------------------------------------------------------------------------+

some time later: (still epoch 0)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   14580MiB |
|    2   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python  129180MiB |
|    3   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   15780MiB |
|    4   N/A  N/A    3740728      C   ...envs/mdetr_env/bin/python   17070MiB |
|    5   N/A  N/A    3740729      C   ...envs/mdetr_env/bin/python   16402MiB |
|    6   N/A  N/A    3740730      C   ...envs/mdetr_env/bin/python   14062MiB |
+-----------------------------------------------------------------------------+

epoch 1: (1 hour later)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   16980MiB |
|    2   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python  194180MiB |
|    3   N/A  N/A    3740727      C   ...envs/mdetr_env/bin/python   17580MiB |
|    4   N/A  N/A    3740728      C   ...envs/mdetr_env/bin/python   17270MiB |
|    5   N/A  N/A    3740729      C   ...envs/mdetr_env/bin/python   16602MiB |
|    6   N/A  N/A    3740730      C   ...envs/mdetr_env/bin/python   14062MiB |
+-----------------------------------------------------------------------------+

epoch 2: failed, out of memory.

There are a lot of bugs in the mdetr code. I would like to ask the authors how they trained these models, since they have left a bunch of holes for subsequent researchers. Oh my God. @alcinos @nguyeho7 @ashkamath

alcinos commented 1 year ago

Please be courteous in the issues. We released the software under the Apache license, which explicitly states that we assume no liability and provide no warranty. Phrases like "There are a lot of bugs in the mdetr code" are a pretty direct, uncivil attack, even more so when they are not backed by facts.

Now about your "issue": the reason why the memory increases over training is because of padding. If you are unlucky, the may have a 800x1333 image and a 1333x800 in the same batch, which needs to be padded to 1333x1333 and that can sometimes push it over the edge if you have limited memory. If you had read our instructions carefully, you'd have noticed that we recommend fine-tuning on refcoco with two gpus with bs=4/gpu, hence a global bs of 8, while you are trying to do 6*4 = 24, which is likely to give incomparable results. You should use bs=2 on 4 gpus, which will be the same total batch but use less memory, thus solving your initial issue.