facebookresearch / Detic

Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".
Apache License 2.0

OOM errors when training base model #23

Open markweberdev opened 2 years ago

markweberdev commented 2 years ago

Hi @imisra and @xingyizhou,

Thanks so much for this great repository. I've run into a GPU out-of-memory issue and am hoping you can help me figure it out.

I followed the instructions to set up the repository. I would like to train Base-C2_L_R5021k_640b64_4x.yaml. My training setup is 4 GPUs with 48 GB of VRAM each. The paper mentions that 8 V100 GPUs (which I believe have less VRAM?) were used for training with a batch size of 64, so I reduced the batch size to 32 initially and later tried a batch size of 16 as well, but both end up giving me CUDA OOM errors.

At iteration 20, with a batch size of 32, the model consumes roughly 5 GB of VRAM. The memory usage then increases linearly over time until, at around iteration 12000, more than 48 GB of VRAM is needed, hence the OOM. I tried the same with a batch size of 16, but that just postpones the error. Is such an increase in VRAM usage expected? To me this looks like a memory leak somewhere, but since you trained the model successfully, I was wondering what you think is going wrong.
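Not part of Detic, but a minimal, generic PyTorch sketch of the kind of per-iteration memory logging that makes this growth visible (the helper name and logging interval are made up for illustration):

```python
# Generic diagnostic: log current and peak CUDA memory every N iterations
# from inside the training loop to see whether allocations grow linearly.
import torch

def log_gpu_memory(iteration, every=100):
    """Print allocated and peak CUDA memory for each visible GPU."""
    if iteration % every != 0 or not torch.cuda.is_available():
        return
    for device in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
        peak = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        print(f"iter {iteration} | cuda:{device} | "
              f"allocated {allocated:.2f} GiB | peak {peak:.2f} GiB")
```

Calling something like log_gpu_memory(it) once per iteration would show whether "allocated" keeps rising (tensors being kept alive across iterations) or only "peak" grows (allocator/fragmentation behaviour).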

My settings are as follows:

Command Line Args: Namespace(config_file='configs/Base-C2_L_R5021k_640b32_1x_lvis.yaml', dist_url='tcp://127.0.0.1:53865', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
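For context, the empty opts=[] above is detectron2's generic key/value override mechanism; a minimal sketch (plain detectron2 only, since Detic's actual entry point also registers its own project-specific config keys) of how a batch-size override such as SOLVER.IMS_PER_BATCH 32 maps onto the config:

```python
# Sketch of the detectron2 override pattern behind the opts list: the same
# key/value pairs can be appended to the training command line.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_list(["SOLVER.IMS_PER_BATCH", "32"])  # total batch size across all GPUs
print(cfg.SOLVER.IMS_PER_BATCH)  # 32
```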

Thanks, Mark

xingyizhou commented 2 years ago

Hi Mark,

Thank you for running our code. This is strange; I have never experienced it. I notice your CUDA/PyTorch version is different from ours (CUDA 11.1 / PyTorch 1.8 LTS). Would you mind trying our setting and seeing if the issue is still there?
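A quick, generic way to confirm which build is actually in use at runtime (not specific to Detic; the example values are just the ones mentioned in this thread):

```python
# Print the PyTorch / CUDA / cuDNN versions the running process actually uses.
import torch

print(torch.__version__)               # e.g. a 1.8 LTS build vs. 1.10
print(torch.version.cuda)              # e.g. "11.1" vs. "11.3"
print(torch.backends.cudnn.version())  # cuDNN build the wheel was compiled against
print(torch.cuda.get_device_name(0))   # sanity-check the visible GPU
```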

Best, Xingyi

markweberdev commented 2 years ago

Hi Xingyi,

TL;DR: Thanks, it works!

I very much appreciate your quick response! Thanks.

I followed your advice and tried CUDA 11.1 / PyTorch 1.8 LTS, and I no longer have issues with memory leaks (it's still training, but at this point I assume it's safe to say it works). So whatever was introduced between PyTorch 1.8 and PyTorch 1.10, or between CUDA 11.1 and CUDA 11.3.1, is causing a huge memory leak.

For reference: the model consumes around 9 GB of VRAM with CUDA 11.1 / PyTorch 1.8 LTS, while it quickly needed more than 48 GB of VRAM per GPU with CUDA 11.3.1 / PyTorch 1.10.
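A small optional guard one could add to a training script (purely an assumption for illustration, not part of the repo) that fails fast when launched on a combination other than the one that worked here:

```python
# Fail fast if the environment is not the PyTorch 1.8 LTS + CUDA 11.1 combination
# that trained without runaway VRAM growth in this issue.
import torch

def check_environment():
    torch_ok = torch.__version__.startswith("1.8")
    cuda_ok = torch.version.cuda == "11.1"
    if not (torch_ok and cuda_ok):
        raise RuntimeError(
            f"Expected PyTorch 1.8 LTS + CUDA 11.1, got torch {torch.__version__} "
            f"/ CUDA {torch.version.cuda}; newer combinations showed runaway VRAM "
            "growth in this setup."
        )

check_environment()
```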

Perhaps the PyTorch team might be interested in this issue.

Thanks, Mark