IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Why does evaluation run fine, but training does not? #185

Closed ma3252788 closed 1 year ago

ma3252788 commented 1 year ago

My GPU is a 2080 Ti with 10 GB of memory.

When I tested with the COCO dataset, evaluation ran fine.


But when I run training, it keeps failing with:

    return forward_call(*input, **kwargs)
  File "/home/ubuntu/16T/part4/detrex/detectron2/detectron2/modeling/backbone/swin.py", line 454, in forward
    x = blk(x, attn_mask)
  File "/home/ubuntu/anaconda3/envs/detrex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/16T/part4/detrex/detectron2/detectron2/modeling/backbone/swin.py", line 280, in forward
    attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C
  File "/home/ubuntu/anaconda3/envs/detrex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/16T/part4/detrex/detectron2/detectron2/modeling/backbone/swin.py", line 152, in forward
    attn = q @ k.transpose(-2, -1)
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 10.76 GiB total capacity; 9.04 GiB already allocated; 19.94 MiB free; 9.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

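The error message suggests setting max_split_size_mb. If I understand the allocator docs correctly, that would look roughly like the sketch below at the top of the training script (or the equivalent environment variable in the shell); the 128 MiB value is just a guess, and I'm not sure it helps when the allocation is this close to the 10 GB limit:

import os

# Must be set before any CUDA tensors are created so the caching
# allocator picks it up; max_split_size_mb caps the block size the
# allocator will split, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the environment variable is set
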
Even if I modify the config file:

# modify dataloader config
dataloader.train.num_workers = 1

# note that this is the total batch size across all GPUs.
# e.g. suppose you're using 4 GPUs with a total batch size of 16,
# then each GPU processes 16 / 4 = 4 samples.
dataloader.train.total_batch_size = 1

Is there anywhere else I can reduce memory usage? Can a 2080 Ti really not run even batch size = 1?
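
A couple of other knobs I'm thinking about trying, based on the detectron2-style LazyConfig defaults (the field names may not exactly match this repo's configs, so treat this as a sketch rather than something I've verified):

# enable automatic mixed precision; assuming the config exposes the
# detectron2-style train.amp field, this roughly halves activation memory
train.amp.enabled = True

# lowering the training image resolution also cuts memory sharply for a
# Swin backbone; the exact augmentation field depends on the dataloader
# config, so the line below is illustrative only
# dataloader.train.mapper.augmentation = [...]  # e.g. reduce the max image size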

ma3252788 commented 1 year ago

maybe https://github.com/IDEA-Research/detrex/issues/158#issuecomment-1337609119