HengLan / CGSTVG

[CVPR 2024] Context-Guided Spatio-Temporal Video Grounding

Question about out-of-memory error during training. #8

Closed · HenryHZY closed this issue 2 months ago

HenryHZY commented 2 months ago

Hi @HengLan @GX77, I have reproduced the training of CGSTVG following the default VidSTG config on 8 A100 (80 GB) GPUs.

I only made the necessary modifications:

1. model weight path in the yaml config
2. model weight paths in models/vidswin/video_swin_transformer.py and models/language_model/bert.py (to avoid errors; see the sketch after this list)
3. updated the transformers version (to avoid errors)
4. copied 'train_net.py' to /path/to/CGSTVG (to avoid errors when running python scripts/train_net.py)
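For reference, the change in (2) is just pointing the pretrained-weight loading at local files. A minimal sketch, assuming a local checkpoint directory (the actual variable names inside models/language_model/bert.py may differ):

```python
from transformers import BertModel, BertTokenizer

# Hypothetical local path; the original code pulls "bert-base-uncased" from the hub.
LOCAL_BERT_DIR = "/path/to/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(LOCAL_BERT_DIR)  # reads vocab.txt from disk
text_encoder = BertModel.from_pretrained(LOCAL_BERT_DIR)   # reads config.json + weights from disk
```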

It seems that there is a memory leak during training (note the growing "max mem" values below). Here is a selection from my log (the error occurs around epoch 7):

Video Grounding INFO: eta: 3 days, 11:45:45  iter: 50 / 100850  loss: 8.2850 (8.9416)  loss_actioness: 0.7642 (0.7673)  loss_bbox: 3.6686 (3.9498)  loss_conf: 0.9072 (1.0352)  loss_giou: 2.7466 (2.9188)  loss_sted: 0.2700 (0.2705)  time: 2.2971 (2.9915)  data: 0.0714 (0.3275)  lr: 0.000015  lr_vis_encoder: 0.000000  lr_text_encoder: 0.000002  lr_temp_decoder: 0.000005  max mem: 38893
...
Video Grounding INFO: eta: 2 days, 22:06:57  iter: 1000 / 100850  loss: 5.5472 (7.0571)  loss_actioness: 0.7282 (0.7533)  loss_bbox: 1.7671 (2.6599)  loss_conf: 0.7619 (0.7985)  loss_giou: 1.9483 (2.5755)  loss_sted: 0.2689 (0.2700)  time: 2.1783 (2.5280)  data: 0.0682 (0.1334)  lr: 0.000298  lr_vis_encoder: 0.000010  lr_text_encoder: 0.000050  lr_temp_decoder: 0.000099  max mem: 50155
...
Video Grounding INFO: eta: 2 days, 19:02:44  iter: 5650 / 100850  loss: 4.1014 (5.3206)  loss_actioness: 0.6372 (0.6936)  loss_bbox: 1.1094 (1.7375)  loss_conf: 0.6772 (0.7204)  loss_giou: 1.2793 (1.9460)  loss_sted: 0.2009 (0.2232)  time: 2.3899 (2.5353)  data: 0.1497 (0.1284)  lr: 0.000300  lr_vis_encoder: 0.000010  lr_text_encoder: 0.000050  lr_temp_decoder: 0.000100  max mem: 60105
...
Video Grounding INFO: eta: 19:19:30  iter: 73500 / 100850  loss: 3.1801 (4.1192)  loss_actioness: 0.6139 (0.6381)  loss_bbox: 0.7912 (1.2057)  loss_conf: 0.6276 (0.6508)  loss_giou: 1.0317 (1.4404)  loss_sted: 0.1679 (0.1843)  time: 2.2512 (2.5437)  data: 0.2396 (0.1695)  lr: 0.000300  lr_vis_encoder: 0.000010  lr_text_encoder: 0.000050  lr_temp_decoder: 0.000100  max mem: 66767

Traceback (most recent call last):
  File "CGSTVG-f062d84/train_net.py", line 354, in <module>
    main()
  File "CGSTVG-f062d84/train_net.py", line 347, in main
    model, model_ema = train(cfg, args.local_rank, args.distributed, logger)
  File "CGSTVG-f062d84/train_net.py", line 135, in train
    outputs = model(videos, texts, targets, iteration/max_iter)
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "CGSTVG-f062d84/models/pipeline.py", line 54, in forward
    vid_features = self.vid(videos.tensors, len(videos.tensors))
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "CGSTVG-f062d84/models/vidswin/video_swin_transformer.py", line 700, in forward
    vid_embeds = layer(vid_embeds.contiguous())
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "CGSTVG-f062d84/models/vidswin/video_swin_transformer.py", line 414, in forward
    x = blk(x, attn_mask)
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "CGSTVG-f062d84/models/vidswin/video_swin_transformer.py", line 272, in forward
    x = self.forward_part1(x, mask_matrix)
  File "CGSTVG-f062d84/models/vidswin/video_swin_transformer.py", line 243, in forward_part1
    attn_windows = self.attn(x_windows, mask=attn_mask)  # B*nW, Wd*Wh*Ww, C
  File "miniconda3/envs/stg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "CGSTVG-f062d84/models/vidswin/video_swin_transformer.py", line 160, in forward
    attn = attn + relative_position_bias.unsqueeze(0)  # B_, nH, N, N
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.12 GiB (GPU 6; 79.32 GiB total capacity; 60.18 GiB already allocated; 3.08 GiB free; 74.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
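The allocator hint at the end of the traceback suggests trying max_split_size_mb. A minimal sketch of that setting (it has to take effect before the first CUDA allocation, and it only mitigates fragmentation, so I would not expect it to fix a genuine leak):

```python
import os

# Set before torch initializes its CUDA caching allocator, e.g. at the very top
# of train_net.py (or export the variable in the launch shell instead).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```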

Have you encountered this issue? Could you please share your training log? Thank you! :)

HenryHZY commented 2 months ago

By the way, I am a little confused about the file structure of hc-stvg2. For example:

1. Official: "train_v2.json, val_v2.json, query_v2.json"
2. CGSTVG: "train.json, test.json"

Should I just rename these files?

HenryHZY commented 2 months ago

I will try reproducing STCAT (https://github.com/jy0205/STCAT) to check whether this increasing-memory issue is inherited from it.

Update: After reproducing STCAT, I can confirm that the increasing-memory issue is indeed inherited from STCAT.
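In case it helps others debugging this, a minimal sketch of how one could log per-interval peak memory inside the training loop to confirm the growth (the log_interval name is a placeholder, not from the repo):

```python
import torch

def log_peak_memory(iteration, log_interval=50):
    """Print peak allocated CUDA memory since the last reset, then reset the counter.

    Called every `log_interval` iterations, a steady upward drift in this value
    (as opposed to a one-off spike) points to a leak rather than fragmentation.
    """
    if iteration % log_interval != 0:
        return
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"iter {iteration}: peak allocated {peak_mb:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()
```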