WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

RuntimeError: CUDA out of memory #1124

Closed xddun closed 1 year ago

xddun commented 1 year ago

Thank you for your work, but I think the project still has a lot of room for improvement in practicality. Training on COCO 2017 works without problems, but this error occurs when I train on the Objects365 (2019) data. Training runs for a few epochs and then fails:


(py37c) xiedong@gpu20:/ssd/xiedong/workplace/yolov7$ python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 16 --device 0,1,2,3 --sync-bn --batch-size 128 --data data/Objects365_2019.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --name yolov7tiny_obj3652019 --hyp data/hyp.scratch.tiny.yaml --resume
/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Resuming training from ./runs/train/yolov7tiny_obj36520194/weights/last.pt
YOLOR 🚀 2022-11-16 torch 1.12.1+cu116 CUDA:0 (NVIDIA A100-PCIE-40GB, 40390.0625MB)
                                      CUDA:1 (NVIDIA A100-PCIE-40GB, 40390.0625MB)
                                      CUDA:2 (NVIDIA A100-PCIE-40GB, 40390.0625MB)
                                      CUDA:3 (NVIDIA A100-PCIE-40GB, 40390.0625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Namespace(adam=False, artifact_alias='latest', batch_size=64, bbox_interval=-1, bucket='', cache_images=False, cfg='', data='data/Objects365_2019.yaml', device='0,1,2,3', entity=None, epochs=300, evolve=False, exist_ok=False, freeze=[0], global_rank=0, hyp='data/hyp.scratch.tiny.yaml', image_weights=False, img_size=[416, 416], label_smoothing=0.0, linear_lr=False, local_rank=0, multi_scale=False, name='yolov7tiny_obj3652019', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=True, save_dir='runs/train/yolov7tiny_obj36520194', save_period=-1, single_cls=False, sync_bn=True, total_batch_size=256, upload_dataset=False, v5_metric=False, weights='./runs/train/yolov7tiny_obj36520194/weights/last.pt', workers=16, world_size=4)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.05, copy_paste=0.0, paste_in=0.05, loss_ota=1
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments
  0                -1  1       928  models.common.Conv                      [3, 32, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
  2                -1  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  3                -2  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  4                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  5                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  6  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
  7                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  8                -1  1         0  models.common.MP                        []
  9                -1  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 10                -2  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 11                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 12                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 13  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 15                -1  1         0  models.common.MP                        []
 16                -1  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 17                -2  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 19                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 20  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 21                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 22                -1  1         0  models.common.MP                        []
 23                -1  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 24                -2  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 25                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 26                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 27  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 28                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 29                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 30                -2  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 31                -1  1         0  models.common.SP                        [5]
 32                -2  1         0  models.common.SP                        [9]
 33                -3  1         0  models.common.SP                        [13]
 34  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 35                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 36          [-1, -7]  1         0  models.common.Concat                    [1]
 37                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 38                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 39                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 40                21  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 41          [-1, -2]  1         0  models.common.Concat                    [1]
 42                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 43                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 44                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 45                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 46  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 47                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 48                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 49                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 50                14  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 51          [-1, -2]  1         0  models.common.Concat                    [1]
 52                -1  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 53                -2  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 54                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 55                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 56  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 57                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 58                -1  1     73984  models.common.Conv                      [64, 128, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
 59          [-1, 47]  1         0  models.common.Concat                    [1]
 60                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 61                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 62                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 63                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 64  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 65                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 66                -1  1    295424  models.common.Conv                      [128, 256, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
 67          [-1, 37]  1         0  models.common.Concat                    [1]
 68                -1  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 69                -2  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 70                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 71                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 72  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]
 73                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 74                57  1     73984  models.common.Conv                      [64, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 75                65  1    295424  models.common.Conv                      [128, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 76                73  1   1180672  models.common.Conv                      [256, 512, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 77      [74, 75, 76]  1   1002116  models.yolo.IDetect                     [365, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 263 layers, 6999972 parameters, 6999972 gradients, 16.3 GFLOPS

/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Transferred 344/344 items from ./runs/train/yolov7tiny_obj36520194/weights/last.pt
Scaled weight_decay = 0.002
Optimizer groups: 58 .bias, 58 conv.weight, 61 other
Using SyncBatchNorm()
train: Scanning 'objects365_2019/train.cache' images and labels... 608576 found, 30 missing, 43 empty, 0 corrupted: 100%|████████████████████████████████████| 608606/608606 [00:00<?, ?it/s]
val: Scanning 'objects365_2019/val.cache' images and labels... 30000 found, 0 missing, 1 empty, 0 corrupted: 100%|█████████████████████████████████████████████| 30000/30000 [00:00<?, ?it/s]
train: Scanning 'objects365_2019/train.cache' images and labels... 608576 found, 30 missing, 43 empty, 0 corrupted: 100%|████████████████████████████████████| 608606/608606 [00:00<?, ?it/s]
train: Scanning 'objects365_2019/train.cache' images and labels... 608576 found, 30 missing, 43 empty, 0 corrupted: 100%|████████████████████████████████████| 608606/608606 [00:00<?, ?it/s]
train: Scanning 'objects365_2019/train.cache' images and labels... 608576 found, 30 missing, 43 empty, 0 corrupted: 100%|████████████████████████████████████| 608606/608606 [00:00<?, ?it/s]
Image sizes 416 train, 416 test
Using 14 dataloader workers
Logging results to runs/train/yolov7tiny_obj36520194
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     4/299     8.25G   0.04207   0.06424   0.03152    0.1378      1746       416:   0%|                                                                   | 1/2378 [00:05<3:48:15,  5.76s/it]
Reducer buckets have been rebuilt in this iteration.
     4/299     3.14G   0.04147   0.06523   0.03385    0.1405       642       416: 100%|██████████████████████████████████████████████████████████████████| 2378/2378 [24:15<00:00,  1.63it/s]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|████████████████████████████████████████████████████████| 235/235 [03:01<00:00,  1.29it/s]
                 all       30000      455258       0.494       0.146       0.124      0.0687

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     5/299     31.5G   0.04139   0.06513   0.03333    0.1399       761       416: 100%|██████████████████████████████████████████████████████████████████| 2378/2378 [23:50<00:00,  1.66it/s]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|████████████████████████████████████████████████████████| 235/235 [02:52<00:00,  1.36it/s]
                 all       30000      455258       0.501       0.147       0.129      0.0714

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     6/299     31.5G   0.04141   0.06474   0.03292    0.1391       537       416: 100%|██████████████████████████████████████████████████████████████████| 2378/2378 [23:56<00:00,  1.66it/s]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|████████████████████████████████████████████████████████| 235/235 [02:56<00:00,  1.34it/s]
                 all       30000      455258       0.513       0.148        0.13      0.0726

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     7/299     31.5G   0.04139   0.06443    0.0327    0.1385      1733       416:  65%|██████████████████████████████████████████▌                       | 1534/2378 [15:26<08:18,  1.69it/s]
Traceback (most recent call last):
  File "train.py", line 619, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 365, in train
    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
  File "/ssd/xiedong/workplace/yolov7/utils/loss.py", line 585, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "/ssd/xiedong/workplace/yolov7/utils/loss.py", line 733, in build_targets
    torch.log(y/(1-y)) , gt_cls_per_image, reduction="none"
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/nn/functional.py", line 3150, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 3.23 GiB (GPU 1; 39.44 GiB total capacity; 33.79 GiB already allocated; 1.01 GiB free; 36.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 784390 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 784392 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 784393 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 784391) of binary: /ssd/xiedong/miniconda3/envs/py37c/bin/python
Traceback (most recent call last):
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/ssd/xiedong/miniconda3/envs/py37c/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-17_11:43:00
  host      : gpu20
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 784391)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The GPU memory usage is abnormal: it keeps rising and fluctuating during training:

image
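The figures in the RuntimeError above can be pulled apart to see how much memory PyTorch had reserved but was not using for live tensors. The following is a minimal sketch, assuming the message format of the PyTorch 1.12 error shown in this log; the helper name `parse_oom` is made up for illustration and is not part of PyTorch or yolov7.

```python
import re

# Text copied from the RuntimeError in the log above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 3.23 GiB (GPU 1; 39.44 GiB total capacity; "
    "33.79 GiB already allocated; 1.01 GiB free; 36.79 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format shown above.

    Hypothetical helper: the patterns only match this specific wording.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {key!r} in message")
        stats[key] = float(match.group(1))
    # Memory held by the caching allocator that is not backing a live tensor.
    stats["cached_unused"] = round(stats["reserved"] - stats["allocated"], 2)
    return stats


stats = parse_oom(OOM_MESSAGE)
print(stats)
```

For this message, about 3 GiB is reserved but unallocated and 1.01 GiB is free, while the failed request was 3.23 GiB. That is the situation the `max_split_size_mb` hint in the error text refers to, but it does not by itself explain why allocation grew to 33.79 GiB on this GPU.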

xddun commented 1 year ago

I hope you can pay attention to this issue. I see that other people have raised similar questions but received no reply.

nitin-dominic commented 1 year ago

Although I have only trained on a single GPU, I do know that this error appears when the GPU does not have enough memory for the chosen batch size. So if you lower the batch size in your command line (the python train.py ... command; I see it is set to 64), you may no longer get this error. Let me know if that works!

xddun commented 1 year ago

Thank you very much for your reply! I hope my honest feedback helps improve the repository.

Let me give an overview of the current situation first, and then go into the details.

Overview:

(1) Something is wrong: I have 4 GPUs, but their memory usage differs widely during training.
(2) Setting batch-size=1 is not a good suggestion for yolov7-tiny at an image size of 416*416 on GPUs with 40 GB of memory. Even then it still does not work, and batch-size=1 means training takes a very long time.
(3) I will try the following:
a. yolov5 + my Objects365 dataset (2019 version), to verify whether the problem is in my dataset.
b. yolov7-tiny + the Objects365 dataset (2020 version).
(4) I am almost certain that the problem is in the data loading, because sometimes I can train for one full epoch and sometimes I cannot. It seems the repository code needs some changes to make it more robust.
(5) This combination works well: yolov7-tiny + COCO 2017 + any suitable batch size.

I will now describe the problem in more detail.

First, I trained on the COCO 2017 dataset like this:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 16 --device 0,1,2,3 --sync-bn --batch-size 1000 --data data/coco.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --name yolov7tiny_coco --hyp data/hyp.scratch.tiny.yaml

I have to admit, this works well! Even when I train for more than 10 epochs, "CUDA out of memory" does not appear, and the memory usage of the 4 GPUs stays the same:

image

After that, I downloaded the Objects365 dataset (2019, https://github.com/lidc1004/Object-detection-converts), which comes with JSON annotations. I started training after converting it to the YOLO format. This is the conversion code:

# -*- coding: UTF-8 -*-
import json
import os

jsonfile1 = "/ssd/xiedong/datasets/objects365/Annotations/val/val.json"
jsonfile2 = "/ssd/xiedong/datasets/objects365/Annotations/train/train.json"
for jsonfile in [jsonfile1, jsonfile2]:
    saveDstPath = os.path.dirname(jsonfile)
    with open(jsonfile, 'r', encoding="utf-8") as f:
        datas = json.load(f)

    id_names = {imt["id"]: imt["name"] for imt in datas["categories"]}
    imageid_hw_dict = {}
    for d in datas["images"]:
        imageid_hw_dict[d["id"]] = [d["width"], d["height"], d["file_name"]]
    annotations_imageid_idbox = {}
    for d in datas["annotations"]:
        if d["image_id"] not in annotations_imageid_idbox:
            annotations_imageid_idbox[d["image_id"]] = []
        annotations_imageid_idbox[d["image_id"]].append([d["bbox"], d["category_id"]])
    # Convert to YOLO format
    for imageid in annotations_imageid_idbox:
        hw = imageid_hw_dict[imageid]
        w = hw[0]
        h = hw[1]
        filename = hw[2]
        with open(os.path.join(saveDstPath, filename.replace(".jpg", ".txt")), "w") as f:
            res_str = []
            for box1 in annotations_imageid_idbox[imageid]:
                box = box1[0]
                x_yolo = min((box[0] + box[2] / 2) / w, 1.0)
                y_yolo = min((box[1] + box[3] / 2) / h, 1.0)
                w_yolo = min(box[2] / w, 1.0)
                h_yolo = min(box[3] / h, 1.0)
                res_str.append(
                    "{} {} {} {} {}".format(box1[1] - 1, round(x_yolo, 6), round(y_yolo, 6), round(w_yolo, 6),
                                            round(h_yolo, 6)))  # subtract 1 to map category ids 1..365 to 0..364
            f.write("\n".join(res_str))
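The conversion above caps each of the four values at 1.0 separately, so a box near the right or bottom border can still reach past the image once its half-width is added to the center. A minimal sketch of an alternative, assuming COCO-style `[x_min, y_min, width, height]` pixel boxes as in the JSON used above: clip the corners to the image first, then compute the center and size. The function name `coco_box_to_yolo` is hypothetical and this is not the code used in this thread.

```python
def coco_box_to_yolo(box, img_w, img_h):
    """Convert a COCO [x_min, y_min, width, height] pixel box to a
    normalized YOLO (x_center, y_center, width, height) tuple.

    The corners are clipped to the image before normalizing, so the
    result never extends outside [0, 1]. Returns None if nothing of
    the box remains inside the image.
    """
    x_min, y_min, w, h = box
    x1 = min(max(x_min, 0.0), img_w)
    y1 = min(max(y_min, 0.0), img_h)
    x2 = min(max(x_min + w, 0.0), img_w)
    y2 = min(max(y_min + h, 0.0), img_h)
    if x2 <= x1 or y2 <= y1:
        return None  # box lies entirely outside the image or has no area
    return (
        round((x1 + x2) / 2 / img_w, 6),
        round((y1 + y2) / 2 / img_h, 6),
        round((x2 - x1) / img_w, 6),
        round((y2 - y1) / img_h, 6),
    )


# A 20 px wide box starting at x=90 in a 100 px wide image overhangs by 10 px.
clipped = coco_box_to_yolo([90, 40, 20, 20], 100, 100)
outside = coco_box_to_yolo([120, 10, 5, 5], 100, 100)
print(clipped, outside)
```

For the example box, the per-value cap in the script above would give a center of 1.0 and a width of 0.2, which puts the right edge at 1.1; clipping the corners first keeps the right edge at 1.0.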

At first, I could not even complete one epoch! So I modified the YOLO labels to remove small targets and targets too close to the image edge. This is the code:

import os
from tqdm import tqdm

def listPathAllfiles(dirname):
    result = []
    for maindir, subdir, file_name_list in os.walk(dirname):
        for filename in file_name_list:
            apath = os.path.join(maindir, filename)
            result.append(apath)
    return result

path = r"/ssd/xiedong/datasets/objects365/labels"
files = listPathAllfiles(path)

for file in tqdm(files):
    with open(file, "r") as f:
        lines = f.read().splitlines()
        lines_new = []
        for line in lines:
            if len(line) < 2:
                continue
            cid, x0, y0, w, h = list(map(float, line.split(" ")))
            # Fix up the coordinates
            if x0 >= 0.99 or y0 >= 0.99:  # drop boxes too close to the edge or corner
                continue
            if x0 < 0.01 and y0 < 0.01:  # drop boxes too close to the edge or corner
                continue
            if w < 0.01 or h < 0.01:  # drop boxes that are too small
                continue
                continue
            if (x0 + w / 2 > 0.99):
                w = (0.99 - x0) * 2
            if (y0 + h / 2 > 0.99):
                h = (0.99 - y0) * 2
            if (x0 - w / 2 < 0.01):
                w = (x0 - 0.01) * 2
            if (y0 - h / 2 < 0.01):
                h = (y0 - 0.01) * 2

            if (x0 + w / 2 > 0.99):
                w = (0.99 - x0) * 2
            if (y0 + h / 2 > 0.99):
                h = (0.99 - y0) * 2
            if (x0 - w / 2 < 0.01):
                w = (x0 - 0.01) * 2
            if (y0 - h / 2 < 0.01):
                h = (y0 - 0.01) * 2
            str1 = str(int(cid)) + " " + str(round(x0, 6)) + " " + str(round(y0, 6)) + " " + str(
                round(w, 6)) + " " + str(round(h, 6))
            lines_new.append(str1)
    with open(file, "w") as f:
        f.write("\n".join(lines_new))

After this, I trained the model again and was able to train for several epochs!

But there are two problems:

(1) As I described at the beginning of this issue, the memory usage is not the same across the 4 GPUs.

(2) "RuntimeError: CUDA out of memory" still appears after several epochs of training.

I suspect this is caused by numerical problems in the label data, but I do not know how to solve it.
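One way to test the suspicion about the label values is to scan the label files for boxes with non-positive sizes or boxes that reach outside the image. The sketch below is an illustration only; `check_label_lines` is a hypothetical helper, not part of yolov7. It is worth noting that the filter script above can write a negative width: for a box with x0 = 0.005 and y0 = 0.5, none of the skip conditions fire, and `w = (x0 - 0.01) * 2` evaluates to -0.01.

```python
def check_label_lines(lines, eps=1e-6):
    """Check the lines of one YOLO label file.

    Returns (num_boxes, problems), where problems is a list of
    (line_number, description) tuples.
    """
    num_boxes = 0
    problems = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            problems.append((line_no, "expected 5 fields"))
            continue
        try:
            _, x, y, w, h = (float(p) for p in parts)
        except ValueError:
            problems.append((line_no, "non-numeric value"))
            continue
        num_boxes += 1
        if w <= 0 or h <= 0:
            problems.append((line_no, "non-positive width or height"))
        elif (x - w / 2 < -eps or y - h / 2 < -eps
              or x + w / 2 > 1 + eps or y + h / 2 > 1 + eps):
            problems.append((line_no, "box extends outside the image"))
    return num_boxes, problems


sample = [
    "3 0.5 0.5 0.2 0.2",       # fine
    "7 0.005 0.5 -0.01 0.05",  # negative width, as the filter script can produce
    "1 1.0 0.5 0.2 0.1",       # right edge at 1.1
]
result = check_label_lines(sample)
print(result)
```

Running this over every file under the labels directory, and also recording `num_boxes` per image, would show whether a few unusual label files line up with the iterations where memory usage jumps. Whether that is the actual cause here is not confirmed in this thread.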

Finally, setting "--batch-size 1" does not help:

I have 4 GPUs, so I have to set "--batch-size 4".
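The relationship between `--batch-size`, the number of GPUs, and the length of the progress bar can be checked with a little arithmetic. This is a sketch under two assumptions: that `--batch-size` is the total across all GPUs and is divided evenly among them (consistent with `batch_size=64, total_batch_size=256, world_size=4` in the Namespace printed above), and that the distributed sampler gives each rank a rounded-up equal share of the images. The function name is made up for illustration.

```python
import math


def iterations_per_epoch(num_images, total_batch_size, world_size):
    """Return (per_gpu_batch, iterations) for one distributed training epoch.

    Assumes the total batch is split evenly across ranks and each rank
    sees ceil(num_images / world_size) images per epoch.
    """
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be a multiple of the GPU count")
    per_gpu_batch = total_batch_size // world_size
    images_per_rank = math.ceil(num_images / world_size)
    return per_gpu_batch, math.ceil(images_per_rank / per_gpu_batch)


# 608606 training images are reported by the dataset scan in the log.
resumed_run = iterations_per_epoch(608606, 256, 4)
batch4_run = iterations_per_epoch(608606, 4, 4)
print(resumed_run, batch4_run)
```

Under these assumptions the two results match the progress bars in this thread: 2378 iterations for the resumed run and 152152 iterations for the `--batch-size 4` run, where each GPU processes a single image per step.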

When I run:

/ssd/xiedong/miniconda3/envs/py37c/bin/python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 16 --device 0,1,2,3 --sync-bn --batch-size 4 --data data/Objects365_2019.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --name yolov7tiny_obj3652019 --hyp data/hyp.scratch.tiny.yaml --resume

The GPU memory usage already shows that something is wrong (the following screenshot was taken when the progress bar was at 1%):

image

When the progress bar shows:

 0/299     34.9G    0.0631   0.05328    0.0924    0.2088        19       416:   9%|████▋                                               | 13764/152152 [28:36<4:50:19,  7.94it/s]

the memory usage looks like this:

image

I can say with confidence what happens next: "RuntimeError: CUDA out of memory".

I did search the existing yolov7 issues before opening this one.

xddun commented 1 year ago

The way the data is processed is very important. Numerically unstable label values cause this problem.

I used this approach instead: https://github.com/ultralytics/yolov5/blob/master/data/Objects365.yaml

It works for me!

It was a mysterious experience.

HaoLiuHust commented 11 months ago

I have also encountered this problem. What batch size did you end up using?