Crashing during first epoch - No crash message

K4HVH commented 7 months ago

Issue:

When starting training, YoloV9 will crash during the first epoch. There is no error message, it will simply quit out.

PS D:\Coding\yolov9> python train_dual.py  --batch 16  --img 640 --epochs 10 --min-items 0 --close-mosaic 15 --data data.yaml --weights yolov9-c-converted.pt --device 0 --cfg yolov9-c.yaml --hyp data/hyps/hyp.scratch-high.yaml
train_dual: weights=yolov9-c-converted.pt, cfg=yolov9-c.yaml, data=data.yaml, hyp=data/hyps/hyp.scratch-high.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, flat_cos_lr=False, fixed_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, min_items=0, close_mosaic=15, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLO  v0.1-85-gbad4f4b Python-3.12.2 torch-2.2.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080, 10239MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, cls_pw=1.0, obj=0.7, obj_pw=1.0, dfl=1.5, iou_t=0.2, anchor_t=5.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.3
Comet: run 'pip install comet_ml' to automatically track and visualize YOLO  runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
ClearML Task: overwriting (reusing) task id=~snip~
ClearML results page: ~snip~

                 from  n    params  module                                  arguments
  0                -1  1         0  models.common.Silence                   []
  1                -1  1      1856  models.common.Conv                      [3, 64, 3, 2]
  2                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  3                -1  1    212864  models.common.RepNCSPELAN4              [128, 256, 128, 64, 1]
  4                -1  1    164352  models.common.ADown                     [256, 256]
  5                -1  1    847616  models.common.RepNCSPELAN4              [256, 512, 256, 128, 1]
  6                -1  1    656384  models.common.ADown                     [512, 512]
  7                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
  8                -1  1    656384  models.common.ADown                     [512, 512]
  9                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
 10                -1  1    656896  models.common.SPPELAN                   [512, 512, 256]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 7]  1         0  models.common.Concat                    [1]
 13                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 14                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 15           [-1, 5]  1         0  models.common.Concat                    [1]
 16                -1  1    912640  models.common.RepNCSPELAN4              [1024, 256, 256, 128, 1]
 17                -1  1    164352  models.common.ADown                     [256, 256]
 18          [-1, 13]  1         0  models.common.Concat                    [1]
 19                -1  1   2988544  models.common.RepNCSPELAN4              [768, 512, 512, 256, 1]
 20                -1  1    656384  models.common.ADown                     [512, 512]
 21          [-1, 10]  1         0  models.common.Concat                    [1]
 22                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 23                 5  1    131328  models.common.CBLinear                  [512, [256]]
 24                 7  1    393984  models.common.CBLinear                  [512, [256, 512]]
 25                 9  1    656640  models.common.CBLinear                  [512, [256, 512, 512]]
 26                 0  1      1856  models.common.Conv                      [3, 64, 3, 2]
 27                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
 28                -1  1    212864  models.common.RepNCSPELAN4              [128, 256, 128, 64, 1]
 29                -1  1    164352  models.common.ADown                     [256, 256]
 30  [23, 24, 25, -1]  1         0  models.common.CBFuse                    [[0, 0, 0]]
 31                -1  1    847616  models.common.RepNCSPELAN4              [256, 512, 256, 128, 1]
 32                -1  1    656384  models.common.ADown                     [512, 512]
 33      [24, 25, -1]  1         0  models.common.CBFuse                    [[1, 1]]
 34                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
 35                -1  1    656384  models.common.ADown                     [512, 512]
 36          [25, -1]  1         0  models.common.CBFuse                    [[2]]
 37                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
 38[31, 34, 37, 16, 19, 22]  1  21545132  models.yolo.DualDDetect                 [2, [512, 512, 512, 256, 512, 512]]
yolov9-c summary: 962 layers, 51001900 parameters, 51001868 gradients, 238.9 GFLOPs

Transferred 18/1460 items from yolov9-c-converted.pt
AMP: checks passed
optimizer: SGD(lr=0.01) with parameter groups 238 weight(decay=0.0), 255 weight(decay=0.0005), 253 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning D:\Coding\yolov9\datasets\train\labels.cache... 6430 images, 5088 backgrounds, 0 corrupt: 100%|████████
val: Scanning D:\Coding\yolov9\datasets\valid\labels.cache... 424 images, 341 backgrounds, 0 corrupt: 100%|██████████|
Plotting labels to runs\train\exp2\labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs\train\exp2
Starting training for 10 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
        0/9      19.9G  2.722e-11      209.2  1.517e-11          5        640:   0%|          | 0/402 00:20
PS D:\Coding\yolov9>

Setup/Environment

Windows 11 23H2
Pip Requirements Installed
Cuda 12.1 Torch (+vision/audio too)
Specs: RTX 3080, R9 3950X, 32GB DDR4

Other Details

All datasets I have tried to use were exported from Roboflow. I have tried 3 different sets with the same result each time.
Changing to the Gelan Models doesn't affect the result (It still crashes)
I reverted to Ultralytics YoloV8 and still had the same issue

Configs:

Command Used: python train_dual.py --batch 16 --img 640 --epochs 50 --min-items 0 --close-mosaic 15 --data data.yaml --weights yolov9-c-converted.pt --device 0 --cfg yolov9-c.yaml --hyp data/hyps/hyp.scratch-high.yaml

Data.yaml:

train: datasets/train/images
val: datasets/valid/images
test: datasets/test/images

nc: 2
names: ['Test1', 'Test2']

roboflow:
  workspace: kw4l0r-pbsii
  project: test-arefo
  version: 5
  license: CC BY 4.0
  url: https://universe.roboflow.com/kw4l0r-pbsii/test-arefo/dataset/5

Yolov9-c.yaml:

# YOLOv9

# parameters
nc: 2  # number of classes
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
#activation: nn.LeakyReLU(0.1)
#activation: nn.ReLU()

# anchors
anchors: 3

# YOLOv9 backbone
backbone:
  [
   [-1, 1, Silence, []],  

   # conv down
   [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2

   # conv down
   [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4

   # elan-1 block
   [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 3

   # avg-conv down
   [-1, 1, ADown, [256]],  # 4-P3/8

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 5

   # avg-conv down
   [-1, 1, ADown, [512]],  # 6-P4/16

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 7

   # avg-conv down
   [-1, 1, ADown, [512]],  # 8-P5/32

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 9
  ]

# YOLOv9 head
head:
  [
   # elan-spp block
   [-1, 1, SPPELAN, [512, 256]],  # 10

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 7], 1, Concat, [1]],  # cat backbone P4

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 13

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 5], 1, Concat, [1]],  # cat backbone P3

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)

   # avg-conv-down merge
   [-1, 1, ADown, [256]],
   [[-1, 13], 1, Concat, [1]],  # cat head P4

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)

   # avg-conv-down merge
   [-1, 1, ADown, [512]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 22 (P5/32-large)

   # multi-level reversible auxiliary branch

   # routing
   [5, 1, CBLinear, [[256]]], # 23
   [7, 1, CBLinear, [[256, 512]]], # 24
   [9, 1, CBLinear, [[256, 512, 512]]], # 25

   # conv down
   [0, 1, Conv, [64, 3, 2]],  # 26-P1/2

   # conv down
   [-1, 1, Conv, [128, 3, 2]],  # 27-P2/4

   # elan-1 block
   [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 28

   # avg-conv down fuse
   [-1, 1, ADown, [256]],  # 29-P3/8
   [[23, 24, 25, -1], 1, CBFuse, [[0, 0, 0]]], # 30  

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 31

   # avg-conv down fuse
   [-1, 1, ADown, [512]],  # 32-P4/16
   [[24, 25, -1], 1, CBFuse, [[1, 1]]], # 33 

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 34

   # avg-conv down fuse
   [-1, 1, ADown, [512]],  # 35-P5/32
   [[25, -1], 1, CBFuse, [[2]]], # 36

   # elan-2 block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 37

   # detection head

   # detect
   [[31, 34, 37, 16, 19, 22], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)
  ]

jsjoel commented 6 months ago

I am not a expert in this.But i would like to share some insights. I earlier worked on yolov9 with custom dataset and had this same error. Try reducing the batch size to 8 or 12 and use gpu. It worked for me but i dont how it works like that

Hope this helps!

Gcurrivand commented 4 months ago

Any update on your situation @K4HVH ? Because i get the same error and i can't figure it out

KoniHD commented 4 months ago

Hi there, not an expert here but maybe this helps:

Yolov9 (normal model) uses auxiliary branches for the training to achieve better mAP. In inference these branches have to be "removed" resulting in same mAP but faster inference. Removing these auxiliary branches is done by reparameterizing, resulting the converted weights like yolov9-c-converted.pt you are trying to use. These auxiliary branches are also the reason you have to use train_dual.py for yolov9-c.pt weights.

So the quick answer is:

You cannot use yolov9-converted.pt weights with train_dual.py as it requires weights for the auxiliary branches. (Idk if you can use the converted weights with train.py. You have to try that but #175 (comment) says you'll need to use gelan-c.yaml)
For your situation try python train_dual.py --weights yolov9-c.pt and don't forget to reparameterize for inference later.

More details see #209 (comment) and #131

Gcurrivand commented 4 months ago

I decided to switch to yolov10, it works well now

WongKinYiu / yolov9