WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0

Continued training? #323

Closed Alkohole closed 7 months ago

Alkohole commented 7 months ago

Hello,

Can I retrain the model on a new dataset using your script?

Can I just run the train_dual.py script with the flag --weights runs/train/exp4/weights/best.pt?

Thank you for your work and your attention, and I apologize for my English.

WongKinYiu commented 7 months ago

--resume runs/train/exp4/weights/best.pt

Alkohole commented 7 months ago

Thank you very much!

Alkohole commented 7 months ago

I can't continue training with the --resume flag, the script says training for 100 epochs is complete and there is nothing to continue:

Traceback (most recent call last):
  File "/workspace/yolov9/train_dual.py", line 644, in <module>
    main(opt)
  File "/workspace/yolov9/train_dual.py", line 538, in main
    train(opt.hyp, opt, device, callbacks)
  File "/workspace/yolov9/train_dual.py", line 174, in train
    best_fitness, start_epoch, epochs = smart_resume(ckpt, optimizer, ema, weights, epochs, resume)
  File "/workspace/yolov9/utils/torch_utils.py", line 469, in smart_resume
    assert start_epoch > 0, f'{weights} training to {epochs} epochs is finished, nothing to resume.\n' \
AssertionError: /workspace/yolov9/runs/train/exp4/weights/best.pt training to 100 epochs is finished, nothing to resume.
Start a new training without --resume, i.e. 'python train.py --weights /workspace/yolov9/runs/train/exp4/weights/best.pt'

The command looks like this:

python train_dual.py --workers 8 --batch 16 --img 640 --epochs 150 --data /workspace/data.yaml --resume /workspace/yolov9/runs/train/exp4/weights/best.pt --device 0 --cfg /workspace/yolov9/models/detect/yolov9_custom.yaml --hyp /workspace/yolov9/data/hyps/hyp.scratch-high.yaml

What am I doing wrong?

Full terminal response:

```
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/workspace/yolov9'
To add an exception for this directory, call:
    git config --global --add safe.directory /workspace/yolov9'
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/workspace/yolov9'
To add an exception for this directory, call:
    git config --global --add safe.directory /workspace/yolov9'
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout)
wandb: W&B disabled due to login timeout.
train_dual: weights=, cfg=/workspace/yolov9/models/detect/yolov9_custom.yaml, data=/workspace/data.yaml, hyp=/workspace/yolov9/data/hyps/hyp.scratch-high.yaml, epochs=150, batch_size=16, imgsz=640, rect=False, resume=/workspace/yolov9/runs/train/exp4/weights/best.pt, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, flat_cos_lr=False, fixed_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, min_items=0, close_mosaic=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
fatal: detected dubious ownership in repository at '/workspace/yolov9'
To add an exception for this directory, call:
    git config --global --add safe.directory /workspace/yolov9
YOLO 🚀 2024-4-5 Python-3.10.14 torch-2.2.2+cu121 CUDA:0 (NVIDIA RTX A4000, 16109MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, cls_pw=1.0, obj=0.7, obj_pw=1.0, dfl=1.5, iou_t=0.2, anchor_t=5.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.3
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLO 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLO 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

    from                       n  params    module                                 arguments
activation: nn.ReLU()
 0  -1                         1  0         models.common.Silence                  []
 1  -1                         1  1856      models.common.Conv                     [3, 64, 3, 2]
 2  -1                         1  73984     models.common.Conv                     [64, 128, 3, 2]
 3  -1                         1  212864    models.common.RepNCSPELAN4             [128, 256, 128, 64, 1]
 4  -1                         1  590336    models.common.Conv                     [256, 256, 3, 2]
 5  -1                         1  847616    models.common.RepNCSPELAN4             [256, 512, 256, 128, 1]
 6  -1                         1  2360320   models.common.Conv                     [512, 512, 3, 2]
 7  -1                         1  2857472   models.common.RepNCSPELAN4             [512, 512, 512, 256, 1]
 8  -1                         1  2360320   models.common.Conv                     [512, 512, 3, 2]
 9  -1                         1  2857472   models.common.RepNCSPELAN4             [512, 512, 512, 256, 1]
10  -1                         1  656896    models.common.SPPELAN                  [512, 512, 256]
11  -1                         1  0         torch.nn.modules.upsampling.Upsample   [None, 2, 'nearest']
12  [-1, 7]                    1  0         models.common.Concat                   [1]
13  -1                         1  3119616   models.common.RepNCSPELAN4             [1024, 512, 512, 256, 1]
14  -1                         1  0         torch.nn.modules.upsampling.Upsample   [None, 2, 'nearest']
15  [-1, 5]                    1  0         models.common.Concat                   [1]
16  -1                         1  912640    models.common.RepNCSPELAN4             [1024, 256, 256, 128, 1]
17  -1                         1  590336    models.common.Conv                     [256, 256, 3, 2]
18  [-1, 13]                   1  0         models.common.Concat                   [1]
19  -1                         1  2988544   models.common.RepNCSPELAN4             [768, 512, 512, 256, 1]
20  -1                         1  2360320   models.common.Conv                     [512, 512, 3, 2]
21  [-1, 10]                   1  0         models.common.Concat                   [1]
22  -1                         1  3119616   models.common.RepNCSPELAN4             [1024, 512, 512, 256, 1]
23  5                          1  131328    models.common.CBLinear                 [512, [256]]
24  7                          1  393984    models.common.CBLinear                 [512, [256, 512]]
25  9                          1  656640    models.common.CBLinear                 [512, [256, 512, 512]]
26  0                          1  1856      models.common.Conv                     [3, 64, 3, 2]
27  -1                         1  73984     models.common.Conv                     [64, 128, 3, 2]
28  -1                         1  212864    models.common.RepNCSPELAN4             [128, 256, 128, 64, 1]
29  -1                         1  590336    models.common.Conv                     [256, 256, 3, 2]
30  [23, 24, 25, -1]           1  0         models.common.CBFuse                   [[0, 0, 0]]
31  -1                         1  847616    models.common.RepNCSPELAN4             [256, 512, 256, 128, 1]
32  -1                         1  2360320   models.common.Conv                     [512, 512, 3, 2]
33  [24, 25, -1]               1  0         models.common.CBFuse                   [[1, 1]]
34  -1                         1  2857472   models.common.RepNCSPELAN4             [512, 512, 512, 256, 1]
35  -1                         1  2360320   models.common.Conv                     [512, 512, 3, 2]
36  [25, -1]                   1  0         models.common.CBFuse                   [[2]]
37  -1                         1  2857472   models.common.RepNCSPELAN4             [512, 512, 512, 256, 1]
38  [31, 34, 37, 16, 19, 22]   1  21542822  models.yolo.DualDDetect                [1, [512, 512, 512, 256, 512, 512]]
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
Model summary: 930 layers, 60797222 parameters, 60797190 gradients, 266.1 GFLOPs

Transferred 1412/1412 items from /workspace/yolov9/runs/train/exp4/weights/best.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 230 weight(decay=0.0), 247 weight(decay=0.0005), 245 bias
Traceback (most recent call last):
  File "/workspace/yolov9/train_dual.py", line 644, in <module>
    main(opt)
  File "/workspace/yolov9/train_dual.py", line 538, in main
    train(opt.hyp, opt, device, callbacks)
  File "/workspace/yolov9/train_dual.py", line 174, in train
    best_fitness, start_epoch, epochs = smart_resume(ckpt, optimizer, ema, weights, epochs, resume)
  File "/workspace/yolov9/utils/torch_utils.py", line 469, in smart_resume
    assert start_epoch > 0, f'{weights} training to {epochs} epochs is finished, nothing to resume.\n' \
AssertionError: /workspace/yolov9/runs/train/exp4/weights/best.pt training to 100 epochs is finished, nothing to resume.
Start a new training without --resume, i.e. 'python train.py --weights /workspace/yolov9/runs/train/exp4/weights/best.pt'
```

WongKinYiu commented 7 months ago

For transfer learning: --weights runs/train/exp4/weights/best.pt

For resume training: --resume runs/train/exp4/weights/best.pt
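The distinction between the two flags can be sketched as a small helper that picks the right one when assembling a command line. This is only an illustration: the function name `checkpoint_args` and its `interrupted` parameter are hypothetical and not part of the repository; only the `--weights` and `--resume` flags come from the thread.

```python
def checkpoint_args(checkpoint: str, interrupted: bool) -> list[str]:
    """Pick the flag for a checkpoint (hypothetical helper, not repository code).

    interrupted=True  -> the run stopped partway, so continue it with --resume.
    interrupted=False -> the run finished, so start a new run (transfer
                         learning) initialized from the checkpoint with --weights.
    """
    flag = "--resume" if interrupted else "--weights"
    return [flag, checkpoint]


# Example: a finished run reused as the starting point for a new dataset.
command = ["python", "train_dual.py", *checkpoint_args("runs/train/exp4/weights/best.pt", interrupted=False)]
print(" ".join(command))
```

The remaining options (dataset, config, epochs and so on) would be appended to the list in the same way.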

dsbyprateekg commented 7 months ago

> I can't continue training with the --resume flag, the script says training for 100 epochs is complete and there is nothing to continue: [...]
>
> What am I doing wrong?

It means you have already completed your first training run of 100 epochs. You now want to train for 150 epochs, which cannot be done by resuming the first run, because that run was configured with fewer epochs and has finished. You need to start a new training run with 150 epochs from the beginning; you can then resume it if it stops partway through.
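The `assert start_epoch > 0` check in the traceback can be illustrated with a minimal sketch. It assumes, as in YOLOv5-style training code, that the resume epoch is the stored epoch plus one and that a finalized checkpoint stores an epoch of -1; neither detail is confirmed in this thread, and `resume_start_epoch` is a hypothetical name, not the repository's `smart_resume`.

```python
def resume_start_epoch(ckpt_epoch: int) -> int:
    """Return the epoch a resumed run would start from.

    Assumption (not confirmed by the thread): a checkpoint from a finished
    run stores epoch == -1, while an interrupted run stores the last
    completed epoch index.
    """
    start_epoch = ckpt_epoch + 1
    if start_epoch <= 0:
        # Mirrors the spirit of the assertion shown in the traceback.
        raise ValueError("training is finished, nothing to resume")
    return start_epoch


print(resume_start_epoch(41))  # interrupted run: continue after stored epoch 41
try:
    resume_start_epoch(-1)     # finished run under the assumption above
except ValueError as err:
    print(err)
```

Under these assumptions, passing a larger `--epochs` value does not help, because the check depends only on the epoch stored in the checkpoint.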

Alkohole commented 7 months ago

Aha, I understand: --resume is for resuming an interrupted training session, not for continuing a completed training session on new data.

Okay, thank you all for your help.

asadikani commented 3 months ago

This worked for me:

python train_dual.py --workers 1 --device cpu --batch 4 --data datasets/data.yaml --img 640 --cfg models/detect/yolov9-c.yaml --weights '' --name yolov9-c --hyp hyp.scratch-high.yaml --min-items 0 --epochs 10 --close-mosaic 15 --resume runs/train/yolov9-c/weights/last.pt

asadikani commented 3 months ago

After 1 epoch had completed on my PC, the PC shut down. With the command above, the number of remaining epochs was 4.