Error while resuming training

WongKinYiu / yolor

implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)

GNU General Public License v3.0


Error while resuming training #116

Closed mariusfm54 closed 2 years ago

mariusfm54 commented 2 years ago

Hello, I started a training run and stopped it before it finished. When I try to resume it with python3 train.py --resume, I get the following error:

Traceback (most recent call last):
  File "train.py", line 537, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 81, in train
    model = Darknet(opt.cfg).to(device)  # create
  File "yolor/models/models.py", line 530, in __init__
    self.module_defs = parse_model_cfg(cfg)
  File "yolor/utils/parse_config.py", line 13, in parse_model_cfg
    with open(path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '.cfg'

I also tried running python3 train.py --cfg my_cfg.cfg --resume, but I got the same error.

Then I noticed the following line in train.py (line 502): opt.cfg, opt.weights, opt.resume = '', ckpt, True. So the cfg filename is set to ''. I tried changing the line to opt.weights, opt.resume = ckpt, True, but I still got the same error.
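The line quoted from train.py can be reproduced in isolation to see how an empty cfg name would turn into the '.cfg' path in the error. This is only a sketch: apply_resume and resolve_cfg_path are hypothetical helpers, not functions from the repository, and the suffix-appending step is an assumption that happens to be consistent with the '.cfg' shown in the traceback.

```python
from types import SimpleNamespace


def apply_resume(opt, ckpt):
    """Hypothetical stand-in for the line quoted from train.py."""
    # The cfg name is overwritten with '', whatever was passed via --cfg.
    opt.cfg, opt.weights, opt.resume = '', ckpt, True
    return opt


def resolve_cfg_path(path):
    """Assumed parser behaviour: append '.cfg' when the suffix is missing."""
    if not path.endswith('.cfg'):
        path += '.cfg'
    return path


# Options as they might look after parsing --cfg my_cfg.cfg --resume
opt = SimpleNamespace(cfg='my_cfg.cfg', weights='', resume=True)
opt = apply_resume(opt, 'runs/train/xxx/weights/last.pt')

cfg_path = resolve_cfg_path(opt.cfg)
print(repr(cfg_path))  # '.cfg'
```

Under these assumptions the value given to --cfg never reaches the model constructor, which would explain why adding --cfg alone does not change the error.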

Do you have any clue?

WongKinYiu commented 2 years ago

python3 train.py --cfg 'my_cfg.cfg' --weights 'runs/train/xxx/weights/last.pt'

mariusfm54 commented 2 years ago

Great, thank you!

santhraul commented 2 years ago

This creates another run folder, so the resumed run shows up as a differently coloured TensorBoard plot. Is there any way to continue the same training instance and update the last training directory, so that a new TensorBoard plot is not created?

mariusfm54 commented 2 years ago

You have to use exactly the same command line as the first one you used, simply replacing the weights with last.pt.

What do you mean by TensorBoard plot? Yes, it creates a new folder every time you resume training, but training continues from where you stopped it. In the end, for me, the plots in the yolor/runs/train/name/ folder cover the training over all the epochs, and I do not see any different colors.

mbl1234 commented 2 years ago

None of these methods solved the problem. What should I do? #160

x-yy0 commented 2 years ago

python train.py --cfg 'mycfg.cfg' --resume 'runs/train/xxx/weights/last.pt'

and modify the line in train.py as follows: opt.cfg, opt.weights, opt.resume = '', ckpt, True -> opt.weights, opt.resume = ckpt, True

burak43 commented 2 years ago

@WongKinYiu @yydc-0 When I use python tune.py --cfg 'mycfg.cfg' --resume 'my_ckpt.pt', the code gets stuck at the line model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank). Any suggestions?

qutyyds commented 2 years ago

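I would like to ask: the learning rate changes when training is resumed. How can I make sure it continues from the previous learning rate?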
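One possible reason for a change in learning rate, sketched under assumptions: if the learning rate follows a per-epoch schedule, its value depends on the epoch counter that is restored at resume time. The cosine formula, the lr0 and lrf values, and the function names below are illustrative assumptions and are not taken from train.py.

```python
import math


def lr_factor(epoch, epochs, lrf=0.2):
    """Assumed cosine decay from 1.0 down to lrf over `epochs` epochs."""
    return ((1 + math.cos(epoch * math.pi / epochs)) / 2) * (1 - lrf) + lrf


def lr_at(epoch, epochs, lr0=0.01):
    """Learning rate at a given epoch for an assumed initial rate lr0."""
    return lr0 * lr_factor(epoch, epochs)


epochs = 300        # total epochs of the original run (illustrative)
stopped_at = 100    # epoch at which training was interrupted (illustrative)

# Epoch counter restored from the checkpoint: the schedule picks up where it left off.
lr_continued = lr_at(stopped_at, epochs)

# Only the weights are loaded and the counter starts at 0: the schedule starts over.
lr_restarted = lr_at(0, epochs)

print(f"continued: {lr_continued:.5f}, restarted: {lr_restarted:.5f}")
```

In this sketch the two values differ because the schedule is a function of the epoch index. Whether a given command restores the epoch counter and optimizer state depends on what the checkpoint stores and how train.py loads it, which this thread does not confirm.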