WongKinYiu / yolor

implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)
GNU General Public License v3.0

about resume training #160

Open amaze567 opened 2 years ago

amaze567 commented 2 years ago

Hello, I have trained a yolor-p6 model on my dataset for 1000 epochs. However, when I tried to fine-tune the network and loaded the epoch-300 weights, training started again from epoch zero. Is that normal, or did I fail to load the old weights? And how can I tell whether the old weights were loaded successfully?

Wazaki-Ou commented 2 years ago

As long as you load the checkpoint .pt file as your weights, it should be fine. The epoch counter refers to the run you are currently starting, so it is normal for it to start from zero.
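To illustrate the distinction, here is a minimal sketch of why the counter behaves this way. It assumes the YOLOv5-style checkpoint layout that yolor's train.py follows (the last completed epoch stored under an `epoch` key); a plain dict stands in for `torch.load`:

```python
# Stand-in for torch.load("last.pt") -- assumed YOLOv5-style layout
ckpt = {"epoch": 299, "model": "state_dict..."}

# --resume: training continues from the epoch stored in the checkpoint
start_epoch = ckpt["epoch"] + 1   # 300

# --weights last.pt (fine-tuning): only the model weights are copied,
# the epoch counter is reset, so the console shows epoch 0 again
finetune_start_epoch = 0

print(start_epoch, finetune_start_epoch)  # 300 0
```

So an epoch counter starting at 0 does not by itself mean the weights were not loaded; it only means a new run was started rather than a resume.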

amaze567 commented 2 years ago

@Wazaki-Ou Thanks for your reply, though I still have a question. The loss at the starting epoch of the fine-tune run is 0.1604, but the checkpoint I loaded had already been trained down to a loss of 0.02xx. Shouldn't they be the same, or at least close? That's why I suspect the program did not actually load the checkpoint file.

Wazaki-Ou commented 2 years ago

@amaze567 To be honest, I'm not sure whether that is incorrect behavior. I hope someone with a better understanding of how resume works can help.

amaze567 commented 2 years ago

@Wazaki-Ou OK. Still thanks for your reply. :)

Wilbertbh-Tan commented 2 years ago

@amaze567 I think I have the same issue: the checkpoint did not actually load, so training runs as if --weights '' were passed. When you reload the checkpoint .pt, do the epochs start from 0? That seems to happen whenever I load my checkpoint.

amaze567 commented 2 years ago

@Wilbertbh-Tan Hi, I am still facing the same issue. I have tried reloading the old weights many times, but training still starts from epoch zero. Have you made any progress on it?

Wilbertbh-Tan commented 2 years ago

@amaze567 Yes. First, make sure the path to your weights is correct; if it isn't, training starts from scratch. When you resume training by running train.py, it should continue from where you left off. To change this for fine-tuning, I edited train.py to set the starting epoch to the value I wanted.

In your case, I'm not sure whether the weight file is being created from scratch or training is actually resuming. Can you verify this by checking the log?
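Besides the log, you can inspect the checkpoint file directly. Here is a hypothetical standalone check (not part of the repo) that assumes the YOLOv5-style checkpoint keys `epoch` and `model`; it fabricates a tiny checkpoint so the snippet runs on its own, but in practice you would point `torch.load` at your own last.pt or best.pt:

```python
import torch

# Build a tiny stand-in checkpoint so the snippet is self-contained;
# replace "fake_last.pt" with your actual checkpoint path.
torch.save({"epoch": 299, "model": torch.nn.Linear(2, 2).state_dict()},
           "fake_last.pt")

ckpt = torch.load("fake_last.pt", map_location="cpu")
print("stored epoch:", ckpt["epoch"])         # last completed epoch in the file
print("param tensors:", sorted(ckpt["model"]))  # e.g. ['bias', 'weight']
# If the path is wrong, or the state-dict keys don't match your cfg,
# loading fails and training effectively starts from scratch.
```

If the stored epoch and parameter names look right here but the run still behaves like a fresh start, the problem is more likely in how train.py is invoked (--resume vs. --weights) than in the file itself.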

qutyyds commented 2 years ago

I'd like to ask: when resuming training, the learning rate changes. How can I make sure it continues from the previous learning rate?

WongKinYiu commented 2 years ago

The epoch is used to look up the corresponding learning rate from the schedule.
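In other words, the scheduler is a pure function of the epoch, so restoring the epoch restores the learning rate. A minimal sketch, assuming the YOLOv5-style `LambdaLR` setup used in train.py (the exact `lr_lambda` below is illustrative, not the repo's actual hyperparameters):

```python
import math
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Illustrative cosine schedule over 300 epochs decaying to 20% of base LR
epochs = 300
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - 0.2) + 0.2
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

# On resume: rewind the scheduler to the checkpoint's epoch, so the next
# step() reproduces exactly the LR that epoch would have used.
start_epoch = 150                        # e.g. resumed from an epoch-150 ckpt
scheduler.last_epoch = start_epoch - 1
scheduler.step()
print(optimizer.param_groups[0]["lr"])   # ~0.006, the epoch-150 LR
```

Because the LR is computed from `last_epoch` rather than accumulated step by step, no optimizer or scheduler history beyond the epoch number is needed to continue with the correct learning rate.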