Open GreatV opened 5 months ago
Does the name of the pt file(${PRETRAIN_MODEL_PATH}) include 'yolov10'?
@sanha9999 no, just 'last.pt'
I think normally we need to retrain from last.pt, which is the default behavior of yolov8.
Thanks for your interest! We just made YOLOv10 the default in the CLI. Could you please update the codebase and try again? Thank you!
resume=True from CLI starts training from 1st epoch...
same here
@zylo117 you need set model=last.pt resume=True
Do you mean path/to/last.pt or just last.pt? The former crashes because it can't find the weights, and the latter just starts from epoch 1. Neither works.
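To rule out the path problem, it can help to resolve the checkpoint to an absolute path and confirm the file exists before launching. A minimal sketch, not from the thread: `build_resume_command` is a hypothetical helper, and the only CLI arguments it emits are the `model=` and `resume=` ones quoted in this issue.

```python
from pathlib import Path


def build_resume_command(ckpt: str) -> list[str]:
    """Return the argv for resuming training from `ckpt` (hypothetical helper)."""
    path = Path(ckpt).expanduser().resolve()
    if not path.is_file():
        # Fail early and show the resolved path, so a wrong working
        # directory is obvious from the error message.
        raise FileNotFoundError(f"checkpoint not found: {path}")
    return ["yolo", "detect", "train", f"model={path}", "resume=True"]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Passing an absolute path means the result no longer depends on the directory the command is run from.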
@zylo117 Have you tried running code updates?
it's the latest
Could you tell me the command you entered in the CLI?
Here is some log output. resume is always None; is this normal?
I ran with:
yolo detect train data=xxx.yaml model=path/to/last.pt epochs=100 batch=32 imgsz=640 device=0,1,2,3 cache=true amp=true save_period=1 plots=true resume=true
Training ran, and judging from the low loss I think resuming loaded the weights correctly, but the epoch starts from 1.
the epoch starts from 1 even though I set model=path/to/last.pt and resume=true
could you change command?
resume=true -> resume=True
I tried both true and True; the result is the same.
I think this is why resume is not working, but I don't know how it happens. resume is None here even though I set both model and resume.
After a lot of debugging, I found that self.resume somehow becomes None here, while ckpt holds the weights.
@zylo117 you may try setting exist_ok=True
Well, I had to force self.resume to True here, and that fixed it. Now it can resume training. ultralytics is just too damn unreliable. The COCO annotation conversion is unnecessary, and the whole framework seems to work for only one single dataset. It's just not meant for real-world tasks. I recommend mmdetection; it is a highly stable and customizable framework, and I really wish you would implement your next work on it.
@hyperf0cus you can try my solution
@zylo117 Now it resumes, but this approach is so strange and abnormal)
The reason is that the trainer of the pretrained.pt overwrites the arguments, which is why resume=False even if you pass resume=True, but I still don't know how to fix this.
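To make that failure mode concrete, here is a toy illustration in plain Python. It is not the Ultralytics code: `merge_args`, `saved_args`, and `cli_args` are hypothetical names, and the dicts only stand in for the arguments stored in the checkpoint and the ones passed on the command line.

```python
def merge_args(saved_args: dict, cli_args: dict, saved_wins: bool) -> dict:
    """Merge two argument dicts; the dict unpacked last wins on conflicts."""
    if saved_wins:
        return {**cli_args, **saved_args}  # checkpoint values overwrite the CLI
    return {**saved_args, **cli_args}      # CLI values overwrite the checkpoint


saved_args = {"epochs": 100, "resume": False}    # stored when training first started
cli_args = {"model": "last.pt", "resume": True}  # what the user asked for

buggy = merge_args(saved_args, cli_args, saved_wins=True)
fixed = merge_args(saved_args, cli_args, saved_wins=False)
print(buggy["resume"], fixed["resume"])  # False True
```

If the stored arguments are applied last, the user's `resume=True` is silently replaced. Applying the command-line values last keeps it, while settings such as `epochs` are still inherited from the checkpoint.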
Hi everyone! We found that this problem was fixed in ultralytics v8.1.41 (https://github.com/ultralytics/ultralytics/pull/9453), but yolov10 is based on v8.1.34, so we fixed it manually in our codebase. Please update the code to the latest version. Thank you so much!
Hi all. While the resume problem is resolved, the optimizer resets to the default (auto), which also resets the lr and momentum. Although the optimizer state is written to the checkpoint, it cannot be loaded when resuming training, so the parameter trends change and, in my case, the losses start to rise. The epoch number also starts from 1 rather than continuing.
So I just had to make sure the optimizer state isn't stripped from the checkpoints and now it's working properly
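A small check along these lines can confirm whether a checkpoint still carries what a resume needs. This is a sketch over a plain dict. The `epoch` and `optimizer` keys, and the convention that a stripped checkpoint has `optimizer` set to `None` and a negative `epoch`, are assumptions about the checkpoint layout and are not confirmed in this thread.

```python
def resume_info(ckpt: dict) -> tuple[bool, int]:
    """Return (has_optimizer_state, start_epoch) for a checkpoint-like dict."""
    has_optimizer = ckpt.get("optimizer") is not None
    epoch = ckpt.get("epoch", -1)
    # Continue from the epoch after the saved one; fall back to 0 if unknown.
    start_epoch = epoch + 1 if epoch is not None and epoch >= 0 else 0
    return has_optimizer, start_epoch


full = {"epoch": 41, "optimizer": {"state": {}, "param_groups": []}}
stripped = {"epoch": -1, "optimizer": None}
print(resume_info(full))      # (True, 42)
print(resume_info(stripped))  # (False, 0)
```

For a real file, the dict would come from loading the checkpoint with `torch.load`. A result of `(False, 0)` would match the symptoms described above: the optimizer restarts from defaults and the epoch counter restarts.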