resume not work - Githubissues

GreatV commented 5 months ago

yolo task=detect mode=train model=${PRETRAIN_MODEL_PATH} data=${DATA_CONF_PATH} \
    project=${MODEL_DIR} name=${EXP_NAME} exist_ok=True epochs=100 batch=8 \
    imgsz=640 device=0 warmup_epochs=10 mosaic=0.25 resume=True

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/project/yolov10/ultralytics/nn/tasks.py", line 93, in forward
    return self.loss(x, *args, **kwargs)
  File "/project/yolov10/ultralytics/nn/tasks.py", line 275, in loss
    return self.criterion(preds, batch)
  File "/project/yolov10/ultralytics/utils/loss.py", line 201, in __call__
    pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
  File "/project/yolov10/ultralytics/utils/loss.py", line 201, in <listcomp>
    pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
AttributeError: 'str' object has no attribute 'view'

sanha9999 commented 5 months ago

Does the name of the .pt file (${PRETRAIN_MODEL_PATH}) include 'yolov10'?

GreatV commented 5 months ago

@sanha9999 no, just 'last.pt'

GreatV commented 5 months ago

I think we would normally resume training from last.pt, which is the default behavior of YOLOv8.

jameslahm commented 5 months ago

Thanks for your interest! We just made YOLOv10 the default in the CLI. Could you please update the codebase and try again? Thank you!

hyperf0cus commented 5 months ago

resume=True from CLI starts training from 1st epoch...

zylo117 commented 5 months ago

> resume=True from CLI starts training from 1st epoch...

same here

GreatV commented 5 months ago

@zylo117 you need to set model=last.pt resume=True

zylo117 commented 5 months ago

> @zylo117 you need to set model=last.pt resume=True

Do you mean path/to/last.pt or just last.pt? The former crashes because it can't find the weights, and the latter just starts from epoch 1. Neither works.

sanha9999 commented 5 months ago

@zylo117 Have you tried updating the code?

zylo117 commented 5 months ago

> @zylo117 Have you tried updating the code?

it's the latest

sanha9999 commented 5 months ago

Could you tell me the command you entered in the CLI?

zylo117 commented 5 months ago

Here is some of the log. resume is always None; is this normal? I ran with yolo detect train data=xxx.yaml model=path/to/last.pt epochs=100 batch=32 imgsz=640 device=0,1,2,3 cache=true amp=true save_period=1 plots=true resume=true

zylo117 commented 5 months ago

Training works, and judging from the low loss I think resuming loads the weights correctly, but the epoch count starts from 1.

zylo117 commented 5 months ago

> Here is some of the log. resume is always None; is this normal? I ran with yolo detect train data=xxx.yaml model=path/to/last.pt epochs=100 batch=32 imgsz=640 device=0,1,2,3 cache=true amp=true save_period=1 plots=true resume=true

The epoch starts from 1 even though I set model=path/to/last.pt and resume=true.

sanha9999 commented 5 months ago

Could you change the command?

resume=true -> resume=True

zylo117 commented 5 months ago

> Could you change the command?
> resume=true -> resume=True

I tried both true and True; the result is the same.
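That true and True behave the same is what one would expect if the command-line parser compares boolean values case-insensitively. A minimal sketch of such a parser follows; `parse_bool` is a hypothetical helper written for illustration and is not the parser used by the yolo CLI.

```python
def parse_bool(text):
    """Convert a CLI string such as "true" or "True" to a bool.

    Hypothetical helper for illustration only.
    """
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Both spellings map to the same Python value.
print(parse_bool("true"), parse_bool("True"))  # True True
```

If the parser works this way, the capitalization of resume=true cannot be the cause of the restart from epoch 1.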

zylo117 commented 5 months ago

[Screenshot] I think this is why resume is not working, but I don't know how it happens: resume is None here, even though I set both model and resume.

zylo117 commented 5 months ago

After a lot of debugging, I found that self.resume somehow becomes None here, while ckpt holds the weights.

GreatV commented 5 months ago

@zylo117 you could try setting exist_ok=True

zylo117 commented 5 months ago

> After a lot of debugging, I found that self.resume somehow becomes None here, while ckpt holds the weights.

Well, I had to force self.resume to be True here, and that fixed it. Now it can resume training. ultralytics is just too damn unreliable. The COCO annotation conversion is unnecessary, and the whole framework seems to work for only one single dataset. It's just not meant for a real-world task. I recommend mmdetection; it is a highly stable and customizable framework, and I really wish you would implement your next work on it.

@hyperf0cus you can try my solution

hyperf0cus commented 5 months ago

@zylo117 Now it resumes, but this approach is so strange and abnormal)

JulioZhao97 commented 5 months ago

The reason is that the trainer arguments stored in the pretrained .pt file overwrite the arguments you pass, which is why resume=False even if you pass resume=True. I still don't know how to fix this.
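A minimal sketch of the failure mode described above, using plain dictionaries rather than the actual ultralytics trainer code. The names `merge_args`, `ckpt_args`, and `cli_args` are hypothetical, as are the stored values; the point is only that merge order decides whether a saved `resume` value wins over the one passed on the command line.

```python
def merge_args(ckpt_args, cli_args, cli_wins):
    """Merge saved checkpoint arguments with command-line arguments.

    Hypothetical helper for illustration; not part of ultralytics.
    """
    if cli_wins:
        # Later keys win, so the command-line value of "resume" is kept.
        return {**ckpt_args, **cli_args}
    # Saved values are applied last and silently replace the CLI values.
    return {**cli_args, **ckpt_args}


# Arguments assumed to be stored in the checkpoint from the first run.
ckpt_args = {"epochs": 100, "batch": 32, "resume": False}
# Arguments passed when trying to resume.
cli_args = {"model": "path/to/last.pt", "resume": True}

overwritten = merge_args(ckpt_args, cli_args, cli_wins=False)
preserved = merge_args(ckpt_args, cli_args, cli_wins=True)
print(overwritten["resume"], preserved["resume"])  # False True
```

Under this assumption, a fix would either apply the command-line values last or explicitly restore `resume` after the saved arguments are loaded.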

leonnil commented 5 months ago

Hi everyone! We found that this problem was fixed in ultralytics v8.1.41 (https://github.com/ultralytics/ultralytics/pull/9453), but yolov10 is based on v8.1.34, so we fixed it manually in our codebase. Please update the code to the latest version. Thank you so much!

babakbch commented 2 months ago

Hi all. While the resume problem is resolved, the optimizer resets to the default (auto), which also resets the lr and momentum. Although the optimizer state is written to the checkpoint, it cannot be loaded when resuming training, so the parameter trend changes and, in my case, the losses start to rise. Also, the epoch number starts from 1 rather than continuing.

babakbch commented 2 months ago

> Hi all. While the resume problem is resolved, the optimizer resets to the default (auto), which also resets the lr and momentum. Although the optimizer state is written to the checkpoint, it cannot be loaded when resuming training, so the parameter trend changes and, in my case, the losses start to rise. Also, the epoch number starts from 1 rather than continuing.

So I just had to make sure the optimizer state isn't stripped from the checkpoints, and now it's working properly.
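A minimal sketch of why a stripped checkpoint cannot be resumed, using a plain dictionary in place of a real checkpoint. The helpers `strip_checkpoint` and `start_epoch`, the key names, and the numeric values are all assumptions made for illustration; they are not taken from the yolov10 or ultralytics code.

```python
import copy


def strip_checkpoint(ckpt, keep_optimizer):
    """Return a copy of a checkpoint dict, optionally dropping optimizer state.

    Hypothetical helper; the key names are assumptions for illustration.
    """
    out = copy.deepcopy(ckpt)
    if not keep_optimizer:
        out["optimizer"] = None  # smaller file, but lr and momentum state are lost
        out["epoch"] = -1        # marks the checkpoint as not resumable
    return out


def start_epoch(ckpt):
    """Pick the epoch to continue from, or 0 if the checkpoint was stripped."""
    if ckpt["optimizer"] is None or ckpt["epoch"] < 0:
        return 0
    return ckpt["epoch"] + 1


# Made-up checkpoint contents after 42 completed epochs (zero-based index 41).
ckpt = {"epoch": 41, "optimizer": {"lr": 0.0012, "momentum": 0.937}}
print(start_epoch(strip_checkpoint(ckpt, keep_optimizer=False)))  # 0
print(start_epoch(strip_checkpoint(ckpt, keep_optimizer=True)))   # 42
```

With the optimizer state kept, training can continue from the next epoch with the saved lr and momentum; with it stripped, only the weights survive and training restarts from the first epoch.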

THU-MIG / yolov10

resume not work #184