Chris-hughes10 / Yolov7-training

A clean, modular implementation of the Yolov7 model family, which uses the official pretrained weights, with utilities for training the model on custom (non-COCO) tasks.
GNU General Public License v3.0
116 stars, 35 forks

Resume training #13

Closed nemeziz69 closed 1 year ago

nemeziz69 commented 1 year ago

My training stopped because my PC accidentally shut down. Is it possible to resume the training? If yes, how do I do it?

Daraan commented 1 year ago

Please correct me if I'm wrong. They use the pytorch-accelerated framework here, and I do not see checkpointing activated by default, so I sadly doubt that the run can be recovered.

You have to manually enable checkpointing beforehand via a trainer callback; see https://pytorch-accelerated.readthedocs.io/en/latest/callbacks.html.

varshanth commented 1 year ago

From the example provided in train_cars.py, the best model is saved as "best_model.pt". You can use the load_checkpoint method before starting training to resume from the last checkpoint. The optimizer state_dict should also be saved, so you should be able to resume training with the checkpointed LR as well. You just need to modify your existing training script a bit.
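For illustration, here is a minimal plain-PyTorch sketch of why a resumable checkpoint needs the optimizer state_dict alongside the model weights (this is the underlying mechanism, not the pytorch-accelerated Trainer API; all names and shapes here are made up for the example):

```python
import torch

# Toy model and optimizer whose state we want to checkpoint.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937, nesterov=True)

# Take one step so the optimizer has momentum buffers worth saving.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Save everything needed to resume: model weights + optimizer state.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "best_model.pt",
)

# --- later, e.g. in a fresh process: rebuild the SAME architecture first ---
resumed_model = torch.nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(
    resumed_model.parameters(), lr=0.01, momentum=0.937, nesterov=True
)

loaded = torch.load("best_model.pt", weights_only=False)
resumed_model.load_state_dict(loaded["model_state_dict"])
resumed_optimizer.load_state_dict(loaded["optimizer_state_dict"])

print(resumed_optimizer.param_groups[0]["lr"])  # 0.01 -- checkpointed LR restored
```

The important part is that the momentum buffers and the LR travel with the optimizer state_dict; restoring only the model weights would restart the optimizer from scratch.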

nemeziz69 commented 1 year ago

> From the example provided in the train_cars.py, the best model is saved as "best_model.pt". You can use the load_checkpoint method before you start the training to resume from the last checkpoint. The optimizer state_dict should also be saved so you should be able to resume training with the checkpointed LR as well. You just need to modify your existing training script a bit.

Following @varshanth's suggestion, I tried using the load_checkpoint method before trainer.train to resume my training. Something like this:

    if RESUME_LOCAL_PATH is not None:
        print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
        trainer.load_checkpoint(RESUME_LOCAL_PATH)
        print(optimizer)

    # run training
    trainer.train(
        num_epochs=num_epochs,
        train_dataset=train_yds,
        eval_dataset=eval_yds,
        per_device_batch_size=batch_size,
        create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
            num_warmup_epochs=NUM_WARMUP_EPOCH,
            num_cooldown_epochs=NUM_COOLDOWN_EPOCH,
            k_decay=2,
        ),
        collate_fn=yolov7_collate_fn,
        gradient_accumulation_steps=num_accumulate_steps,
    )

But this error occurs:

Traceback (most recent call last):
  File "train.py", line 405, in <module>
    main()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\func_to_script\core.py", line 108, in scripted_function
    return func(**args)
  File "train.py", line 389, in main
    trainer.train(
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 468, in train
    self._run_training()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 679, in _run_training
    self._run_train_epoch(self._train_dataloader)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 758, in _run_train_epoch
    self.optimizer_step()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 323, in optimizer_step
    self.optimizer.step()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\accelerate\optimizer.py", line 134, in step
    self.optimizer.step(closure)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\sgd.py", line 110, in step
    F.sgd(params_with_grad,
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\_functional.py", line 173, in sgd
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 1

Is my implementation correct?

varshanth commented 1 year ago

> According to @varshanth , I tried to use load_checkpoint method before trainer.train to resume my training. [...] Is my implementation correct?

The error you received basically says that the model you instantiated and the model you loaded through the checkpoint are not the same. There is a mismatch between the parameter shapes the optimizer's momentum buffers were built for and the parameters loaded through the checkpoint for that particular layer. Can you please double-check that the model you trained and the model you loaded are identical, with no changes made between the save and the load?
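A minimal reproduction of that failure mode (the shapes here are illustrative, not taken from the actual YOLOv7 model): build optimizer state against a 128-unit layer, load it into an optimizer for a 64-unit layer, and the first step() fails in the momentum update, just as in the traceback:

```python
import torch

# Optimizer state built for a 128-unit layer (standing in for the checkpointed model).
wide = torch.nn.Linear(4, 128)
opt_wide = torch.optim.SGD(wide.parameters(), lr=0.01, momentum=0.937)
wide(torch.randn(2, 4)).sum().backward()
opt_wide.step()  # momentum buffers now have shapes (128, 4) and (128,)

# A structurally different model: 64 units instead of 128.
narrow = torch.nn.Linear(4, 64)
opt_narrow = torch.optim.SGD(narrow.parameters(), lr=0.01, momentum=0.937)

# load_state_dict does NOT validate buffer shapes, so the load itself "succeeds"...
opt_narrow.load_state_dict(opt_wide.state_dict())

# ...and the mismatch only surfaces on the first optimizer step, inside the
# momentum update (buf.mul_(momentum).add_(d_p, ...) in the traceback).
narrow(torch.randn(2, 4)).sum().backward()
mismatch_error = None
try:
    opt_narrow.step()
except RuntimeError as err:
    mismatch_error = str(err)
print(mismatch_error)
```

Because the load itself raises nothing, the error appears far from its cause, which is why this is easy to misread as a training bug rather than a checkpoint/model mismatch.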

nemeziz69 commented 1 year ago

Hi @varshanth , I confirm that the trained model and the loaded model are the same, with no changes made between save and load. However, when printing the optimizer before and after the load_checkpoint call, there is a difference: an entry called initial_lr appears:

print(optimizer)
if RESUME_LOCAL_PATH is not None:
    print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
    trainer.load_checkpoint(RESUME_LOCAL_PATH)
    print(optimizer)

Output:

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)
Resume load checkpoint from: 230530_145306_ep1_0.00771408797687863_best_model.pt
SGD (
Parameter Group 0
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)

Could this be the issue?
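For reference, the initial_lr entry by itself is expected: PyTorch LR schedulers stash the starting LR in each param group under that key when they are attached, and a tiny lr like 1e-06 is consistent with a checkpoint saved mid-warmup. A small plain-PyTorch illustration (using LambdaLR rather than pytorch-accelerated's CosineLrScheduler):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
print("initial_lr" in optimizer.param_groups[0])  # False -- no scheduler yet

# Attaching a scheduler records the starting LR as `initial_lr` and
# immediately applies the schedule for step 0 (here: a linear warmup).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 100)
)
print(optimizer.param_groups[0]["initial_lr"])  # 0.01 -- the base LR
print(optimizer.param_groups[0]["lr"])          # 0.0001 -- warmup step 0
```

So the changed keys show that the checkpointed optimizer had a scheduler attached and was still warming up; the RuntimeError itself still points at a parameter-shape mismatch rather than at the LR values.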