Please correct me if I'm wrong. This repo uses the pytorch-accelerated framework, and I don't see checkpointing enabled by default, so unfortunately I doubt it is possible to recover the run.
You have to manually enable checkpointing beforehand with a trainer callback; see https://pytorch-accelerated.readthedocs.io/en/latest/callbacks.html.
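For reference, here is a minimal sketch of enabling checkpointing through a callback, based on the docs linked above. SaveBestModelCallback and DEFAULT_CALLBACKS come from pytorch-accelerated; model, loss_func, and optimizer stand in for your existing objects, and the watch_metric value is an assumption that must match a metric your run actually records.

from pytorch_accelerated import Trainer
from pytorch_accelerated.callbacks import SaveBestModelCallback
from pytorch_accelerated.trainer import DEFAULT_CALLBACKS

# model, loss_func and optimizer are placeholders for your existing objects.
trainer = Trainer(
    model=model,
    loss_func=loss_func,
    optimizer=optimizer,
    callbacks=[
        # Saves a checkpoint to save_path whenever the watched metric improves;
        # watch_metric here is an assumption, use one your trainer records.
        SaveBestModelCallback(save_path="best_model.pt", watch_metric="eval_loss_epoch"),
        *DEFAULT_CALLBACKS,
    ],
)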
From the example provided in train_cars.py, the best model is saved as "best_model.pt". You can use the load_checkpoint method before you start training to resume from the last checkpoint. The optimizer state_dict should also be saved, so you should be able to resume training with the checkpointed LR as well. You just need to modify your existing training script a bit.
Following @varshanth's suggestion, I tried using the load_checkpoint method before trainer.train to resume my training. Something like this:
if RESUME_LOCAL_PATH is not None:
    print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
    trainer.load_checkpoint(RESUME_LOCAL_PATH)
    print(optimizer)

# run training
trainer.train(
    num_epochs=num_epochs,
    train_dataset=train_yds,
    eval_dataset=eval_yds,
    per_device_batch_size=batch_size,
    create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
        num_warmup_epochs=NUM_WARMUP_EPOCH,
        num_cooldown_epochs=NUM_COOLDOWN_EPOCH,
        k_decay=2,
    ),
    collate_fn=yolov7_collate_fn,
    gradient_accumulation_steps=num_accumulate_steps,
)
But this error occurred:
Traceback (most recent call last):
File "train.py", line 405, in <module>
main()
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\func_to_script\core.py", line 108, in scripted_function
return func(**args)
File "train.py", line 389, in main
trainer.train(
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 468, in train
self._run_training()
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 679, in _run_training
self._run_train_epoch(self._train_dataloader)
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 758, in _run_train_epoch
self.optimizer_step()
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 323, in optimizer_step
self.optimizer.step()
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\accelerate\optimizer.py", line 134, in step
self.optimizer.step(closure)
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\sgd.py", line 110, in step
F.sgd(params_with_grad,
File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\_functional.py", line 173, in sgd
buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 1
Is my implementation correct?
The error you received basically says that the model you instantiated and the model you loaded through the checkpoint are not the same. For that particular layer, there is a mismatch between the parameter shapes the optimizer's momentum buffers were created for and the parameter shapes loaded from the checkpoint. Can you please double-check that the model you trained and the model you loaded are identical, with no changes made between the save and the load?
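One quick way to check is to diff the parameter shapes directly. A minimal diagnostic sketch, assuming the checkpoint is a dict with a "model_state_dict" key as pytorch-accelerated's save_checkpoint writes it (adjust the key if your layout differs); model stands in for your instantiated model:

import torch

# Placeholder: `model` is your freshly instantiated model.
ckpt = torch.load("best_model.pt", map_location="cpu")
ckpt_state = ckpt.get("model_state_dict", ckpt)  # fall back to a raw state_dict

for name, param in model.state_dict().items():
    if name not in ckpt_state:
        print(f"missing from checkpoint: {name}")
    elif tuple(ckpt_state[name].shape) != tuple(param.shape):
        print(
            f"shape mismatch at {name}: checkpoint "
            f"{tuple(ckpt_state[name].shape)} vs model {tuple(param.shape)}"
        )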
Hi @varshanth, I confirm that the trained model and the loaded model are the same, with no changes made. However, when printing the optimizer before and after the load_checkpoint call, there is one difference, an initial_lr entry:
print(optimizer)
if RESUME_LOCAL_PATH is not None:
    print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
    trainer.load_checkpoint(RESUME_LOCAL_PATH)
    print(optimizer)
Output:
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)
Resume load checkpoint from: 230530_145306_ep1_0.00771408797687863_best_model.pt
SGD (
Parameter Group 0
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)
Could this be the issue?
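For context, initial_lr is written into each param group by PyTorch's LR schedulers when they wrap an optimizer. This self-contained sketch (plain torch, using CosineAnnealingLR purely as an example scheduler) shows the key appearing:

import torch

# A trivial parameter list stands in for a real model here.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.937)
print("initial_lr" in optimizer.param_groups[0])  # False

# Constructing any LR scheduler stamps initial_lr into each param group.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
print("initial_lr" in optimizer.param_groups[0])  # True
print(optimizer.param_groups[0]["initial_lr"])    # 0.01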
My training stopped because my PC accidentally shut down. Is it possible to resume the training? If yes, how do I do it?