Closed AdnanMunir294 closed 1 year ago
While I was training YOLO-NAS my system got stuck, so all I have is the ckpt_best.pth file. How do I resume my training for the remaining epochs? Thanks
To resume, you just need to set training_params.resume = True
Feel free to check the documentation on loading checkpoints if you're interested; it explains more generally how to use checkpoints, with a small section specifically on resuming (Resuming Training).
@Louis-Dupont can you please write the command-line code?
Let's say you trained with training_hyperparams.max_epochs=50 and now want to run for 300 instead.

If you are training with a recipe:

python -m super_gradients.train_from_recipe --config-name=<config-name> experiment_name=<experiment-name> training_hyperparams.resume=True training_hyperparams.max_epochs=300

Make sure to keep the same experiment_name as in your previous training, since it will be used to locate and update the checkpoints.
If you want to create a new experiment based on an existing checkpoint, I invite you to check out this documentation page: https://docs.deci.ai/super-gradients/documentation/source/Checkpoints.html#loading-checkpoints

If you are running from a script:

...
training_params = {...}  # Whatever you have been using
training_params['resume'] = True  # This lets the Trainer know to load the existing checkpoint and start from the last epoch
training_params['max_epochs'] = 300  # Only if you want to change the number of epochs compared to the previous run

# The following remains the same
trainer.train(model=model,
              training_params=training_params,
              train_loader=train_dataloader,
              valid_loader=valid_dataloader)
Does this answer your question? @absmahi01 @AdnanMunir294
Mr. Adnan Munir Graduate Assistant Dept of Computer Engineering King Fahd University of Petroleum and Minerals
@Louis-Dupont but where will we load our ".pth" weights file in the script?
@AdnanMunir294 exactly, resume_path allows you to specify which checkpoint you want to use to resume your experiment. Note that when using resume_path you don't need to set resume=True :)

@absmahi01, when you set resume=True, the Trainer will load the checkpoint named ckpt_latest.pth in your experiment directory (<ckpt_root_dir>/<experiment_name>/ckpt_latest.pth). This is usually what you want, because you don't need to know the path of your checkpoint. But if you are in the same situation as @AdnanMunir294 and have a custom name for your .pth checkpoint, you might want to use resume_path instead.

Note that this is for resuming an experiment, not for loading a checkpoint from another experiment.
You can find more information about checkpoints in the documentation :)
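To make the two options above concrete, here is a minimal sketch of the resume_path variant. The checkpoint path and experiment name below are hypothetical; substitute your own, and note that resume=True is deliberately absent since resume_path replaces it:

```python
# Hedged sketch: resuming from a specifically-named checkpoint via
# resume_path instead of resume=True. The path is hypothetical; point it
# at your own file under <ckpt_root_dir>/<experiment_name>/.
training_params = {
    # ... your existing hyperparameters go here ...
    'resume_path': 'checkpoints/my_experiment/ckpt_best.pth',
}
# With resume_path set, there is no need to also set resume=True.

# The training call itself is unchanged (trainer, model and dataloaders
# come from your existing script):
# trainer.train(model=model,
#               training_params=training_params,
#               train_loader=train_dataloader,
#               valid_loader=valid_dataloader)
```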
Thank you for the clarification. Yes, my model started training.
train_params["resume_path"] = 'checkpoints/base_weapons11_250/ckpt_latest.pth'

I'm getting a new log file and I lost my previous weights; it started training from scratch. This is very bad for me because it's generating new logs and starting from new weights.
Should I set resume=True before starting the training of the model? The last .pth file is not even working.
Can you please provide more detailed documentation on resuming from the last .pth so that it updates the same log file instead of creating a different one?
@Louis-Dupont the size of the model I'm getting is very high (628 MB). Don't you think it will make the model slower? Or should I optimise it further?
Thanks, but during the validation phase of the resumed training, this happened: torch.distributed.elastic.multiprocessing.api [ERROR] - failed (exitcode: -9). How can I solve it, or should I modify some parameters?
Another thing that puzzles me: why doesn't resuming work when I set resume: True in the YAML file, but it does work when I add training_hyperparams.resume=True on the command line?
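One possible explanation, offered as an assumption worth verifying against your recipe rather than a confirmed answer: the command-line override training_hyperparams.resume=True places the key under training_hyperparams, so a resume: True placed at a different level of the YAML (e.g. at the top level) would silently have no effect. A recipe fragment with the placement the override implies might look like:

```
# Hypothetical recipe fragment; the key placement is the point here.
training_hyperparams:
  resume: True
  max_epochs: 300
```

If your resume: True sits elsewhere in the file, moving it under training_hyperparams may reconcile the two behaviours.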