Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0
4.54k stars · 496 forks

How to resume training for Yolo-NAS? #940

Closed AdnanMunir294 closed 1 year ago

AdnanMunir294 commented 1 year ago



AdnanMunir294 commented 1 year ago

While I was training Yolo-NAS my system got stuck, so all I have is the best-checkpoint file (ckpt_best.pth). How do I resume training for the remaining epochs? Thanks

Louis-Dupont commented 1 year ago

To resume, you just need to set training_params.resume = True. Feel free to check the documentation on loading checkpoints if you're interested; it explains more generally how to use checkpoints, with a small section specifically on resuming (Resuming Training).

absmahi01 commented 1 year ago

@Louis-Dupont can you please write the command-line code?

Louis-Dupont commented 1 year ago

Let's say you trained on training_hyperparams.max_epochs=50 and now want to run for 300 instead:

If you are training with a recipe

python -m super_gradients.train_from_recipe --config-name=<config-name> experiment_name=<experiment-name> training_hyperparams.resume=True training_hyperparams.max_epochs=300

Make sure to keep the same experiment_name as in your previous training since it will be used to locate and update the checkpoints.

If you want to create a new experiment based on an existing checkpoint, I invite you to check out this documentation page

If you are running from a script

...

training_params = {...} # Whatever you have been using
training_params['resume'] = True # This will let the Trainer know to load the existing checkpoint and start from the last epoch
training_params['max_epochs'] = 300 # Only if you want to change the number of epochs compared to the previous run

# The following remains the same
trainer.train(model=model, 
              training_params=training_params, 
              train_loader=train_dataloader,
              valid_loader=valid_dataloader)

Does this answer your question? @absmahi01 @AdnanMunir294
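The resume flow described above can be illustrated with a toy stand-in for the trainer loop (plain Python, not SuperGradients' actual implementation; `toy_train` and the JSON checkpoint format are invented for illustration): each epoch writes a checkpoint recording the epoch number, and resume=True restarts from the epoch stored in the latest checkpoint.

```python
import json
from pathlib import Path

def toy_train(ckpt_dir, max_epochs, resume=False):
    """Toy training loop: each 'epoch' saves a checkpoint with its number.

    With resume=False the loop starts at epoch 0; with resume=True it reads
    ckpt_latest.json and continues from the epoch after the one saved there,
    mimicking what training_params['resume'] = True does conceptually.
    """
    ckpt = Path(ckpt_dir) / "ckpt_latest.json"
    start_epoch = 0
    if resume and ckpt.exists():
        start_epoch = json.loads(ckpt.read_text())["epoch"] + 1
    for epoch in range(start_epoch, max_epochs):
        # real training of one epoch would happen here
        ckpt.write_text(json.dumps({"epoch": epoch}))
    return start_epoch
```

A first run with max_epochs=50 saves epochs 0..49; a resumed run with max_epochs=300 then picks up at epoch 50, which matches the behaviour described above for the real trainer.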

AdnanMunir294 commented 1 year ago

But where do we load our ".pth" weights file in the script?

Mr. Adnan Munir Graduate Assistant Dept of Computer Engineering King Fahd University of Petroleum and Minerals


AdnanMunir294 commented 1 year ago

[image: image.png] Please help, I am getting this error.


absmahi01 commented 1 year ago

@Louis-Dupont but where we will load our ".pth" weights file in script ?

AdnanMunir294 commented 1 year ago

Done.

train_params['resume'] = True
train_params["resume_path"] = '/content/checkpoints/my_first_yolonas_run/ckpt_best9.pth'
trainer.train(model=model, training_params=train_params, train_loader=train_data, valid_loader=val_data)


Louis-Dupont commented 1 year ago

@AdnanMunir294 exactly, resume_path allows you to specify which checkpoint you want to use to resume your experiment. Note that when using resume_path you don't need to set resume=True :)

Louis-Dupont commented 1 year ago

@absmahi01, when you set resume=True, the trainer will load the checkpoint named ckpt_latest.pth from your experiment directory (<ckpt_root_dir>/<experiment_name>/ckpt_latest.pth). This is usually what you want, because you don't need to know the path of your checkpoint.

But if you are in the same situation as @AdnanMunir294 , and have a custom name for your .pth checkpoint, you might want to use resume_path.

Note this is for resuming an experiment, not for loading a checkpoint from another experiment.

You can find more information about checkpoints in the documentation :)
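The two rules above (resume=True falls back to ckpt_latest.pth in the experiment directory; resume_path points at an explicit file) can be sketched as a small helper. This is a hypothetical function for illustration only, not SuperGradients' internal code; the name `resolve_resume_checkpoint` and the precedence of resume_path over resume are assumptions based on the behaviour described in this thread.

```python
from pathlib import Path

def resolve_resume_checkpoint(ckpt_root_dir, experiment_name,
                              resume=False, resume_path=None):
    """Pick the checkpoint to resume from, following the rules above.

    An explicit resume_path wins; otherwise resume=True falls back to
    <ckpt_root_dir>/<experiment_name>/ckpt_latest.pth; otherwise a fresh
    run starts with no checkpoint at all.
    """
    if resume_path:  # explicit file, e.g. a custom ckpt_best9.pth
        return Path(resume_path)
    if resume:       # default: latest checkpoint of this experiment
        return Path(ckpt_root_dir) / experiment_name / "ckpt_latest.pth"
    return None      # fresh training run
```

For example, resolve_resume_checkpoint("checkpoints", "my_run", resume=True) yields checkpoints/my_run/ckpt_latest.pth, while passing resume_path="ckpt_best9.pth" returns that file regardless of the resume flag.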

AdnanMunir294 commented 1 year ago

Thank you for the clarification. Yes, my model has started training.


akashAD98 commented 1 year ago

train_params['resume'] = True
train_params["resume_path"] = 'checkpoints/base_weapons11_250/ckpt_latest.pth'

I'm getting a new log file and I lost my previous weights; training started from scratch. This is very bad for me because it generates new logs and starts from new weights.

Should I set resume=True before starting training of the model? The last .pt file is not even working.

Can you please provide more detailed documentation on resuming from the last checkpoint so that it updates the same log file instead of creating a different one?

akashAD98 commented 1 year ago

@Louis-Dupont the model I'm getting is very large (628 MB). Don't you think it will make the model slower? Or should I optimize it further?

lsm140 commented 10 months ago


Thanks, but during the validation phase of the resumed training I got: [torch.distributed.elastic.multiprocessing.api] [ERROR] - failed (exitcode: -9). How can I solve this? Or should I modify some parameters?

lsm140 commented 10 months ago


Another thing that puzzles me: why doesn't resuming work when I set resume: True in the YAML file, yet it works when I add training_hyperparams.resume=True on the command line?