How to resume training on the previous checkpoint

hassanbadawy commented 1 year ago

💡 Your Question

Hi Team, thanks for giving us this masterpiece. I trained a model successfully using the tutorials you provided, and I now have a model in ./checkpoints/exp1/ckpt_best.pth. I want to load this model and resume training. I tried super_gradients.training.models.get(model_path), but it expects a known model name such as 'yolo_nas_l'. Can you help me?

Versions

No response

binrey commented 1 year ago

Set "resume": True in the training params, and use the same experiment_name as the original run when initializing the trainer:

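train_params = {
    "resume": True,  # continue from this experiment's saved checkpoint
...
}
# The trainer must have been created with the same experiment_name as the original run.
trainer.train(model=model,
              training_params=train_params,
              train_loader=train_data,
              valid_loader=val_data)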
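
A slightly fuller sketch of the same idea, assuming the trainer is recreated with the experiment name and checkpoint root directory of the original run. The helper names build_resume_params and resume_training are hypothetical and not part of SuperGradients. The Trainer import is done inside the function so that the parameter helper can be tried without the library installed.

```python
def build_resume_params(base_params, max_epochs=None):
    """Return a copy of the original training params with resume enabled.

    Hypothetical helper: it only edits the dict and does not touch any checkpoint.
    """
    params = dict(base_params)  # shallow copy, the original dict stays unchanged
    params["resume"] = True
    if max_epochs is not None:
        # Assumption: a run that already reached max_epochs needs a larger
        # value here before it will train for additional epochs.
        params["max_epochs"] = max_epochs
    return params


def resume_training(experiment_name, ckpt_root_dir, model,
                    train_loader, valid_loader, train_params):
    """Continue a run; experiment_name must match the original run."""
    # Imported lazily so build_resume_params works without super_gradients installed.
    from super_gradients import Trainer

    trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=ckpt_root_dir)
    trainer.train(model=model,
                  training_params=train_params,
                  train_loader=train_loader,
                  valid_loader=valid_loader)
    return trainer
```

For the checkpoint path in the question, this would presumably be called with experiment_name="exp1" and ckpt_root_dir="./checkpoints", passing a model built the same way as in the first run and the params returned by build_resume_params(train_params).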

AlimTleuliyev commented 1 year ago

Hello @binrey

I trained the model using these params:

train_params = {
    "average_best_models":True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    # TRAINING FOR 100 EPOCHS
    "max_epochs": 100,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        # NOTE: num_classes needs to be defined here
        num_classes=config.NUM_CLASSES,
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            # NOTE: num_classes needs to be defined here
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}

The training has finished, but I think I need to train for another 50-100 epochs. I want to resume from last_ckpt.pth. What should I do? Also, when I start the training again, what happens to the learning rate? How does that work?
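
On the learning-rate part of the question, a rough way to reason about it is to evaluate the schedule by hand. The sketch below uses the textbook form of linear warmup followed by cosine decay, with the values from the params above. It is an assumption that this matches the library's "linear_epoch_step" and "cosine" modes closely enough for intuition, and the function name lr_at_epoch is hypothetical.

```python
import math


def lr_at_epoch(epoch, max_epochs, initial_lr=5e-4, final_lr_ratio=0.1,
                warmup_epochs=3, warmup_initial_lr=1e-6):
    """Epoch-level LR for linear warmup followed by cosine decay (textbook form)."""
    if epoch < warmup_epochs:
        # Linear ramp starting at warmup_initial_lr and heading toward initial_lr.
        step = (initial_lr - warmup_initial_lr) / warmup_epochs
        return warmup_initial_lr + epoch * step
    # Cosine decay from initial_lr down to initial_lr * final_lr_ratio.
    progress = epoch / max_epochs
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return initial_lr * (final_lr_ratio + (1.0 - final_lr_ratio) * cosine)


end_of_first_run = lr_at_epoch(100, max_epochs=100)  # schedule fully decayed
resumed_longer = lr_at_epoch(100, max_epochs=200)    # same epoch, longer horizon
print(f"{end_of_first_run:.2e} {resumed_longer:.2e}")
```

Under these assumptions, the rate at epoch 100 is initial_lr * cosine_final_lr_ratio (5e-5) when max_epochs is 100, but about 2.75e-4 when max_epochs is 200. So if the schedule is recomputed from the new max_epochs on resume, the learning rate would jump back up at the point of resuming. Whether the library behaves exactly this way should be checked against its documentation.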

hassanbadawy commented 1 year ago

Thank you so much, appreciate it.

Minh-Tu-Cao commented 1 year ago

@hassanbadawy is your issue fixed?

hassanbadawy commented 1 year ago

Hi Deci, Thank you for your support, it works well. Regards, Hasan

harpreetsahota204 commented 1 year ago

Hi @binrey @hassanbadawy @Minh-Tu-Cao @AlimTleuliyev !

Thanks for coming to each other's aid on this issue. I'm gathering some feedback on SuperGradients and YOLO-NAS.

Would you be down for a quick call to chat about your experience?

If a call doesn't work for you, no worries. I've got a short survey you could fill out: https://bit.ly/sgyn-feedback.

I know you’re super busy, but your input will help us shape the direction of SuperGradients and make it as useful as possible for you.

I appreciate your time and feedback. Let me know what works for you.

Cheers,

Harpreet

Deci-AI / super-gradients

How to resume training on the previous checkpoint #1139
