Closed lekashree-j closed 8 months ago
So your config has the following settings:
"initial_lr": 5e-4,
"lr_mode": "cosine",
"cosine_final_lr_ratio": 0.1,
"max_epochs": 100,
That means the LR will drop from initial_lr to cosine_final_lr_ratio times that value over the max_epochs period.
Now, suppose you resume the training and increase the number of epochs to 200. In this case the learning rate will still follow the cosine decay and will drop from initial_lr to 0.1x of that value, just over 200 epochs instead of 100. In other words, the final LR is still 1/10th of 5e-4, not 1/10th of the LR you resumed at.
So if you want a lower LR towards the end of training, you should set cosine_final_lr_ratio to a lower value.
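To make the arithmetic concrete, here is a minimal sketch of a cosine schedule with a final-LR ratio. It assumes the textbook cosine-annealing formula and ignores warmup. The function name cosine_lr is made up for illustration, and this is not the SuperGradients implementation, whose details may differ.

```python
import math

def cosine_lr(epoch, max_epochs, initial_lr=5e-4, final_lr_ratio=0.1):
    """Textbook cosine decay from initial_lr to initial_lr * final_lr_ratio.

    Assumed formula for illustration only; warmup is ignored.
    """
    progress = min(epoch / max_epochs, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return initial_lr * (final_lr_ratio + (1.0 - final_lr_ratio) * cosine)

# End of a 100-epoch schedule vs. end of a 200-epoch schedule: same final LR.
lr_end_100 = cosine_lr(100, 100)
lr_end_200 = cosine_lr(200, 200)
# Lowering the ratio is what lowers the final LR.
lr_end_200_small_ratio = cosine_lr(200, 200, final_lr_ratio=0.01)
print(lr_end_100, lr_end_200, lr_end_200_small_ratio)
```

Under this assumed formula both schedules end at about 5e-5 (0.1 x 5e-4), and only a smaller ratio such as 0.01 brings the final LR down to about 5e-6. The same formula puts a 200-epoch schedule at about 2.75e-4 at epoch 100, so it is worth checking the LR plot in Tensorboard after resuming to see which curve the run actually follows.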
A few thoughts on what you can also try:
1) Check the mAP plots in Tensorboard and see how the trend looks. Maybe the LR is too small and increasing it would actually help the model train faster.
2) Play with EMA. It is enabled now; try switching it off and see how it impacts the accuracy. Also try different EMA decay values: instead of 0.9, try 0.99 or 0.999 and see how that affects the results.
Tuning hyperparameters can be time-consuming, but unfortunately there is no universal training recipe that works on all datasets.
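As a rough illustration of why the decay value matters, the sketch below applies a plain constant-decay EMA, ema = decay * ema + (1 - decay) * value, to a noisy scalar. This is a toy stand-in for model weights, not the trainer's EMA code; ema_track, spread and the synthetic signal are made up for the example. A constant-decay EMA averages over roughly 1 / (1 - decay) updates, so 0.9 looks back about 10 steps and 0.999 about 1000.

```python
import random

def ema_track(values, decay):
    """Return the running constant-decay EMA of a sequence (toy example)."""
    ema = values[0]
    out = []
    for v in values:
        ema = decay * ema + (1.0 - decay) * v
        out.append(ema)
    return out

def spread(xs):
    """Max minus min, used here as a crude measure of how much the EMA wobbles."""
    return max(xs) - min(xs)

random.seed(0)
# Synthetic noisy signal around 1.0 (made-up data, for illustration only).
noisy = [1.0 + random.uniform(-0.5, 0.5) for _ in range(5000)]

for decay in (0.9, 0.99, 0.999):
    tail = ema_track(noisy, decay)[-1000:]
    window = 1.0 / (1.0 - decay)
    print(f"decay={decay}: approx window={window:.0f} steps, "
          f"spread of last 1000 values={spread(tail):.4f}")
```

The larger the decay, the less the averaged value moves from step to step, which is the smoothing effect to look for in the validation curves.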
Hi @BloodAxe, thanks for the clarification regarding the LR. I couldn't resume training for some reason. I waited 2.5 hours after resuming and then checked the log files: training had not started, although the code cell in my notebook was still running. Do you know what the issue is? Also, I tried switching EMA off, and there is not much change in the mAP@0.5 values.
These plots are from my initial training with 100 epochs: [LR plot from Tensorboard] [mAP@0.5 plot]
Do you mind showing how you "resumed" that training? A code snippet would help me understand what steps you took.
Yes! @BloodAxe Below is that code:
CHECKPOINT_DIR = 'checkpoints'
trainer = Trainer(experiment_name='version_1', ckpt_root_dir=CHECKPOINT_DIR)
best_model = models.get('yolo_nas_l',
                        num_classes=len(dataset_params['classes']),
                        checkpoint_path="/kaggle/input/yolo-nas-0-343/kaggle/working/checkpoints/version_1/RUN_20231227_111030_152840/ckpt_best.pth")
train_params = {
    'resume': True,
    # 'ckpt_name':'ckpt_latest.pth',
    # 'resume_strict_load': True,
    # ENABLING SILENT MODE
    'silent_mode': True,
    "average_best_models": True,
    "warmup_mode": "LinearEpochLRWarmup",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    # ONLY TRAINING FOR 10 EPOCHS FOR THIS EXAMPLE NOTEBOOK
    "max_epochs": 200,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        # NOTE: num_classes needs to be defined here
        num_classes=len(dataset_params['classes']),
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            # NOTE: num_classes needs to be defined here
            num_cls=len(dataset_params['classes']),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}
trainer.train(model=best_model,
              training_params=train_params,
              train_loader=train_data,
              valid_loader=val_data)
Console log from the above trainer.train code:
[2023-12-28 07:52:31] INFO - sg_trainer.py - Resuming training from latest run.
[2023-12-28 07:52:31] INFO - sg_trainer.py - Checkpoints directory: checkpoints/version_1/RUN_20231227_111030_152840
[2023-12-28 07:52:32] INFO - checkpoint_utils.py - Successfully loaded model weights from checkpoints/version_1/RUN_20231227_111030_152840/ckpt_latest.pth checkpoint.
[2023-12-28 07:52:32] WARNING - sg_trainer.py - [WARNING] Main network has been loaded from checkpoint but EMA network exists as well. It will only be loaded during validation when training with ema=True.
[2023-12-28 07:52:32] INFO - sg_trainer.py - Using EMA with params {'decay': 0.9, 'decay_type': 'threshold'}
The console stream is now moved to checkpoints/version_1/RUN_20231227_111030_152840/console_Dec28_07_52_32.txt
[2023-12-28 07:52:36] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
- Mode: Single GPU
- Number of GPUs: 1 (1 available on the machine)
- Full dataset size: 3515 (len(train_set))
- Batch size per GPU: 16 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 16 (num_gpus * batch_size)
- Effective Batch size: 16 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 219 (len(train_loader))
- Gradient updates per epoch: 219 (len(train_loader) / batch_accumulate)
- Model: YoloNAS_L (66.92M parameters, 66.92M optimized)
- Learning Rates and Weight Decays:
- default: (66.92M parameters). LR: 5.000000984100999e-05 (66.92M parameters) WD: 0.0, (84.73K parameters), WD: 0.0001, (66.83M parameters)
[2023-12-28 09:58:44] INFO - sg_trainer.py -
[MODEL TRAINING EXECUTION HAS BEEN INTERRUPTED]... Please wait until SOFT-TERMINATION process finishes and saves all of the Model Checkpoints and log files before terminating...
[2023-12-28 09:58:44] INFO - sg_trainer.py - For HARD Termination - Stop the process again
[2023-12-28 09:58:44] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process
I terminated it because it was running for over 2.5 hours and nothing was happening.
This is the log file for resuming training:
[2023-12-28 07:31:06] INFO - super_gradients.common.crash_handler.crash_tips_setup - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: coverage required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: sphinx required but not found
[2023-12-28 07:31:13] DEBUG - super_gradients.sanity_check.env_sanity_check - torchmetrics==1.2.1 does not satisfy requirement torchmetrics==0.8
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: hydra-core required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: omegaconf required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: onnxruntime required but not found
[2023-12-28 07:31:13] DEBUG - super_gradients.sanity_check.env_sanity_check - onnx==1.15.0 does not satisfy requirement onnx==1.13.0
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: pip-tools required but not found
[2023-12-28 07:31:13] DEBUG - super_gradients.sanity_check.env_sanity_check - pyparsing==3.0.9 does not satisfy requirement pyparsing==2.4.5
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: einops required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: pycocotools required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: treelib required but not found
[2023-12-28 07:31:13] DEBUG - super_gradients.sanity_check.env_sanity_check - termcolor==2.3.0 does not satisfy requirement termcolor==1.1.0
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: stringcase required but not found
[2023-12-28 07:31:13] DEBUG - super_gradients.sanity_check.env_sanity_check - numpy==1.24.3 does not satisfy requirement numpy<=1.23
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: json-tricks required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: onnx-simplifier required but not found
[2023-12-28 07:31:13] WARNING - super_gradients.sanity_check.env_sanity_check - Failed to verify installed packages: data-gradients required but not found
[2023-12-28 07:42:33] WARNING - super_gradients.training.sg_trainer.sg_trainer - [WARNING] Main network has been loaded from checkpoint but EMA network exists as well. It will only be loaded during validation when training with ema=True.
[2023-12-28 07:50:26] WARNING - super_gradients.training.sg_trainer.sg_trainer - [WARNING] Main network has been loaded from checkpoint but EMA network exists as well. It will only be loaded during validation when training with ema=True.
[2023-12-28 07:50:26] INFO - super_gradients.training.sg_trainer.sg_trainer - Using EMA with params {'decay': 0.9, 'decay_type': 'threshold'}
[2023-12-28 07:50:31] INFO - super_gradients.training.utils.sg_trainer_utils - TRAINING PARAMETERS:
- Mode: Single GPU
- Number of GPUs: 1 (1 available on the machine)
- Full dataset size: 3515 (len(train_set))
- Batch size per GPU: 16 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 16 (num_gpus * batch_size)
- Effective Batch size: 16 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 219 (len(train_loader))
- Gradient updates per epoch: 219 (len(train_loader) / batch_accumulate)
- Model: YoloNAS_L (66.92M parameters, 66.92M optimized)
- Learning Rates and Weight Decays:
- default: (66.92M parameters). LR: 5.000000984100999e-05 (66.92M parameters) WD: 0.0, (84.73K parameters), WD: 0.0001, (66.83M parameters)
[2023-12-28 07:50:31] INFO - super_gradients.training.sg_trainer.sg_trainer - RUNNING ADDITIONAL TEST ON THE AVERAGED MODEL...
[2023-12-28 07:50:49] INFO - super_gradients.training.sg_trainer.sg_trainer -
[MODEL TRAINING EXECUTION HAS BEEN INTERRUPTED]... Please wait until SOFT-TERMINATION process finishes and saves all of the Model Checkpoints and log files before terminating...
[2023-12-28 07:50:49] INFO - super_gradients.training.sg_trainer.sg_trainer - For HARD Termination - Stop the process again
[2023-12-28 07:50:49] INFO - super_gradients.common.sg_loggers.base_sg_logger - [CLEANUP] - Successfully stopped system monitoring process
[2023-12-28 07:52:31] INFO - super_gradients.training.sg_trainer.sg_trainer - Resuming training from latest run.
[2023-12-28 07:52:31] INFO - super_gradients.training.sg_trainer.sg_trainer - Checkpoints directory: checkpoints/version_1/RUN_20231227_111030_152840
[2023-12-28 07:52:32] INFO - super_gradients.training.utils.checkpoint_utils - Successfully loaded model weights from checkpoints/version_1/RUN_20231227_111030_152840/ckpt_latest.pth checkpoint.
[2023-12-28 07:52:32] DEBUG - super_gradients.training.utils.checkpoint_utils - Trying to load preprocessing params from checkpoint. Preprocessing params in checkpoint: True. Model YoloNAS_L inherit HasPredict: True
[2023-12-28 07:52:32] DEBUG - super_gradients.training.utils.checkpoint_utils - Successfully loaded preprocessing params from checkpoint {'class_names': ['Aortic enlargement', 'Atelectasis', 'Calcification', 'Cardiomegaly', 'Consolidation', 'ILD', 'Infiltration', 'Lung Opacity', 'Nodule/Mass', 'Other lesion', 'Pleural effusion', 'Pleural thickening', 'Pneumothorax', 'Pulmonary fibrosis'], 'image_processor': {'ComposeProcessing': {'processings': [<super_gradients.training.processing.processing.ReverseImageChannels object at 0x7b27d180fdf0>, <super_gradients.training.processing.processing.DetectionLongestMaxSizeRescale object at 0x7b27d180fb20>, <super_gradients.training.processing.processing.DetectionLongestMaxSizeRescale object at 0x7b27d180feb0>, <super_gradients.training.processing.processing.DetectionBottomRightPadding object at 0x7b27d180fc10>, <super_gradients.training.processing.processing.ImagePermute object at 0x7b27d180fa60>]}}, 'iou': 0.65, 'conf': 0.5}
[2023-12-28 07:52:32] WARNING - super_gradients.training.sg_trainer.sg_trainer - [WARNING] Main network has been loaded from checkpoint but EMA network exists as well. It will only be loaded during validation when training with ema=True.
Hey @BloodAxe any idea on how to resolve this?
The first thing that struck me is 'silent_mode': True. Why do you set it explicitly? It disables the console output and is not meant to be set by hand. It is used mostly in DDP mode to differentiate between main and non-main nodes. You simply should not specify it at all.
When using the resume feature you usually don't need to load any checkpoint beforehand. When enabled, Trainer will look for the latest run of the given experiment name and attempt to continue it. That means no additional run folder will be created; instead, checkpoints will be stored in the existing experiment directory (Tensorboard files will be appended).
So my suggestion is to remove silent_mode and double-check whether your trainer continues training from the epoch where you stopped.
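The suggestion above could look roughly like the sketch below. make_resume_params is a hypothetical helper, not part of SuperGradients; it only edits a plain dict so that silent_mode is dropped and resume is enabled. The commented-out lines show how the result would be passed to the Trainer from the snippet earlier in the thread, with the model created without checkpoint_path, on the assumption that resume picks up the latest run on its own.

```python
def make_resume_params(base_params, max_epochs):
    """Return a copy of base_params set up for resuming (hypothetical helper)."""
    params = dict(base_params)          # shallow copy; the original dict is left untouched
    params.pop("silent_mode", None)     # not meant to be set by hand
    params["resume"] = True             # continue the latest run of this experiment
    params["max_epochs"] = max_epochs   # new total number of epochs, not extra epochs
    return params

# Trimmed-down version of the original training params, for illustration.
old_params = {"silent_mode": True, "initial_lr": 5e-4, "lr_mode": "cosine",
              "cosine_final_lr_ratio": 0.1, "max_epochs": 100}
resume_params = make_resume_params(old_params, max_epochs=200)
print(resume_params)

# Usage with the objects from the thread (requires super_gradients, so left commented out):
# trainer = Trainer(experiment_name='version_1', ckpt_root_dir=CHECKPOINT_DIR)
# model = models.get('yolo_nas_l', num_classes=len(dataset_params['classes']))
# trainer.train(model=model, training_params=resume_params,
#               train_loader=train_data, valid_loader=val_data)
```

With console output back on, the first log lines after "Resuming training from latest run." should show which epoch the run continues from.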
Another suggestion, unrelated to resume: try increasing the EMA decay to higher values (0.99, 0.999) and changing decay_type to exp (and playing with the beta parameter under ema_params; a good starting value could be around 20 or so). This should smooth the training and make mAP oscillate less.
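For intuition about decay_type, the sketch below compares how the effective decay could ramp up during training. The two formulas are written from my reading of the library's decay schedules (threshold: min(decay, (1 + step) / (10 + step)); exp: decay * (1 - exp(-beta * step / total_steps))). Treat them as assumptions and check the EMA decay schedule code of your installed version. The function names are made up for the example, and one EMA update per training iteration is assumed.

```python
import math

def threshold_decay(decay, step):
    """Assumed 'threshold' schedule: ramps up quickly, then stays capped at decay."""
    return min(decay, (1 + step) / (10 + step))

def exp_decay(decay, step, total_steps, beta):
    """Assumed 'exp' schedule: approaches decay at a rate controlled by beta."""
    return decay * (1.0 - math.exp(-beta * step / total_steps))

# 300 epochs x 219 iterations per epoch, matching the run reported in this thread.
total_steps = 300 * 219

for frac in (0.0, 0.01, 0.05, 0.25, 1.0):
    step = int(frac * total_steps)
    print(f"{frac:>4.0%} of training: "
          f"threshold(0.9)={threshold_decay(0.9, step):.4f}  "
          f"exp(0.999, beta=20)={exp_decay(0.999, step, total_steps, 20):.4f}")
```

With these assumed formulas the threshold schedule reaches its cap of 0.9 after about 80 steps, whereas the exp schedule with beta=20 climbs gradually and gets close to 0.999 over roughly the first quarter of training. A larger beta makes that ramp shorter.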
Thanks @BloodAxe, trying out those suggestions now. Started training with yolo_nas_m.
The params used for yolo_nas_m:
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
train_params = {
    # ENABLING SILENT MODE
    # 'resume':True,
    # 'silent_mode': True,
    "average_best_models": True,
    "warmup_mode": "LinearEpochLRWarmup",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.999, "decay_type": "exp", "beta": 20},
    # ONLY TRAINING FOR 10 EPOCHS FOR THIS EXAMPLE NOTEBOOK
    "max_epochs": 300,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        # NOTE: num_classes needs to be defined here
        num_classes=len(dataset_params['classes']),
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            # NOTE: num_classes needs to be defined here
            num_cls=len(dataset_params['classes']),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}
Console:
[2023-12-29 10:49:20] INFO - sg_trainer.py - Starting a new run with `run_id=RUN_20231229_104920_025491`
[2023-12-29 10:49:20] INFO - sg_trainer.py - Checkpoints directory: checkpoints/version1/RUN_20231229_104920_025491
[2023-12-29 10:49:20] INFO - sg_trainer.py - Using EMA with params {'decay': 0.999, 'decay_type': 'exp', 'beta': 20}
The console stream is now moved to checkpoints/version1/RUN_20231229_104920_025491/console_Dec29_10_49_20.txt
[2023-12-29 10:49:22] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
- Mode: Single GPU
- Number of GPUs: 1 (1 available on the machine)
- Full dataset size: 3515 (len(train_set))
- Batch size per GPU: 16 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 16 (num_gpus * batch_size)
- Effective Batch size: 16 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 219 (len(train_loader))
- Gradient updates per epoch: 219 (len(train_loader) / batch_accumulate)
- Model: YoloNAS_M (51.14M parameters, 51.14M optimized)
- Learning Rates and Weight Decays:
- default: (51.14M parameters). LR: 0.0005 (51.14M parameters) WD: 0.0, (72.25K parameters), WD: 0.0001, (51.07M parameters)
[2023-12-29 10:49:22] INFO - sg_trainer.py - Started training for 300 epochs (0/299)
Train epoch 0: 100%|██████████| 219/219 [03:16<00:00, 1.11it/s, PPYoloELoss/loss=4.06, PPYoloELoss/loss_cls=1.97, PPYoloELoss/loss_dfl=1.01, PPYoloELoss/loss_iou=1.08, gpu_mem=14.4]
Validating: 100%|██████████| 55/55 [00:16<00:00, 3.26it/s]
[2023-12-29 10:52:58] INFO - base_sg_logger.py - Checkpoint saved in checkpoints/version1/RUN_20231229_104920_025491/ckpt_best.pth
[2023-12-29 10:52:58] INFO - sg_trainer.py - Best checkpoint overriden: validation mAP@0.50: 0.00011028622975572944
===========================================================
SUMMARY OF EPOCH 0
├── Train
│ ├── Ppyoloeloss/loss_cls = 1.9693
│ ├── Ppyoloeloss/loss_iou = 1.0825
│ ├── Ppyoloeloss/loss_dfl = 1.0115
│ └── Ppyoloeloss/loss = 4.0634
└── Validation
├── Ppyoloeloss/loss_cls = 1.9522
├── Ppyoloeloss/loss_iou = 1.0182
├── Ppyoloeloss/loss_dfl = 0.982
├── Ppyoloeloss/loss = 3.9524
├── Precision@0.50 = 0.0
├── Recall@0.50 = 0.0
├── Map@0.50 = 0.0001
└── F1@0.50 = 0.0
===========================================================
Train epoch 1: 100%|██████████| 219/219 [03:09<00:00, 1.16it/s, PPYoloELoss/loss=2.87, PPYoloELoss/loss_cls=1.37, PPYoloELoss/loss_dfl=0.755, PPYoloELoss/loss_iou=0.741, gpu_mem=14.5]
Validating epoch 1: 100%|██████████| 55/55 [00:21<00:00, 2.52it/s]
[2023-12-29 10:56:35] INFO - base_sg_logger.py - Checkpoint saved in checkpoints/version1/RUN_20231229_104920_025491/ckpt_best.pth
[2023-12-29 10:56:35] INFO - sg_trainer.py - Best checkpoint overriden: validation mAP@0.50: 0.11669479310512543
===========================================================
SUMMARY OF EPOCH 1
├── Train
│ ├── Ppyoloeloss/loss_cls = 1.3723
│ │ ├── Epoch N-1 = 1.9693 (↘ -0.597)
│ │ └── Best until now = 1.9693 (↘ -0.597)
│ ├── Ppyoloeloss/loss_iou = 0.7405
│ │ ├── Epoch N-1 = 1.0825 (↘ -0.342)
│ │ └── Best until now = 1.0825 (↘ -0.342)
│ ├── Ppyoloeloss/loss_dfl = 0.7552
│ │ ├── Epoch N-1 = 1.0115 (↘ -0.2563)
│ │ └── Best until now = 1.0115 (↘ -0.2563)
│ └── Ppyoloeloss/loss = 2.868
│ ├── Epoch N-1 = 4.0634 (↘ -1.1954)
│ └── Best until now = 4.0634 (↘ -1.1954)
└── Validation
├── Ppyoloeloss/loss_cls = 1.3076
│ ├── Epoch N-1 = 1.9522 (↘ -0.6446)
│ └── Best until now = 1.9522 (↘ -0.6446)
├── Ppyoloeloss/loss_iou = 0.6144
│ ├── Epoch N-1 = 1.0182 (↘ -0.4039)
│ └── Best until now = 1.0182 (↘ -0.4039)
├── Ppyoloeloss/loss_dfl = 0.6611
│ ├── Epoch N-1 = 0.982 (↘ -0.3208)
│ └── Best until now = 0.982 (↘ -0.3208)
├── Ppyoloeloss/loss = 2.5832
│ ├── Epoch N-1 = 3.9524 (↘ -1.3693)
│ └── Best until now = 3.9524 (↘ -1.3693)
├── Precision@0.50 = 0.025
│ ├── Epoch N-1 = 0.0 (↗ 0.025)
│ └── Best until now = 0.0 (↗ 0.025)
├── Recall@0.50 = 0.3619
│ ├── Epoch N-1 = 0.0 (↗ 0.3619)
│ └── Best until now = 0.0 (↗ 0.3619)
├── Map@0.50 = 0.1167
│ ├── Epoch N-1 = 0.0001 (↗ 0.1166)
│ └── Best until now = 0.0001 (↗ 0.1166)
└── F1@0.50 = 0.0354
├── Epoch N-1 = 0.0 (↗ 0.0354)
└── Best until now = 0.0 (↗ 0.0354)
===========================================================
Train epoch 2: 41%|████ | 90/219 [01:17<01:49, 1.17it/s, PPYoloELoss/loss=2.63, PPYoloELoss/loss_cls=1.27, PPYoloELoss/loss_dfl=0.7, PPYoloELoss/loss_iou=0.669, gpu_mem=12.9]
Closing issue as stale/resolved
💡 Your Question
Hello SG Team, my dataset consists of 35k images of size 512x512. I initially trained my yolo_nas_l model for 100 epochs. Even after 100 epochs, mAP@.5 is 0.343, so I decided to train for an additional 100 epochs. My question: the LR at epoch 0 (when I initially started training) is 5e-4, and after the 100th epoch my LR is 5.000000984100999e-05, i.e. 1/10th of it. It was gradually reduced according to the given training parameters. Now, when I resume training, it starts at LR 5.000000984100999e-05. At the end of the 200th epoch, will the LR be 1/10th of this number, or will it still be 1/10th of 5e-4?
My Initial Train Params:
My Train Params for resuming: