YOLO NAS: inference time depending on training?

ullsen commented 5 months ago

Hello everybody and first of all thanks to the super_gradients team for the great work!

I have a theory question on the YOLO NAS object detection models. I am training on a custom data set and check the CPU inference time after training. I observed that longer model training leads to larger inference times, which surprises me a lot. I thought the main aspect of inference is the number of weights, which should be the same for a given model size? Instead, I see an increasing inference time the longer I train the model. Despite the different times, the model sizes remain more or less the same.

For training the models I used an identical setup (same input image size, model size, pre-processing pipeline, training hardware, etc.) with different numbers of epochs (I tried both training from scratch and resuming a model by loading weights). I tried the medium and the large model and saw the same behaviour. After converting the models to ONNX I see the same effect.

Let me know if you need more details and thanks for the help!


ofrimasad commented 5 months ago

Hi @ullsen .

Thank you for your kind words.

My main suspect here is NMS.

If you are exporting your model with NMS (the default behavior, which can be overridden), then you are benchmarking the NMS along with the model.

The NMS is an iterative algorithm that starts with thresholding and sorting. When the model is better trained, more predictions will get higher confidence, and more predictions will cross the threshold and be included in the sorting part and the iterative part of the NMS.
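To make the three stages above concrete, here is a minimal pure-Python sketch of greedy NMS that also counts how many IoU comparisons it performs. It is not the super_gradients implementation; the function names, the box format (x1, y1, x2, y2) and the example scores are assumptions chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, confidence_threshold, iou_threshold):
    """Return (kept indices, number of IoU comparisons performed)."""
    # Step 1: thresholding. Only predictions at or above the threshold survive.
    candidates = [i for i, s in enumerate(scores) if s >= confidence_threshold]
    # Step 2: sorting by descending confidence.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    # Step 3: iterative suppression of boxes overlapping the current best box.
    kept, comparisons = [], 0
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        remaining = []
        for i in candidates:
            comparisons += 1
            if iou(boxes[best], boxes[i]) <= iou_threshold:
                remaining.append(i)
        candidates = remaining
    return kept, comparisons


# Three well-separated boxes, so nothing is suppressed.
boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
early_scores = [0.6, 0.2, 0.1]  # hypothetical confidences early in training
late_scores = [0.9, 0.8, 0.7]   # hypothetical confidences late in training

kept_early, work_early = greedy_nms(boxes, early_scores, 0.5, 0.7)
kept_late, work_late = greedy_nms(boxes, late_scores, 0.5, 0.7)
print(len(kept_early), work_early)  # 1 box kept, 0 comparisons
print(len(kept_late), work_late)    # 3 boxes kept, 3 comparisons
```

The boxes are identical in both calls; only the confidences differ, and the amount of work in the sorting and suppression stages grows with the number of predictions that cross the threshold.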

To validate this assumption, you can export with the following flag:

model.export(... , postprocessing=False, ...)

You can also try calibrating the following params of the export function:

num_pre_nms_predictions: (int) Number of predictions to keep before NMS.
nms_threshold: (float) NMS threshold for the exported model.
confidence_threshold: (float) Confidence threshold for the exported model.
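As a rough illustration of how two of these parameters bound the work handed to NMS, the sketch below filters by confidence and then caps the candidate count. The helper `select_candidates` is hypothetical, and the order of operations (threshold first, then top-k) is an assumption; the exported graph may apply them differently.

```python
def select_candidates(scores, confidence_threshold, num_pre_nms_predictions):
    """Indices of the predictions that would be handed to NMS (illustrative only)."""
    # Drop low-confidence predictions.
    above = [i for i, s in enumerate(scores) if s >= confidence_threshold]
    # Keep at most num_pre_nms_predictions of the most confident ones.
    above.sort(key=lambda i: scores[i], reverse=True)
    return above[:num_pre_nms_predictions]


scores = [0.05, 0.9, 0.3, 0.6, 0.75]  # made-up confidences

print(select_candidates(scores, 0.25, 1000))  # [1, 4, 3, 2]
print(select_candidates(scores, 0.25, 2))     # [1, 4]
print(select_candidates(scores, 0.7, 1000))   # [1, 4]
```

Either a higher confidence threshold or a lower cap shrinks the candidate list, which is why tuning them should change the measured time if NMS is the bottleneck.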

Hope that helps.

ullsen commented 5 months ago

Hi @ofrimasad and thanks for the quick reply!

I tried both of your suggestions, but unfortunately there is no difference to observe.

Is the model architecture adapted during training, or does NAS refer to an architecture optimisation prior to training?

I am open to further tests. ullsen

shaydeci commented 5 months ago

@ullsen the model architecture is not adapted during training. There should be no difference in the time the feed-forward pass takes. However, as @ofrimasad stated, the number of predictions might vary throughout training, which might explain differences in forward pass times. It would be helpful if you could share your code and environment details.
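One way to check this is to time the forward pass and the post-processing separately for each checkpoint, on the same input. The helper below is a generic sketch using only the standard library; `benchmark` and the two stand-in workloads are hypothetical and would be replaced by calls to the actual exported models.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=30):
    """Median wall-clock time of fn() in milliseconds."""
    # Warm-up calls are not timed (caches, lazy initialisation, etc.).
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-ins for the real workloads; replace with inference calls on the
# model exported without and with post-processing.
def forward_only():
    sum(i * i for i in range(10_000))


def forward_plus_postprocessing():
    forward_only()
    sorted(range(5_000), reverse=True)


print(f"forward only:           {benchmark(forward_only):.3f} ms")
print(f"forward + postprocess:  {benchmark(forward_plus_postprocessing):.3f} ms")
```

If the forward-only timings match across checkpoints while the combined timings diverge, the difference comes from post-processing rather than from the network itself.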

ullsen commented 5 months ago

sure! I am using a notebook instance on AWS with the pytorch conda environment.


yolo code

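import datetime
# the timestamp expression was lost from the original post; datetime.datetime.now() is assumed
print('\nsuccessful start at ', datetime.datetime.now(), '\n')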

from libs_yolo import *


resumePath = r'.../ckpt_latest.pth'

trainShape = [960, 1280]  #[768, 1024]#[960, 1280]#[480, 640]#/4*3
scoreThresh = 0.5

ROOT_DIR = '...'
train_imgs_dir = 'images_split/train'
train_labels_dir = 'labels_split/train'
val_imgs_dir = 'images_split/validate'
val_labels_dir = 'labels_split/validate'

DEVICE = 'cuda'  # if torch.cuda.is_available() else 'cpu'
classes = ['className']
model_to_train = 'yolo_nas_m'

addSamples = 2  #, 11
lr, opt = 'CosineLRScheduler', 'Adam'
transformSettings = {'degrees': 20, 'translate': 0.2, 'scales': 0.2, 'shear': 10, 'target_size': trainShape[::-1]}

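# the date expression was lost from the original post; datetime.datetime.now().strftime(...) is assumed
CHECKPOINT_DIR = os.path.join(ROOT_DIR, datetime.datetime.now().strftime('%Y%m%d') + '_ckpts')  #'checkpoints'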

trainer = Trainer(
    experiment_name=model_to_train,
    ckpt_root_dir=CHECKPOINT_DIR
)

transformsTrain = [
    DetectionTargetsFormatTransform(input_dim=(trainShape), output_format="LABEL_CXCYWH"),
    DetectionRandomAffine(**transformSettings),
    DetectionHSV(prob=.5),
    DetectionHorizontalFlip(prob=.5),
    DetectionStandardize(max_value=255),
]

transformsValidate = [
    DetectionTargetsFormatTransform(input_dim=(trainShape), output_format="LABEL_CXCYWH"),
    DetectionStandardize(max_value=255),
]

train_params = {
    'silent_mode': True,
    "average_best_models": True,
    "warmup_mode": 'LinearEpochLRWarmup',  #"linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": lr,
    'lr_updates': [10, 20, 40, 70],  # epochs at which lr is changed. applicable for
    'lr_decay_factor': 5e-4,  # size of changed lr
    "cosine_final_lr_ratio": 0.1,
    "optimizer": opt,  #'Adam','SGD','RMSProp'
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "max_epochs": EPOCHS,
    'save_ckpt_epoch_list': [25, 50, 75, 100, 125],
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=len(classes),
        reg_max=16
    ),
    "train_metrics_list": [
        DetectionMetrics_050(
            score_thres=scoreThresh,
            top_k_predictions=300,
            num_cls=len(classes),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7
            )
        )
    ],
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=scoreThresh,
            top_k_predictions=300,
            num_cls=len(classes),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50',
    'phase_callbacks': [logTrainingPerformance(trainer)]
}

train_data = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': ROOT_DIR,
        'images_dir': train_imgs_dir,
        'labels_dir': train_labels_dir,
        'classes': classes,
        'input_dim': trainShape,
        'transforms': transformsTrain
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

val_data = coco_detection_yolo_format_val(
    dataset_params={
        'data_dir': ROOT_DIR,
        'images_dir': val_imgs_dir,
        'labels_dir': val_labels_dir,
        'classes': classes,
        'input_dim': trainShape,
        'transforms': transformsValidate
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

for k in train_data.dataset.transforms:
    k.additional_samples_count = addSamples

if resumePath:
    print('\nresuming training from %s \n' % resumePath)
    model = models.get(model_to_train, num_classes=len(classes), checkpoint_path=resumePath).to(DEVICE)
else:
    print('\ntraining model from scratch\n')
    model = models.get(model_to_train, num_classes=len(classes)).to(DEVICE)

trainer.train(
    model=model,
    training_params=train_params,
    train_loader=train_data,
    valid_loader=val_data
)

del model, trainer, train_data, val_data
gc.collect()
torch.cuda.empty_cache()


I tested model exports with postprocessing=False and also called the model without NMS. Are you suggesting that, during training, the NMS settings still influence the model, e.g. through the maximum number of allowed detections per image? In that case, the settings of the post-prediction callback should be changed:

PPYoloEPostPredictionCallback(score_threshold=0.01, nms_top_k=1000, max_predictions=300,nms_threshold=0.7)

thanks for the support!