Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Inference Speed Calculation #1310

Closed HaoqianSong closed 1 year ago

HaoqianSong commented 1 year ago

🚀 Feature Request

How can I output the inference time for each image during inference, or the average inference time over all images, so that an FPS metric can be calculated?

Proposed Solution (Optional)

Report the per-image inference time, or the average inference time across all images, during prediction so that FPS can be computed.
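
For reference, the FPS figure is just the reciprocal of the average per-image inference time; a tiny sketch with assumed variable names:

    # FPS from an average per-image inference latency (illustrative variable names).
    avg_latency_ms = total_inference_time_ms / num_images
    fps = 1000.0 / avg_latency_ms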

BloodAxe commented 1 year ago

You may think you want to measure the FPS of native PyTorch inference, but in fact you don't.

Here's why: PyTorch eager execution mode is great for debugging and for understanding how a model works. However, it comes at the cost of increased inference latency.

PyTorch's native forward() does not apply any runtime optimizations or layer fusion.

Nor does it optimize memory usage by keeping intermediate buffers around to reduce VRAM fragmentation.

Even torch tracing would give better performance than a plain PyTorch model.

Measuring FPS under these conditions would give you very biased results. If your goal is maximum performance, the recommended way is to use TensorRT or ONNX Runtime, not a plain PyTorch model.
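
As an illustration only (not the project's official benchmarking procedure), here is a minimal sketch of timing an already-exported ONNX model with ONNX Runtime; the file name "model.onnx", the input shape, the providers, and the iteration counts are assumptions:

    # Minimal sketch: time an already-exported ONNX model with ONNX Runtime.
    # File name, input shape, providers, and iteration counts are illustrative assumptions.
    import time

    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    input_name = session.get_inputs()[0].name
    batch = np.random.rand(1, 3, 640, 640).astype(np.float32)

    # Warm-up runs are excluded from timing; the first runs include lazy initialization.
    for _ in range(20):
        session.run(None, {input_name: batch})

    n_runs = 200
    start = time.perf_counter()
    for _ in range(n_runs):
        session.run(None, {input_name: batch})
    elapsed = time.perf_counter() - start

    print(f"avg latency: {elapsed / n_runs * 1e3:.2f} ms | FPS: {n_runs / elapsed:.1f}")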

HaoqianSong commented 1 year ago

OK, thanks! How can I output the inference latency of the model? For example, during YOLO inference:

    with dt[0]:
        im = torch.from_numpy(im).to(model.device)
        im = im.half() if model.fp16 else im.float()  # uint8 to fp16/32
        im /= 255  # 0 - 255 to 0.0 - 1.0
        if len(im.shape) == 3:
            im = im[None]  # expand for batch dim

    # Inference
    with dt[1]:
        visualize = increment_path(save_dir / Path(path).stem, mkdir=True) if visualize else False
        pred = model(im, augment=augment, visualize=visualize)

    # NMS
    with dt[2]:
        pred = non_max_suppression(pred, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)

LOGGER.info(f"{s}{'' if len(det) else '(no detections), '}{dt[1].dt 1E3:.1f}ms") t = tuple(x.t / seen 1E3 for x in dt) # speeds per image LOGGER.info(f'Speed: %.1fms pre-process, %.1fms inference, %.1fms NMS per image at shape {(1, 3, *imgsz)}' % t)

The output:

    image 97/98 D:\DetectionAlgorithm\DataSet\YOLO\YOLOinput\images\test\931898198569996288_0.jpg: 384x640 4 persons, 26.0ms
    image 98/98 D:\DetectionAlgorithm\DataSet\YOLO\YOLOinput\images\test\96.jpg: 448x640 2 persons, 26.9ms
    Speed: 0.5ms pre-process, 30.0ms inference, 1.8ms NMS per image at shape (1, 3, 640, 640)
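
For the per-image numbers asked about here, a minimal eager-PyTorch sketch, assuming a CUDA device, an already-loaded `model`, and an `images` iterable of preprocessed tensors; the synchronization calls are what make the GPU timings meaningful:

    # Minimal sketch: per-image latency of a model in eager PyTorch on a CUDA device.
    # `model` and `images` (preprocessed (1, 3, H, W) tensors) are illustrative assumptions.
    import time

    import torch

    model = model.eval().cuda()
    latencies = []

    with torch.no_grad():
        for im in images:
            im = im.cuda(non_blocking=True)
            torch.cuda.synchronize()   # make sure the GPU is idle before starting the clock
            start = time.perf_counter()
            _ = model(im)
            torch.cuda.synchronize()   # wait for the forward pass to finish
            latencies.append(time.perf_counter() - start)

    avg_s = sum(latencies) / len(latencies)
    print(f"avg inference: {avg_s * 1e3:.1f} ms/image | FPS: {1.0 / avg_s:.1f}")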

BloodAxe commented 1 year ago

Please follow our docs describing how to properly benchmark the model: https://docs.deci.ai/super-gradients/documentation/source/BenchmarkingYoloNAS.html

siddagra commented 11 months ago

> You may think you want to measure the FPS of native PyTorch inference, but in fact you don't.
>
> Here's why: PyTorch eager execution mode is great for debugging and for understanding how a model works. However, it comes at the cost of increased inference latency.
>
> PyTorch's native forward() does not apply any runtime optimizations or layer fusion.
>
> Nor does it optimize memory usage by keeping intermediate buffers around to reduce VRAM fragmentation.
>
> Even torch tracing would give better performance than a plain PyTorch model.
>
> Measuring FPS under these conditions would give you very biased results. If your goal is maximum performance, the recommended way is to use TensorRT or ONNX Runtime, not a plain PyTorch model.

I had a doubt then: did you guys use TensorRT/ONNX for your benchmarks of the other models (YOLOv8, YOLOv5, YOLOv6, YOLOv7, etc.) in your comparison at https://github.com/Deci-AI/super-gradients/raw/master/documentation/source/images/yolo_nas_frontier.png? Because if not, could they not potentially be faster than YOLO-NAS with TensorRT optimizations?

BloodAxe commented 11 months ago

Have no doubt - we used TensorRT for the GPU benchmarks and OpenVINO for the CPU benchmarks.

siddagra commented 11 months ago

Thank you for the information :). I would love to try it on some images using TensorRT, but it is proving difficult to do so.
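
One lower-friction way to try this (my assumption, not the maintainers' recommendation) is to run an exported ONNX model through ONNX Runtime's TensorRT execution provider, which builds the TensorRT engine on the first run; the timing loop from the earlier sketch then works unchanged:

    # Minimal sketch: let ONNX Runtime build and run a TensorRT engine for "model.onnx".
    # File name and provider order are illustrative assumptions; requires onnxruntime-gpu
    # built with TensorRT support.
    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    )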