PaddlePaddle / FastDeploy

⚡️An easy-to-use and fast deep learning model deployment toolkit for ☁️Cloud, 📱Mobile, and 📹Edge. Covers 20+ mainstream scenarios across image, video, text, and audio, and 150+ SOTA models with end-to-end optimization and multi-platform, multi-framework support.
https://www.paddlepaddle.org.cn/fastdeploy
Apache License 2.0

Python-side measured latency is significantly higher than the built-in runtime statistics #2536

Open ucsk opened 2 weeks ago

ucsk commented 2 weeks ago

Environment

Performance question

import time
import fastdeploy as fd
import numpy as np
import statistics

if __name__ == '__main__':
    option = fd.RuntimeOption()
    option.use_gpu(0)
    option.use_trt_backend()
    option.trt_option.enable_fp16 = True
    option.trt_option.set_shape('images', [1, 3, 640, 640], [1, 3, 640, 640], [40, 3, 640, 640])
    option.trt_option.serialize_file = 'weights/yolov8m.engine'
    model = fd.vision.detection.YOLOv8('weights/yolov8m.onnx', runtime_option=option)

    ims = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(20)]

    model.enable_record_time_of_runtime()
    costs = []
    for i in range(500):
        if i >= 100:  # treat the first 100 iterations as warmup
            begin = time.perf_counter()
        results = model.batch_predict(ims)
        if i >= 100:
            costs.append(time.perf_counter() - begin)
    model.print_statis_info_of_runtime()

    print(f'{int(1000 * statistics.mean(costs))}ms')
$ python benchmark.py 
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(719)::CreateTrtEngineFromOnnx    Detect serialized TensorRT Engine file in weights/yolov8m.engine, will load it directly.
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(108)::LoadTrtCache   Build TensorRT Engine from cache file: weights/yolov8m.engine with shape range information as below,
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(111)::LoadTrtCache   Input name: images, shape=[-1, 3, -1, -1], min=[1, 3, 640, 640], max=[40, 3, 640, 640]

[INFO] fastdeploy/runtime/runtime.cc(339)::CreateTrtBackend Runtime initialized with Backend::TRT in Device::GPU.
============= Runtime Statis Info(yolov8) =============
Total iterations: 500
Total time of runtime: 29.7184s.
Warmup iterations: 100
Total time of runtime in warmup step: 6.63981s.
Average time of runtime exclude warmup step: 57.6966ms.
118ms
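The gap between the two numbers above can be quantified directly: subtracting the engine-only mean reported by FastDeploy from the Python-side end-to-end mean gives the time spent outside the inference engine (pre-/post-processing plus Python overhead).

```python
# Numbers taken from the benchmark output above.
end_to_end_ms = 118.0    # Python-side mean (time.perf_counter around batch_predict)
engine_ms = 57.6966      # "Average time of runtime exclude warmup step"

overhead_ms = end_to_end_ms - engine_ms          # time outside the engine
share = overhead_ms / end_to_end_ms              # fraction of end-to-end latency

print(f"overhead: {overhead_ms:.1f}ms ({share:.0%} of end-to-end)")
# → overhead: 60.3ms (51% of end-to-end)
```

So in this run, roughly half of the end-to-end latency is spent in pre-/post-processing rather than in TensorRT itself.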
Jiang-Jia-Jun commented 2 weeks ago

The built-in statistics measure only the inference engine's time. The Python-side measurement includes data pre-/post-processing plus the inference engine's time.

ucsk commented 1 week ago

> The built-in statistics measure only the inference engine's time. The Python-side measurement includes data pre-/post-processing plus the inference engine's time.

Currently YOLOv8's preprocessor does not inherit from ProcessorManager, so CVCUDA acceleration is not supported.

After adapting that part of the code, how do I correctly replace the default preprocessing with CVCUDA on the Python side?

Is it enough to initialize the model and then call model.preprocessor.use_cuda(True, 0)?

model = fd.vision.detection.YOLOv8(...)
# default (no call): CPU preprocessing
# model.preprocessor.use_cuda(False, 0)  # CUDA preprocessing
model.preprocessor.use_cuda(True, 0)     # CVCUDA preprocessing