PaddlePaddle / FastDeploy

⚡️An easy-to-use and fast deep learning model deployment toolkit for ☁️Cloud, 📱Mobile, and 📹Edge. Covers 20+ mainstream scenarios across image, video, text, and audio, and 150+ SOTA models with end-to-end optimization and multi-platform, multi-framework support.
https://www.paddlepaddle.org.cn/fastdeploy
Apache License 2.0

Python-side measured latency is significantly higher than the built-in runtime statistics #2536

Open ucsk opened 2 weeks ago

ucsk commented 2 weeks ago

Environment

Performance question

import time
import fastdeploy as fd
import numpy as np
import statistics

if __name__ == '__main__':
    option = fd.RuntimeOption()
    option.use_gpu(0)
    option.use_trt_backend()
    option.trt_option.enable_fp16 = True
    option.trt_option.set_shape('images', [1, 3, 640, 640], [1, 3, 640, 640], [40, 3, 640, 640])
    option.trt_option.serialize_file = 'weights/yolov8m.engine'
    model = fd.vision.detection.YOLOv8('weights/yolov8m.onnx', runtime_option=option)

    ims = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(20)]

    model.enable_record_time_of_runtime()
    costs = []
    for i in range(500):
        if i >= 100:  # treat the first 100 iterations as warmup
            begin = time.perf_counter()
        results = model.batch_predict(ims)
        if i >= 100:
            costs.append(time.perf_counter() - begin)
    model.print_statis_info_of_runtime()

    print(f'{int(1000 * statistics.mean(costs))}ms')
$ python benchmark.py 
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(719)::CreateTrtEngineFromOnnx    Detect serialized TensorRT Engine file in weights/yolov8m.engine, will load it directly.
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(108)::LoadTrtCache   Build TensorRT Engine from cache file: weights/yolov8m.engine with shape range information as below,
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(111)::LoadTrtCache   Input name: images, shape=[-1, 3, -1, -1], min=[1, 3, 640, 640], max=[40, 3, 640, 640]

[INFO] fastdeploy/runtime/runtime.cc(339)::CreateTrtBackend Runtime initialized with Backend::TRT in Device::GPU.
============= Runtime Statis Info(yolov8) =============
Total iterations: 500
Total time of runtime: 29.7184s.
Warmup iterations: 100
Total time of runtime in warmup step: 6.63981s.
Average time of runtime exclude warmup step: 57.6966ms.
118ms
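The gap between the two numbers above can be quantified directly: subtracting the engine-only mean reported by FastDeploy from the Python-side end-to-end mean gives the time spent outside the inference engine (pre-/post-processing plus Python overhead).

```python
# Numbers taken from the benchmark output above.
end_to_end_ms = 118.0    # Python-side mean (time.perf_counter around batch_predict)
engine_ms = 57.6966      # "Average time of runtime exclude warmup step"

overhead_ms = end_to_end_ms - engine_ms          # time outside the engine
share = overhead_ms / end_to_end_ms              # fraction of end-to-end latency

print(f"overhead: {overhead_ms:.1f}ms ({share:.0%} of end-to-end)")
# → overhead: 60.3ms (51% of end-to-end)
```

So in this run, roughly half of the end-to-end latency is spent in pre-/post-processing rather than in TensorRT itself.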
Jiang-Jia-Jun commented 2 weeks ago

The built-in statistics measure only the inference engine's time. The Python-side measurement includes data pre-/post-processing plus the inference engine's time.

ucsk commented 1 week ago

> The built-in statistics measure only the inference engine's time. The Python-side measurement includes data pre-/post-processing plus the inference engine's time.

Currently YOLOv8's preprocessor does not inherit from ProcessorManager, so CVCUDA acceleration is not supported.

After adapting that part of the code, how do I correctly replace the default preprocessing with CVCUDA on the Python side?

Is it enough to initialize the model and then call model.preprocessor.use_cuda(True, 0)?

model = fd.vision.detection.YOLOv8(...)
# default (no call): CPU preprocessing
# model.preprocessor.use_cuda(False, 0)  # CUDA preprocessing
model.preprocessor.use_cuda(True, 0)     # CVCUDA preprocessing