FeiYull / TensorRT-Alpha

TensorRT for YOLOv8, YOLOv8-Pose, YOLOv8-Seg, YOLOv8-Cls, YOLOv7, YOLOv6, YOLOv5, YOLONAS... CUDA IS ALL YOU NEED.
GNU General Public License v2.0

Preprocess takes more than 20 ms on Jetson Nano, batch=1 #109

Open · JinRanYAO opened this issue 3 months ago

JinRanYAO commented 3 months ago

Hello, thank you for your excellent work. I am trying to use your yolov8-pose code on a Jetson Nano for real-time detection. I set batch=1 and image shape = 640(h)x384(w). I get correct results, but inference costs 40+ ms and preprocessing costs 20+ ms. I think preprocessing takes too long. Is something wrong, and is there anything I can do to optimize it?

FeiYull commented 3 months ago

@JinRanYAO Is the data you're testing a picture or a video?

FeiYull commented 3 months ago

@JinRanYAO Try the following command to build an FP16-quantized engine; it improves inference performance by about 100%:

./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-fp16.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --fp16

【FP32】:
[04/07/2024-09:15:16] [I] preprocess time = 0.841472; infer time = 5.80734; postprocess time = 0.186192
[04/07/2024-09:15:16] [I] preprocess time = 0.837504; infer time = 5.76032; postprocess time = 0.13976
[04/07/2024-09:15:16] [I] preprocess time = 0.845184; infer time = 5.75726; postprocess time = 0.209248
[04/07/2024-09:15:16] [I] preprocess time = 0.839952; infer time = 5.76222; postprocess time = 0.170016
[04/07/2024-09:15:16] [I] preprocess time = 0.844816; infer time = 5.76472; postprocess time = 0.146288
[04/07/2024-09:15:16] [I] preprocess time = 0.838784; infer time = 5.76434; postprocess time = 0.203216
[04/07/2024-09:15:16] [I] preprocess time = 0.808864; infer time = 5.5223; postprocess time = 0.150368
[04/07/2024-09:15:16] [I] preprocess time = 0.811856; infer time = 5.52139; postprocess time = 0.184
[04/07/2024-09:15:16] [I] preprocess time = 0.80856; infer time = 5.52371; postprocess time = 0.20792
[04/07/2024-09:15:16] [I] preprocess time = 0.809776; infer time = 5.51814; postprocess time = 0.168032
[04/07/2024-09:15:16] [I] preprocess time = 0.810064; infer time = 5.5215; postprocess time = 0.208496
[04/07/2024-09:15:16] [I] preprocess time = 0.811216; infer time = 5.51797; postprocess time = 0.201968
[04/07/2024-09:15:16] [I] preprocess time = 0.809136; infer time = 5.51658; postprocess time = 0.179296

【FP16】:
[04/07/2024-09:15:26] [I] preprocess time = 0.84056; infer time = 2.59362; postprocess time = 0.177744
[04/07/2024-09:15:26] [I] preprocess time = 0.84752; infer time = 2.43448; postprocess time = 0.132512
[04/07/2024-09:15:26] [I] preprocess time = 0.840256; infer time = 2.42754; postprocess time = 0.206288
[04/07/2024-09:15:26] [I] preprocess time = 0.841216; infer time = 2.43272; postprocess time = 0.160144
[04/07/2024-09:15:26] [I] preprocess time = 0.840736; infer time = 2.42774; postprocess time = 0.137648
[04/07/2024-09:15:26] [I] preprocess time = 0.841296; infer time = 2.4313; postprocess time = 0.194464
[04/07/2024-09:15:26] [I] preprocess time = 0.840992; infer time = 2.43011; postprocess time = 0.149072
[04/07/2024-09:15:26] [I] preprocess time = 0.83664; infer time = 2.43083; postprocess time = 0.184176
[04/07/2024-09:15:26] [I] preprocess time = 0.841136; infer time = 2.4283; postprocess time = 0.20736
[04/07/2024-09:15:26] [I] preprocess time = 0.844864; infer time = 2.4312; postprocess time = 0.165424
[04/07/2024-09:15:26] [I] preprocess time = 0.842; infer time = 2.42846; postprocess time = 0.207552
[04/07/2024-09:15:26] [I] preprocess time = 0.8444; infer time = 2.43054; postprocess time = 0.203488
[04/07/2024-09:15:26] [I] preprocess time = 0.84024; infer time = 2.43106; postprocess time = 0.179952

JinRanYAO commented 3 months ago

@FeiYull Thank you for your quick reply!

  1. My project is built on ROS, so I don't use utils::InputStream. I call yolov8.init() at startup, and when I receive an image I run the following code for each frame. I think this is equivalent to using utils::InputStream::IMAGE? Is this code reasonable, or can anything be improved?

    imgs_batch.emplace_back(frame.clone());
    yolov8.copy(imgs_batch);
    utils::DeviceTimer d_t1; yolov8.preprocess(imgs_batch);  float t1 = d_t1.getUsedTime();
    utils::DeviceTimer d_t2; yolov8.infer();                 float t2 = d_t2.getUsedTime();
    utils::DeviceTimer d_t3; yolov8.postprocess(imgs_batch); float t3 = d_t3.getUsedTime();
    float avg_times[3] = { t1, t2, t3 };
    sample::gLogInfo << "preprocess time = " << avg_times[0] << "; "
        "infer time = " << avg_times[1] << "; "
        "postprocess time = " << avg_times[2] << std::endl;
    yolov8.reset();
    imgs_batch.clear();

  2. Thanks, I tried FP16: the inference time decreased from 40 ms to 30 ms, but preprocessing still takes 20 ms. Can I use INT8 to get faster? (See the sketch after this list.)
  3. Additionally, my raw image size is 1920x1080. Is too much time being spent on resize?
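
For reference on point 2, trtexec can also build an INT8 engine; a minimal sketch, assuming the same ONNX file and shapes as the FP16 command above. Note that without a calibration cache (--calib) TensorRT has no real dynamic ranges and accuracy usually suffers, and the Nano's Maxwell GPU has little hardware INT8 support, so the speedup there may be small:

./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --int8 --fp16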
FeiYull commented 3 months ago

@JinRanYAO It is recommended to step into the function YOLOv8Pose::preprocess and measure the time overhead of each internal stage.

void YOLOv8Pose::preprocess(const std::vector<cv::Mat>& imgsBatch)

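For reference, a minimal self-contained sketch of this kind of per-stage timing with CUDA events; the stage kernel below is only a placeholder standing in for the real resizeDevice / bgr2rgbDevice / normDevice / hwc2chwDevice calls, and the tensor size and launch shape are assumptions:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for one preprocessing stage.
__global__ void stage(float* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 0.5f;
}

int main()
{
    const int n = 640 * 384 * 3;           // one input tensor's element count
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    cudaEventRecord(beg);
    stage<<<(n + 255) / 256, 256>>>(d, n); // time one stage in isolation
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("stage time = %f ms\n", ms);

    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaFree(d);
    return 0;
}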
JinRanYAO commented 3 months ago

@FeiYull It seems that resize, bgr2rgb, norm, and hwc2chw each cost almost the same time, about 5 ms per stage. Could I use the similar functions in OpenCV when I receive the image, instead of running these operations here?

FeiYull commented 3 months ago

@JinRanYAO You can merge the following operations into one:

  1. resizeDevice
  2. bgr2rgbDevice
  3. normDevice

Inside the CUDA kernel function called by resizeDevice, modify the following:

[modify before] https://github.com/FeiYull/TensorRT-Alpha/blob/bca9575229ef5f6fe4c5acf51c1bd3c7e5959ec6/utils/kernel_function.cu#L142

[modify after]

// pdst[0] = c0;
// pdst[1] = c1;
// pdst[2] = c2;

// bgr2rgb
pdst[0] = c2;
pdst[1] = c1;
pdst[2] = c0;

// normalization
// float scale = 255.f;
// float means[3] = { 0.f, 0.f, 0.f };
// float stds[3] = { 1.f, 1.f, 1.f };
pdst[0] = (pdst[0] / scale - means[0]) / stds[0];
pdst[1] = (pdst[1] / scale - means[1]) / stds[1];
pdst[2] = (pdst[2] / scale - means[2]) / stds[2];
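
The gain from this merge comes from memory traffic: the separate resizeDevice, bgr2rgbDevice, and normDevice kernels each make a full pass over the image in global memory, while the fused kernel reads and writes every pixel exactly once, which matters on the Nano's limited memory bandwidth.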

JinRanYAO commented 3 months ago

@FeiYull Thanks for your advice; the preprocess time decreased to 8 ms after merging resize, bgr2rgb, and norm into one kernel. Then I resize the image to the engine input size when it is received, and use the same src_size and dst_size in yolov8-pose. Finally, I simplified the preprocess code by removing the affine matrix and interpolation to save more time. Here is my code now.

// Fused BGR->RGB + normalization kernel. Assumes src and dst have the same
// width and height (the frame is already resized to the network input size
// when received), so no interpolation or affine transform is needed;
// padding_value is kept in the signature but is no longer used.
__global__
void resize_rgb_padding_device_kernel(unsigned char* src, int src_width, int src_height, int src_area, int src_volume,
        float* dst, int dst_width, int dst_height, int dst_area, int dst_volume,
        int batch_size, float padding_value, float inv_scale)
{
    int dx = blockDim.x * blockIdx.x + threadIdx.x; // pixel index within one image
    int dy = blockDim.y * blockIdx.y + threadIdx.y; // image index within the batch

    if (dx < dst_area && dy < batch_size)
    {
        int dst_y = dx / dst_width;
        int dst_x = dx % dst_width;

        // Source pixel (uchar, BGR), read at the same coordinates as the destination.
        unsigned char* v = src + dy * src_volume + dst_y * src_width * 3 + dst_x * 3;

        // Destination pixel (float, RGB, interleaved HWC layout).
        float* pdst = dst + dy * dst_volume + dst_y * dst_width * 3 + dst_x * 3;
        pdst[0] = (v[2] + 0.5f) * inv_scale; // R, scaled by inv_scale = 1/255
        pdst[1] = (v[1] + 0.5f) * inv_scale; // G
        pdst[2] = (v[0] + 0.5f) * inv_scale; // B
    }
}

After simplifying, the preprocess time decreased to about 6 ms, with correct inference results. Is this code all right, or can anything be improved?
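
For what it's worth, a hedged sketch of how a kernel with this indexing would be launched: one thread per output pixel, with grid.y covering the batch. The buffer names, the 640(h)x384(w) size, and the unused 114.f padding value are assumptions, not code from the thread:

#include <cuda_runtime.h>

int main()
{
    const int w = 384, h = 640, batch = 1;   // network input size from the thread
    unsigned char* d_src = nullptr;
    float* d_dst = nullptr;
    cudaMalloc(&d_src, batch * w * h * 3 * sizeof(unsigned char));
    cudaMalloc(&d_dst, batch * w * h * 3 * sizeof(float));

    dim3 block(256, 1);                                 // one thread per pixel
    dim3 grid((w * h + block.x - 1) / block.x, batch);  // grid.y covers the batch
    resize_rgb_padding_device_kernel<<<grid, block>>>(
        d_src, w, h, w * h, w * h * 3,
        d_dst, w, h, w * h, w * h * 3,
        batch, 114.f, 1.f / 255.f);
    cudaDeviceSynchronize();

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}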