NVIDIA-AI-IOT / yolo_deepstream

YOLO model QAT and deployment with DeepStream & TensorRT

Performance report for tensorrt_yolov4 #8

Open jstumpin opened 3 years ago

jstumpin commented 3 years ago

As an extension to the preliminary benchmark for tensorrt_yolov4, batch inference performance is provided as follows:

| repo. | batch=1 | batch=2 | batch=4 | batch=8 |
| --- | --- | --- | --- | --- |
| tkDNN | N/A (N/A) 207.81 | N/A | N/A (N/A) 443.32 | N/A |
| isarsoft | 7.96 (N/A) 125.4 | N/A | 21.0 (N/A) 189.6 | 38.3 (N/A) 208.0 |
| this | 7.023 (2.61747) 120.831 | 4.393 (1.76344) 186.44 | 3.688 (1.26853) 223.68 | 3.42267 (0.888971) 239.063 |

where the metrics are formatted as: wall-time in ms (standard deviation of wall-time) frames per second. Wall-time covers only pre-processing + inference + post-processing, while FPS is calculated over the end-to-end process, from image acquisition to image overlay, without display.

For fairness, AlexeyAB's results are excluded since they do not include FP16 numbers. While all repositories use a 320x320 input size and FP16 precision, the accompanying repositories should not be compared directly, as each uses its own metrics. Moreover, both of them use an NVIDIA GeForce RTX 2080 Ti, whereas for this repository I am using an NVIDIA GeForce RTX 2070.
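For context, the "FP16 precision" setting refers to how each TensorRT engine is built. A minimal sketch of the relevant build flags (TensorRT 7-era API; illustrative only, not taken from this repository's build code):

#include "NvInfer.h"

// Build-time settings implied by the benchmark: FP16 kernels enabled and a
// maximum (implicit) batch size of 8. Network definition, logger, and error
// handling are omitted.
void configureBuilder(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config)
{
    builder->setMaxBatchSize(8);                   // largest batch benchmarked
    config->setFlag(nvinfer1::BuilderFlag::kFP16); // allow FP16 precision
    config->setMaxWorkspaceSize(1ULL << 30);       // 1 GiB of builder scratch
}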

AlexeyAB commented 3 years ago

@jstumpin Hi,

Are all these results for 320x320?


> 3.42267 (0.888971) 239.063

239 FPS with batch=8 means 239/8 ≈ 30 batches per second, which means the latency can't be less than 1000/30 ≈ 33 ms.
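Spelled out with the batch=8 numbers from the table (a quick editorial check, not code from either repository):

#include <cstdio>

int main()
{
    const float fps = 239.063f;                              // reported batch=8 throughput
    const int batch = 8;
    const float batches_per_sec = fps / batch;               // ~29.9 batches per second
    const float min_latency_ms = 1000.0f / batches_per_sec;  // ~33.5 ms per batch
    std::printf("%.1f batches/s => latency >= %.1f ms\n", batches_per_sec, min_latency_ms);
    return 0;
}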


> For fairness, AlexeyAB's results are excluded since they do not include FP16 numbers.

What do you mean? There are results for FP16 and FP32: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti

jstumpin commented 3 years ago

@AlexeyAB Hi,

> Are all these results for 320x320?

> 3.42267 (0.888971) 239.063

Indeed they are.

> 239 FPS with batch=8 means 239/8 ≈ 30 batches per second, which means the latency can't be less than 1000/30 ≈ 33 ms.

The 3.42267 ms wall-time is derived by averaging, over 3000 frames, the per-frame time of a single 8-frame batch, e.g.:

int batchSize = 8;

// Time one batched inference call and record the per-frame average.
// (A float-millisecond duration avoids the integer truncation of
// duration_cast<std::chrono::milliseconds>.)
auto infer_start = std::chrono::steady_clock::now();
auto detections = infer(d_frames);
auto infer_end = std::chrono::steady_clock::now();
float infer_diff = std::chrono::duration<float, std::milli>(infer_end - infer_start).count();
avg_infer_times.push_back(infer_diff / batchSize);

therefore, averaging avg_infer_times over the 3000-frame run gives 3.42267
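The reduction step itself is not shown in the thread; a sketch of how the table's mean (and standard deviation) could be computed from that vector, assuming one entry per batch, i.e. 375 entries for 3000 frames at batch=8:

#include <cmath>
#include <numeric>
#include <vector>

// Mean and standard deviation of the recorded per-frame wall-times (ms).
void summarize(const std::vector<float>& avg_infer_times, float& mean_ms, float& stddev_ms)
{
    mean_ms = std::accumulate(avg_infer_times.begin(), avg_infer_times.end(), 0.0f)
              / avg_infer_times.size();
    float sq_sum = 0.0f;
    for (float t : avg_infer_times)
        sq_sum += (t - mean_ms) * (t - mean_ms);
    stddev_ms = std::sqrt(sq_sum / avg_infer_times.size());
}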

239.063 FPS follows the same route, except that it also includes frame/input grabbing and frame/output overlaying, e.g.:

int batchSize = 8;

// Runs once per grabbed frame; note that if this block executes every
// iteration, total_start is reset on each frame, so the timed window covers
// only the final grab plus one batched infer + download + draw pass.
auto total_start = std::chrono::steady_clock::now();
d_reader->nextFrame(d_frame);
d_frames.push_back(d_frame.clone());
counter++;
if (counter == batchSize)
{
    counter = 0;
    ...
    auto detections = infer(d_frames);
    ...
    // Download each frame from the GPU and draw its detections.
    for (int b = 0; b < batchSize; ++b)
    {
        d_frames[b].download(frame);
        for (const auto& detection : detections[b])
            draw(detection.classId, detection.confidence, detection.left,
                 detection.top, detection.right, detection.bottom, frame);
    }
    auto total_end = std::chrono::steady_clock::now();
    float total_diff = std::chrono::duration<float, std::milli>(total_end - total_start).count();
    avg_total_fps.push_back(1000 / (total_diff / batchSize));
}

therefore, averaging avg_total_fps over the 3000-frame run gives 239.063

Thus latency is not considered.

> For fairness, AlexeyAB's results are excluded since they do not include FP16 numbers.

> What do you mean? There are results for FP16 and FP32: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti

The second column looks like FP32 performance numbers; the rest are third-party repos.

spacewalk01 commented 3 years ago

@jstumpin I am using an RTX 3070 GPU with 8 GB of memory to run the YOLOv4 model on TensorRT with FP16 precision. I obtained 135 FPS on average with the preprocessing kernel function implemented by CaoWGG/TensorRT-YOLOv4 (no batching). But with this GitHub implementation (pre- & post-processing), I obtained about 40 FPS per batch with batch_size=4, which gives me a total of 160 FPS. In your results table, the speed of batch processing (223.68 FPS) at batch_size=4 is almost 2x that of batch_size=1 (120.831 FPS). I wonder why it is much slower in my case.
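One way to narrow this down is to time the GPU inference call alone across batch sizes, taking pre- and post-processing out of the picture: if the raw engine scales on the RTX 3070 but the end-to-end number does not, the bottleneck is outside the engine. A sketch assuming a TensorRT 7-style implicit-batch engine (gpuMsPerFrame and the preallocated bindings are placeholders, not code from this repository):

#include <chrono>
#include "NvInfer.h"

// Average per-frame GPU time for a given batch size, timing only the
// synchronous TensorRT execute() call over `iters` repetitions.
float gpuMsPerFrame(nvinfer1::IExecutionContext* ctx, void** bindings, int batchSize, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        ctx->execute(batchSize, bindings);
    auto t1 = std::chrono::steady_clock::now();
    float ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
    return ms / (iters * batchSize);
}

Comparing gpuMsPerFrame(ctx, bindings, 1, 100) against gpuMsPerFrame(ctx, bindings, 4, 100) should show whether the roughly 2x batch=4 speedup from the table reproduces on your card.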