Robert-JunWang / Pelee

Pelee: A Real-Time Object Detection System on Mobile Devices
Apache License 2.0
885 stars 254 forks

Inference Time Issue #66

Open cathy-kim opened 5 years ago

cathy-kim commented 5 years ago

@Robert-JunWang Hi, thanks for your work.

With the merged Caffe model I only get 48 FPS on a TX2 with TensorRT 4.1.4, which is slower than MobileNet-SSD (about 54 FPS). I have already optimized my TX2 with jetson_clocks.sh (I think I have already done what you suggested in Issue #43).

Would you tell me how you got 70+ FPS? Thanks

Shreeyak commented 5 years ago

@Robert-JunWang Could you also tell us how you got ~100 FPS with YOLOv3-tiny? I'm running yolov3-tiny at 320×320 in TensorFlow, without TensorRT, and only getting ~12 FPS. Clocks are maxed on my TX2. I don't understand the 10x performance gap!

Robert-JunWang commented 5 years ago

> @Robert-JunWang Hi, thanks for your work.
>
> With the merged Caffe model I only get 48 FPS on a TX2 with TensorRT 4.1.4, which is slower than MobileNet-SSD (about 54 FPS). I have already optimized my TX2 with jetson_clocks.sh (I think I have already done what you suggested in Issue #43).
>
> Would you tell me how you got 70+ FPS? Thanks

That speed does not include the post-processing part (decoding the bounding boxes and NMS). The post-processing can be done on the CPU asynchronously, so the real end-to-end speed is almost the same as the one I reported. Both MobileNet+SSD and Pelee run at over 70 FPS in FP32 mode.
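Roughly, the pattern is: enqueue the network on a CUDA stream and decode/NMS the previous frame's output on a CPU thread while the GPU works on the current frame. The sketch below is illustrative only, not code from this repo; `Detections` and `decodeAndNms` stand in for your own post-processing.

```cpp
#include <future>
#include <cuda_runtime.h>
#include <NvInfer.h>

// Hypothetical placeholders for the CPU-side post-processing.
struct Detections { /* decoded boxes, scores, class ids */ };
Detections decodeAndNms(const float* rawOutput);

// Overlap GPU inference on frame f with CPU post-processing of frame f-1.
void inferLoop(nvinfer1::IExecutionContext* ctx, void** buffers,
               int outIndex, size_t outBytes, int numFrames) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Double-buffered, pinned host output so the async copy and the
    // CPU decode of the previous frame never touch the same buffer.
    float* hostOut[2];
    cudaMallocHost(reinterpret_cast<void**>(&hostOut[0]), outBytes);
    cudaMallocHost(reinterpret_cast<void**>(&hostOut[1]), outBytes);

    std::future<Detections> pending;
    for (int f = 0, cur = 0; f < numFrames; ++f, cur ^= 1) {
        ctx->enqueue(1, buffers, stream, nullptr);          // frame f on GPU
        cudaMemcpyAsync(hostOut[cur], buffers[outIndex], outBytes,
                        cudaMemcpyDeviceToHost, stream);
        if (pending.valid())
            pending.get();          // frame f-1 finished decoding meanwhile
        cudaStreamSynchronize(stream);
        pending = std::async(std::launch::async, decodeAndNms, hostOut[cur]);
    }
    if (pending.valid()) pending.get();
    cudaFreeHost(hostOut[0]);
    cudaFreeHost(hostOut[1]);
    cudaStreamDestroy(stream);
}
```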

Robert-JunWang commented 5 years ago

> @Robert-JunWang Could you also tell us how you got ~100 FPS with YOLOv3-tiny? I'm running yolov3-tiny at 320×320 in TensorFlow, without TensorRT, and only getting ~12 FPS. Clocks are maxed on my TX2. I don't understand the 10x performance gap!

I created a Caffe model of tiny-YOLOv3 myself and tested its speed with random weights. That speed also does not include the post-processing part, and the input dimension is 416, not 320. The only difference between my model and the original paper is that I use ReLU instead of leaky ReLU, but I do not think that makes much difference in speed. Tiny-YOLOv3 can benefit from FP16 inference as well: in FP16 mode the model is about 1.8 to 2 times faster than in FP32 mode.
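For reference, enabling FP16 is a single flag at engine-build time. A rough sketch, assuming a TensorRT 4-era C++ API (function name and workspace size here are illustrative):

```cpp
#include <NvInfer.h>

// Build the engine with FP16 kernels enabled. On older TensorRT
// releases the equivalent call was builder->setHalf2Mode(true).
nvinfer1::ICudaEngine* buildFp16Engine(nvinfer1::IBuilder* builder,
                                       nvinfer1::INetworkDefinition& network) {
    if (builder->platformHasFastFp16())   // true on the TX2's Pascal GPU
        builder->setFp16Mode(true);       // let TensorRT choose FP16 kernels
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);
    return builder->buildCudaEngine(network);
}
```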

I have never compared the speed of TensorFlow and TensorRT, but I do not think there is a 10x gap between the two frameworks. You can remove the pre-processing and post-processing parts of your pipeline and see whether that accounts for the difference.

Shreeyak commented 5 years ago

Oh, thank you for the explanations, that makes a lot more sense now! I should also take a look at how to do the post-processing asynchronously. Would you happen to have a repo/post/example of how to do that?

Would you happen to have any FPS benchmarks that include the post-processing?

dbellan commented 5 years ago

@ginn24 Could you please tell me how you defined the detection_out layer plugin?

I populate the plugin factory with `mDetection_out = std::unique_ptr<INvPlugin, decltype(nvPluginDeleter)>(createSSDDetectionOutputPlugin(params), nvPluginDeleter);`

but during the building of the engine I get the following error: NvPluginSSD.cu:795: virtual void nvinfer1::plugin::DetectionOutput::configure(const nvinfer1::Dims*, int, const nvinfer1::Dims*, int, int): Assertion `numPriors*numLocClasses*4 == inputDims[param.inputOrder[0]].d[0]' failed.

Usually this error is due to a wrong layer name or a wrong params.inputOrder, but everything looks correct on my side. I suspect it is something related to how I created the plugin. May I ask how you did it?
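For what it's worth, with the legacy INvPlugin API the plugin is created from a DetectionOutputParameters struct whose inputOrder field maps the plugin's three inputs to {loc, conf, priorbox}; the assertion above typically fires when inputOrder[0] does not point at the location tensor. A rough sketch with illustrative field values (take the real ones from your deploy.prototxt):

```cpp
#include <NvInferPlugin.h>

using namespace nvinfer1;
using namespace nvinfer1::plugin;

// Sketch only: creating the SSD DetectionOutput plugin with the
// legacy INvPlugin API (TensorRT 4-era NvInferPlugin.h).
INvPlugin* makeDetectionOut() {
    DetectionOutputParameters p{};
    p.shareLocation = true;
    p.varianceEncodedInTarget = false;
    p.backgroundLabelId = 0;
    p.numClasses = 21;                 // e.g. VOC: 20 classes + background
    p.topK = 400;
    p.keepTopK = 200;
    p.confidenceThreshold = 0.01f;
    p.nmsThreshold = 0.45f;
    p.codeType = CodeTypeSSD::CENTER_SIZE;
    // Order of the plugin's inputs: {loc, conf, priorbox}. If the
    // assertion on inputDims[param.inputOrder[0]] fires, index 0 here
    // is probably not pointing at the location (mbox_loc) tensor.
    p.inputOrder[0] = 0;  // loc
    p.inputOrder[1] = 1;  // conf
    p.inputOrder[2] = 2;  // priorbox
    p.confSigmoid = false;
    p.isNormalized = true;
    return createSSDDetectionOutputPlugin(p);
}
```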

cathy-kim commented 5 years ago

@dbellan I just uploaded my Pelee-TensorRT code. You can check it out here: https://github.com/ginn24/Pelee-TensorRT

This version of the code visualizes detection_out. It does not include measuring inference time; if you need to check the inference time, you should implement GPU timing yourself.
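GPU time is usually measured with CUDA events around the enqueue call. A minimal sketch (the helper name is illustrative, not from the linked repo):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <NvInfer.h>

// Measure GPU inference time with CUDA events, excluding
// pre/post-processing. `ctx` and `buffers` as in the TensorRT samples.
float timeInference(nvinfer1::IExecutionContext* ctx, void** buffers,
                    cudaStream_t stream, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i)
        ctx->enqueue(1, buffers, stream, nullptr);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);            // wait for the GPU to finish
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms / frame (%.1f FPS)\n", ms / iters, 1000.f * iters / ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}
```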

dbellan commented 5 years ago

Thank you @ginn24. I'll have a look.