@Robert-JunWang Could you also tell us how you get ~100 fps with yolov3-tiny? I'm running yolov3-tiny at a 320 input, in TensorFlow, without TensorRT, and only getting ~12 fps. Clocks are maxed on my TX2. I don't understand the 10x performance gap!
@Robert-JunWang Hi, thanks for your work.
With the merged Caffe model I only get 48 fps on a TX2 with TensorRT 4.1.4, which is slower than MobileNet-SSD (about 54 fps). I have already optimized my TX2 with jetson_clocks.sh (I think I have already done what you suggested in Issue #43).
Could you tell me how you got 70+ fps? Thanks
That speed does not include the post-processing part (decoding the bounding boxes and NMS). The post-processing can be done on the CPU asynchronously, so the real end-to-end speed is almost the same as the one I reported. Both MobileNet-SSD and Pelee run at over 70 FPS in FP32 mode.
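For anyone looking for a concrete starting point: below is a minimal sketch (not the code used in the paper) of how this overlap can be structured, assuming an existing TensorRT execution context and a hypothetical `decodeAndNms()` helper for the CPU side. The idea is double-buffered host output: while the GPU runs inference on frame f+1, a worker thread decodes and NMS-filters frame f.

```cpp
// Sketch of overlapping TensorRT GPU inference with CPU post-processing.
// decodeAndNms() is a hypothetical helper (decode boxes + NMS on the CPU).
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <future>

void decodeAndNms(const float* rawOutput)
{
    /* decode boxes + NMS on the CPU (application-specific) */
}

void runPipeline(nvinfer1::IExecutionContext* context, void** buffers,
                 cudaStream_t stream, const float* gpuOut,
                 float* hostOut[2], size_t outBytes, int numFrames)
{
    std::future<void> pending;
    for (int f = 0; f < numFrames; ++f) {
        float* dst = hostOut[f & 1];                    // double-buffer host output
        context->enqueue(1, buffers, stream, nullptr);  // async GPU inference
        cudaMemcpyAsync(dst, gpuOut, outBytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);                  // raw output for frame f ready
        if (pending.valid()) pending.get();             // wait for frame f-1 post-proc
        // Post-process frame f on a worker thread while the next loop
        // iteration keeps the GPU busy with frame f+1.
        pending = std::async(std::launch::async, decodeAndNms, dst);
    }
    if (pending.valid()) pending.get();                 // drain the last frame
}
```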
I created a Caffe model of tiny YOLOv3 myself and tested its speed with random weights. That speed also excludes the post-processing part, and the input dimension is 416, not 320. The only difference between my model and the original paper is that I use ReLU instead of leaky ReLU, but I do not think this makes much difference in speed. Tiny YOLOv3 benefits from FP16 inference as well: in FP16 mode the model is about 1.8 to 2 times faster than in FP32 mode.
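For reference, a short sketch of how FP16 is switched on at engine-build time with the TensorRT 4 era builder API; the `builder` and `network` objects are assumed to exist already, and later TensorRT versions use `setFp16Mode` / `BuilderFlag::kFP16` instead:

```cpp
// Enable FP16 kernels when building the engine (TensorRT 4 era API).
if (builder->platformHasFastFp16())
    builder->setHalf2Mode(true);  // build half-precision kernels where supported
nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
```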
I have never compared the speed of TensorFlow and TensorRT, but I do not think there is a 10x gap between the two frameworks. You could remove the pre-processing and post-processing parts of your pipeline and see whether that accounts for the difference.
Oh, thank you for the explanation, that makes a lot more sense now! I should also look at how to do the post-processing asynchronously. Would you happen to have a repo/post/example of how to do that?
Would you also happen to have any FPS benchmarks that include the post-processing?
@ginn24 Could you please tell me how you defined the detection_out layer plugin?
I populate the plugin factory with mDetection_out = std::unique_ptr<INvPlugin, decltype(nvPluginDeleter)>(createSSDDetectionOutputPlugin(params), nvPluginDeleter);
but while building the engine I get the following error: NvPluginSSD.cu:795: virtual void nvinfer1::plugin::DetectionOutput::configure(const nvinfer1::Dims*, int, const nvinfer1::Dims*, int, int): Assertion `numPriors * numLocClasses * 4 == inputDims[param.inputOrder[0]].d[0]' failed.
Usually this error is due to a wrong layer name or a wrong structure of params.inputOrder, but everything looks correct. I suspect it is something related to how I created the plugin. May I ask how you did it?
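For reference, here is an illustrative sketch (not ginn24's actual code) of how the legacy TensorRT 4 DetectionOutput plugin is typically configured. All field values are examples that must match your deploy.prototxt, and the inputOrder mapping is the usual cause of the assertion above:

```cpp
// Illustrative configuration of the legacy TensorRT 4 SSD DetectionOutput
// plugin. Values below are examples only.
#include <NvInferPlugin.h>

nvinfer1::plugin::DetectionOutputParameters p{};
p.shareLocation = true;
p.varianceEncodedInTarget = false;
p.backgroundLabelId = 0;
p.numClasses = 21;                       // example: 20 VOC classes + background
p.topK = 400;
p.keepTopK = 200;
p.confidenceThreshold = 0.01f;
p.nmsThreshold = 0.45f;
p.codeType = nvinfer1::plugin::CodeTypeSSD::CENTER_SIZE;
// inputOrder maps the plugin inputs to {loc_data, conf_data, priorbox_data}.
// If the layer's bottoms arrive in a different order, configure() fails
// with the numPriors * numLocClasses * 4 assertion.
p.inputOrder[0] = 0;                     // mbox_loc
p.inputOrder[1] = 1;                     // mbox_conf
p.inputOrder[2] = 2;                     // mbox_priorbox
p.confSigmoid = false;
p.isNormalized = true;

nvinfer1::plugin::INvPlugin* detOut =
    nvinfer1::plugin::createSSDDetectionOutputPlugin(p);
```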
@dbellan I just uploaded my Pelee-TensorRT code. You can check it out here: https://github.com/ginn24/Pelee-TensorRT
This version of the code visualizes the detection_out results. It does not measure inference time; if you need to check the inference time, you should add GPU timing code yourself.
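A minimal sketch of GPU-only timing with CUDA events, assuming an existing TensorRT execution context, device bindings, and stream:

```cpp
// Time the TensorRT enqueue with CUDA events so the measurement covers
// GPU work only (no pre/post-processing).
#include <NvInfer.h>
#include <cuda_runtime.h>

float timedInference(nvinfer1::IExecutionContext* context, void** buffers,
                     cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    context->enqueue(1, buffers, stream, nullptr);  // GPU inference only
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  // milliseconds per frame; FPS = 1000 / ms
}
```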
Thank you @ginn24. I'll have a look