xonobo opened this issue 6 years ago
PeleeNet is built in a multi-branch, narrow-channel style. TensorRT can combine PeleeNet's many small branches into larger layers, which greatly reduces running time. I do not have a TX1, but the result on a TX2 is not bad: this version of Pelee (304x304) can run at over 71 FPS on TX2 + TensorRT 3.0. With some small changes to the architecture, it can run at over 104 FPS.
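For reference, the conversion step mentioned here looks roughly like the sketch below with the legacy Caffe-parser API from the TensorRT 3/4 era. The file names and output blob name are placeholders, and an SSD head additionally needs plugin layers for PriorBox/DetectionOutput (see the plugin discussion later in this thread); this is a minimal sketch, not the author's exact code.

```cpp
// Minimal sketch: build a TensorRT engine from a merged (BN-folded) Caffe model.
// Uses the legacy TensorRT 3/4 Caffe-parser API; paths/names are placeholders.
#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity != Severity::kINFO) std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();

    // Parse the merged deploy prototxt and weights into a TensorRT network.
    const IBlobNameToTensor* blobs = parser->parse(
        "pelee_merged.prototxt", "pelee_merged.caffemodel",
        *network, DataType::kFLOAT);

    // Mark the output blob; the actual name depends on your prototxt.
    network->markOutput(*blobs->find("detection_out"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);  // 16 MB scratch space

    // This is where TensorRT fuses PeleeNet's many small branches
    // into larger kernels, which is the source of the speedup.
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    network->destroy();
    parser->destroy();
    builder->destroy();
    // ... serialize the engine to disk, run inference, then engine->destroy()
    return 0;
}
```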
@xonobo How do you transfer Pelee to TensorRT? Can you share your experience?
@Robert-JunWang Can you tell me what changes to the architecture make it run at over 104 FPS?
It is so cool!!
@Robert-JunWang Hi, thanks for your work. I only got 48 FPS on TX2 + TensorRT 3.0.4, which is slower than MobileNet-SSD (54 FPS, grouped conv). You can run at 70+ FPS; can you share your experience? And can you tell me how to change the architecture to get over 104 FPS? Thanks.
I did not do any special processing; I just converted the merged Caffe model to a TensorRT engine file. That speed is for FP32; the FP16 model runs at over 100 FPS. I am surprised you can run MobileNet+SSD at over 54 FPS on TensorRT 3 with grouped conv. In my experiments, TensorRT 3.0 has very bad performance for grouped conv; it is even much slower than NVCaffe running on the CPU. The performance of grouped conv is improved greatly in TensorRT 4, where MobileNet+SSD runs at a similar speed to Pelee in FP32 mode. However, MobileNet cannot benefit from FP16 inference on TX2: the model runs at almost the same speed in FP16 mode as in FP32 mode.
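For reference, a minimal sketch of where FP16 would be enabled in a build like the one above, placed before buildCudaEngine(). setHalf2Mode() is the TensorRT 3 call; setFp16Mode() was added in TensorRT 4 (worth verifying against your TensorRT version):

```cpp
// Enable FP16 kernels when the GPU supports them (the TX2's GPU does).
// Continuation of the builder sketch above, before buildCudaEngine().
if (builder->platformHasFastFp16()) {
    builder->setFp16Mode(true);     // TensorRT 4+
    // builder->setHalf2Mode(true); // equivalent call on TensorRT 3
}
```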
Grouped conv has been optimized in cuDNN 7, and the inference time of grouped conv depends on the cuDNN library. I think that with the same cuDNN version, the time cost is the same whether you use TensorRT 3 or TensorRT 4. Yes, you are right: MobileNet cannot benefit from FP16 inference on TX2; it runs at almost the same speed in FP16 mode as in FP32 mode. On my TX2, MobileNet runs at 50 FPS in FP32 and 54 FPS in FP16. Thanks for your reply; I will retry it.
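For context, cuDNN 7 is where native grouped-convolution support landed, via cudnnSetConvolutionGroupCount(). A small sketch of what that looks like at the descriptor level (padding/stride values here are arbitrary examples):

```cpp
#include <cudnn.h>

// Sketch: configuring a grouped convolution descriptor in cuDNN 7.
// Frameworks set this once per conv layer before picking an algorithm.
cudnnConvolutionDescriptor_t makeGroupedConvDesc(int groupCount) {
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc,
        /*pad_h=*/1, /*pad_w=*/1, /*stride_h=*/1, /*stride_w=*/1,
        /*dilation_h=*/1, /*dilation_w=*/1,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    // groupCount == input channels gives a depthwise conv
    // (the MobileNet case discussed in this thread).
    cudnnSetConvolutionGroupCount(convDesc, groupCount);
    return convDesc;
}
```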
I guess you did not use jetson_clocks.sh to maximize the GPU and CPU clock speeds. After setting them, both Pelee and SSD+MobileNet run at over 70 FPS in FP32 mode. Pelee runs slightly faster than SSD+MobileNet in FP32 mode and much faster in FP16 mode on my TX2 (TensorRT 4).
May I ask a more general question about TX deployments? As far as I know, TensorRT is missing some layers used in SSD, such as Reshape, PriorBox, and DetectionOutput. In your TX timing experiments, how did you overcome this? Did you implement your own TensorRT plugin layers for the missing ones, or did you use some available code?
@xonobo I uploaded my TensorRT code for Pelee here: https://github.com/ginn24/Pelee-TensorRT
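For readers not familiar with plugins: in the TensorRT 3/4 era, a layer the Caffe parser did not know was supplied through the legacy IPlugin interface (wired up to the parser via nvcaffeparser1::IPluginFactory). The skeleton below is only a sketch of the interface's shape, using a trivial Flatten layer as the example; the real implementations for PriorBox, DetectionOutput, etc. are in the linked repo.

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

using namespace nvinfer1;

// Skeleton of a legacy IPlugin for a layer the Caffe parser doesn't know
// (here a Flatten, as used by SSD heads). Illustrative only.
class FlattenPlugin : public IPlugin {
public:
    int getNbOutputs() const override { return 1; }

    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override {
        int count = 1;
        for (int d = 0; d < inputs[0].nbDims; ++d) count *= inputs[0].d[d];
        return DimsCHW(count, 1, 1);  // collapse CHW into one axis
    }

    void configure(const Dims* inputs, int nbInputs,
                   const Dims* outputs, int nbOutputs, int maxBatch) override {
        mCopySize = 1;
        for (int d = 0; d < inputs[0].nbDims; ++d) mCopySize *= inputs[0].d[d];
        mCopySize *= sizeof(float);
    }

    int initialize() override { return 0; }
    void terminate() override {}
    size_t getWorkspaceSize(int) const override { return 0; }

    int enqueue(int batchSize, const void* const* inputs, void** outputs,
                void*, cudaStream_t stream) override {
        // Flatten is a pure reshape, so a device-to-device copy suffices.
        cudaMemcpyAsync(outputs[0], inputs[0], batchSize * mCopySize,
                        cudaMemcpyDeviceToDevice, stream);
        return 0;
    }

    size_t getSerializationSize() override { return 0; }
    void serialize(void*) override {}

private:
    size_t mCopySize{0};
};
```

The parser's IPluginFactory then matches layer names from the prototxt (via isPlugin()) and returns the right plugin instance from createPlugin().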
First of all, thank you for the good work.
Including pre- and post-processing, I get a 14 FPS detection rate on an NVIDIA TX1 board without using the TensorRT engine. A pretty good result, but I would still like your comments on it: is it reasonable compared to the FPS results reported for the iPhone?