Robert-JunWang / Pelee

Pelee: A Real-Time Object Detection System on Mobile Devices
Apache License 2.0
885 stars 254 forks

fps of Pelee on TX1 #43

Open xonobo opened 6 years ago

xonobo commented 6 years ago

First of all, thank you for the good work.

Including pre- and post-processing, I got a 14 FPS detection rate on an Nvidia TX1 board. I did not use the TensorRT engine. A pretty good result, but I would still like your comments on it: is it reasonable with respect to the FPS results given for the iPhones?
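
For reference, a minimal sketch of how such an end-to-end rate can be measured; `detect_one_frame()` is a hypothetical stand-in for the whole preprocess/forward/decode pipeline:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholder: preprocess -> network forward pass -> decode/NMS.
// Replace the empty body with the real detection pipeline.
void detect_one_frame() {}

int main() {
  const int kFrames = 200;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kFrames; ++i)
    detect_one_frame();
  const auto t1 = std::chrono::steady_clock::now();
  const double sec = std::chrono::duration<double>(t1 - t0).count();
  std::printf("end-to-end rate: %.1f FPS\n", kFrames / sec);
  return 0;
}
```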

Robert-JunWang commented 6 years ago

PeleeNet is built in a multi-branch, narrow-channel style. TensorRT can combine the many small branches of PeleeNet into larger layers and greatly reduce the running time. I do not have a TX1, but the result on a TX2 is not bad: this version of Pelee (304x304) can run at over 71 FPS on TX2 + TensorRT 3.0. With some small changes to the architecture, it can run at over 104 FPS.
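
A minimal sketch of what such a conversion looks like with the TensorRT 3/4 C++ Caffe parser. The file names and the `detection_out` output blob are assumptions, and SSD-specific layers such as PriorBox and DetectionOutput are not supported by the stock parser and need custom plugins (see later in this thread):

```cpp
#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <fstream>
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// TensorRT requires a logger; print everything above INFO severity.
class Logger : public ILogger {
  void log(Severity severity, const char* msg) override {
    if (severity != Severity::kINFO) std::cerr << msg << std::endl;
  }
} gLogger;

int main() {
  IBuilder* builder = createInferBuilder(gLogger);
  INetworkDefinition* network = builder->createNetwork();
  ICaffeParser* parser = createCaffeParser();

  // Parse the BN-merged deploy model; when the engine is built, TensorRT
  // fuses the small dense-block branches into larger kernels.
  const IBlobNameToTensor* blobs =
      parser->parse("pelee_merged.prototxt", "pelee_merged.caffemodel",
                    *network, DataType::kFLOAT);
  network->markOutput(*blobs->find("detection_out"));  // assumed blob name

  builder->setMaxBatchSize(1);
  builder->setMaxWorkspaceSize(1 << 25);  // 32 MB of build scratch space

  ICudaEngine* engine = builder->buildCudaEngine(*network);

  // Serialize the engine so deployment can skip parsing and building.
  IHostMemory* plan = engine->serialize();
  std::ofstream out("pelee_fp32.engine", std::ios::binary);
  out.write(static_cast<const char*>(plan->data()), plan->size());

  plan->destroy();
  engine->destroy();
  parser->destroy();
  network->destroy();
  builder->destroy();
  return 0;
}
```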

lucheng07082221 commented 6 years ago

@xonobo How did you transfer Pelee to TensorRT? Can you share your experience?

lqs19881030 commented 6 years ago

@Robert-JunWang Can you tell me what changes to the architecture let it run at over 104 FPS?

Ghustwb commented 6 years ago

PeleeNet is built in a multi-branch, narrow-channel style. TensorRT can combine the many small branches of PeleeNet into larger layers and greatly reduce the running time. I do not have a TX1, but the result on a TX2 is not bad: this version of Pelee (304x304) can run at over 71 FPS on TX2 + TensorRT 3.0. With some small changes to the architecture, it can run at over 104 FPS.

It is so cool!!

Ghustwb commented 6 years ago

@Robert-JunWang Hi, thanks for your work. I only got 48 FPS on TX2 + TensorRT 3.0.4, which is slower than MobileNet-SSD (54 FPS, with grouped conv). You can run at 70+ FPS; can you share your experience? And can you tell me how to change the architecture to get over 104 FPS? Thanks.

Robert-JunWang commented 6 years ago

I did not do any special processing; I just converted the merged Caffe model to a TensorRT engine file. That speed is for FP32; the FP16 model runs at over 100 FPS. I am surprised to hear that you can run MobileNet+SSD at over 54 FPS on TensorRT 3 with grouped conv. In my experiments, TensorRT 3.0 had very bad performance for grouped conv; it was even much slower than NVCaffe running on the CPU. The performance of grouped conv is greatly improved in TensorRT 4: MobileNet+SSD runs at a similar speed to Pelee in FP32 mode. However, MobileNet cannot benefit from FP16 inference on TX2; the model in FP16 mode runs at almost the same speed as in FP32 mode.
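
A hedged snippet of where the FP32/FP16 switch sits on the builder from the conversion sketch above; `setHalf2Mode` is the TensorRT 3.x flag, `setFp16Mode` its TensorRT 4 replacement:

```cpp
// Enable FP16 kernels when the GPU reports fast half precision (the TX2 does).
if (builder->platformHasFastFp16()) {
  builder->setFp16Mode(true);      // TensorRT 4+
  // builder->setHalf2Mode(true);  // equivalent flag on TensorRT 3.x
}
```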

Ghustwb commented 6 years ago

Grouped conv has been optimized in cuDNN 7; the inference time of grouped conv depends on the cuDNN library. I think that with the same cuDNN version, whether you use TensorRT 3 or TensorRT 4, the time cost is the same. Yes, you are right that MobileNet cannot benefit from FP16 inference on TX2: the model in FP16 mode runs at almost the same speed as in FP32 mode. On my TX2, MobileNet runs at 50 FPS in FP32 and 54 FPS in FP16. Thanks for your reply, I will retry it.
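
A minimal sketch of where grouped convolution enters the cuDNN 7 API (descriptor setup is abbreviated and the channel count is hypothetical; one group per input channel gives the depthwise case used by MobileNet):

```cpp
#include <cudnn.h>

int main() {
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  // A 3x3 convolution: pad 1, stride 1, dilation 1, float accumulation.
  cudnnConvolutionDescriptor_t conv;
  cudnnCreateConvolutionDescriptor(&conv);
  cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

  // cuDNN 7 added native grouped convolution; with groups == channels this
  // is a depthwise conv. Earlier cuDNN releases had no native support,
  // which is the likely reason grouped layers were so slow before.
  const int channels = 32;  // hypothetical layer width
  cudnnSetConvolutionGroupCount(conv, channels);

  cudnnDestroyConvolutionDescriptor(conv);
  cudnnDestroy(handle);
  return 0;
}
```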

Robert-JunWang commented 6 years ago

I guess you did not use jetson_clocks.sh to maximize the GPU and CPU clock speeds. After setting that, both Pelee and SSD+MobileNet run at over 70 FPS in FP32 mode. Pelee runs slightly faster than SSD+MobileNet in FP32 mode, and much faster in FP16 mode, on my TX2 (TensorRT 4).
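
For reference, the usual commands on a TX2 with JetPack 3.x, where the script sits in the home directory (newer releases install it as `jetson_clocks` on the PATH):

```sh
sudo nvpmodel -m 0        # switch to the MAXN power mode (all cores on, highest clock caps)
sudo ~/jetson_clocks.sh   # pin GPU/CPU/EMC clocks to their maximums
```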

xonobo commented 5 years ago

May I ask a more generic question about the TX deployments? As far as I know, TensorRT is missing some layers used in SSD, such as Reshape, PriorBox, and DetectionOutput. In your TX timing experiments, how did you overcome this issue? Did you implement your own TensorRT plugin layers for the missing ones, or did you use some available code?
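
For illustration, a skeleton of the TensorRT 3.x `nvinfer1::IPlugin` interface that such custom layers implement. The Flatten layer chosen here is hypothetical and only reshapes, so `enqueue()` reduces to a device-to-device copy; a real PriorBox or DetectionOutput plugin would launch its own kernels there:

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstring>

class FlattenPlugin : public nvinfer1::IPlugin {
public:
  int getNbOutputs() const override { return 1; }

  nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs,
                                     int nbInputDims) override {
    // Collapse CHW into a single dimension.
    int n = 1;
    for (int i = 0; i < inputs[0].nbDims; ++i) n *= inputs[0].d[i];
    return nvinfer1::DimsCHW{n, 1, 1};
  }

  void configure(const nvinfer1::Dims* inputDims, int nbInputs,
                 const nvinfer1::Dims* outputDims, int nbOutputs,
                 int maxBatchSize) override {
    // Record how many bytes one sample occupies, for use in enqueue().
    mCopySize = sizeof(float);
    for (int i = 0; i < inputDims[0].nbDims; ++i) mCopySize *= inputDims[0].d[i];
  }

  int initialize() override { return 0; }
  void terminate() override {}
  size_t getWorkspaceSize(int maxBatchSize) const override { return 0; }

  int enqueue(int batchSize, const void* const* inputs, void** outputs,
              void* workspace, cudaStream_t stream) override {
    // A pure reshape: the data layout is unchanged, so just copy through.
    cudaMemcpyAsync(outputs[0], inputs[0], batchSize * mCopySize,
                    cudaMemcpyDeviceToDevice, stream);
    return 0;
  }

  size_t getSerializationSize() override { return sizeof(mCopySize); }
  void serialize(void* buffer) override {
    std::memcpy(buffer, &mCopySize, sizeof(mCopySize));
  }

private:
  size_t mCopySize = 0;
};
```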

cathy-kim commented 5 years ago

@xonobo I uploaded my TensorRT code for Pelee here: https://github.com/ginn24/Pelee-TensorRT