ViTAE-Transformer / ViTPose

The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
Apache License 2.0

Inference speed #48

Open gpastal24 opened 1 year ago

gpastal24 commented 1 year ago

Are you sure that this method is faster than HRNet? I have tried both with YOLOv5 as the detector in TensorRT inference. HRNet achieves around 30-35 FPS, while ViTPose can only reach 7 FPS on the same video with TensorRT. Inference tests I have conducted show that, for some reason, HRNet is 6-7 times faster when using larger batch sizes (around 220 FPS per target for FP16 and 450 FPS for INT8), while ViTPose achieves around 60 FPS per target in TensorRT.

Annbless commented 1 year ago

Thanks for your attention. Please refer to the paper for the settings used in the speed test. With advanced GPUs and the PyTorch framework, ViTPose is faster than HRNet. Besides, the inference speed under TensorRT depends not only on the model but also on the build configuration, e.g., the maximum workspace memory allowed or the optimal computation manner found during the kernel search.
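To illustrate how those TensorRT build settings are exposed, a hypothetical `trtexec` invocation might look like the following (the ONNX file name, workspace size, and engine name are placeholders, not values from this thread):

```shell
# Build an FP16 engine from an exported ONNX model (file names are placeholders).
# --memPoolSize caps the workspace TensorRT may use while searching for the
# fastest kernels; a pool that is too small can rule out faster tactics,
# which is one way the same model ends up with very different throughput.
trtexec --onnx=vitpose.onnx \
        --fp16 \
        --memPoolSize=workspace:4096 \
        --saveEngine=vitpose_fp16.engine
```

Re-running the build with a larger workspace (and with `--int8` where calibration data is available) is a common first step when two models benchmark inconsistently under TensorRT.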

gpastal24 commented 1 year ago

Hi, thank you for answering. I ran a test in native PyTorch as well. ViTPose was indeed faster than (or similar to) HRNet when the batch size was equal to 1. I tested the YOLOv5 + HRNet and YOLOv5 + ViTPose pipelines with a webcam for single-person inference, and the ViTPose method indeed had higher FPS. However, when I increased the batch size to 10, HRNet was for some reason 2-3 times faster, both in the inference test and on a video. If I understood correctly, could these results be related to my GPU (GTX 1650)? In the pictures below I have attached the inference tests in PyTorch; the first row in each picture is with batch size 1, the second with batch size 10.

[Screenshot from 2022-10-29 11-21-14]
[Screenshot from 2022-10-29 11-24-47]
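The per-target FPS comparison above can be reproduced with a small timing harness like the sketch below. This is a minimal, stdlib-only illustration: `fake_forward` is a hypothetical stand-in for one forward pass over a batch of person crops, and in a real PyTorch test you would replace it with the actual model call, run it on the GPU, and call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def benchmark(run_model, batch_size, n_iters=50, warmup=5):
    """Measure per-target throughput (targets/s) of a pose model.

    run_model(batch_size) is a hypothetical stand-in for one forward
    pass over `batch_size` person crops; swap in the real model call
    (and a GPU synchronize) for an actual measurement.
    """
    for _ in range(warmup):            # warm-up runs are excluded from timing
        run_model(batch_size)
    start = time.perf_counter()
    for _ in range(n_iters):
        run_model(batch_size)
    elapsed = time.perf_counter() - start
    # per-target FPS: total person crops processed per second
    return n_iters * batch_size / elapsed

def fake_forward(batch_size):
    # Toy stand-in whose cost grows sub-linearly with batch size,
    # mimicking better hardware utilisation at larger batches.
    time.sleep(0.001 + 0.0002 * batch_size)

fps_b1 = benchmark(fake_forward, batch_size=1)
fps_b10 = benchmark(fake_forward, batch_size=10)
print(f"batch 1:  {fps_b1:.0f} targets/s")
print(f"batch 10: {fps_b10:.0f} targets/s")
```

Comparing the two numbers from the same harness (same warm-up, same iteration count) rules out measurement artifacts, so any remaining gap between batch 1 and batch 10 reflects how well each model saturates the GPU.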