deepcam-cn / yolov5-face

YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931, ECCV Workshops 2022)
GNU General Public License v3.0

Some questions about TensorRT for Yolov5-face #76

Closed vtddggg closed 2 years ago

vtddggg commented 2 years ago

@bobo0810 Many thanks for your TensorRT inference implementation! I have some questions after successfully running the TensorRT version of Yolov5-face:

  1. The results in the table look very impressive. But in my case, I measured the runtime on a 2080 Ti GPU by running the following two snippets:

    from time import time

    start = time()
    for i in range(1000):
        pred = yolo_trt_model(img.cpu().numpy())  # TensorRT inference
    ends = time()
    # over 1000 runs, the total time in seconds equals the per-image time in ms
    print('RT for one image: {} ms'.format(ends - start))

    This code gives an RT for one image of 6 ms.

    start = time()
    for i in range(1000):
        pred = yolo_trt_model(img.cpu().numpy())  # TensorRT inference
        pred = yolo_trt_model.after_process(pred, device)
    ends = time()
    print('RT for one image: {} ms'.format(ends - start))

    This code gives an RT for one image of 11 ms. Is such a test of the RT time right in my understanding?

  2. It seems yolo_trt_model.after_process costs much time. Why not put this step into TensorRT, by uncommenting this line? I find that in the original yolov5 repo the whole model can be exported by this file. Is it possible to put the entire pipeline of Yolov5-face into TensorRT?

  3. Do the results in the table count only the yolo_trt_model.__call__ running time, or are yolo_trt_model.__call__, yolo_trt_model.after_process and non_max_suppression_face all included?

bobo0810 commented 2 years ago
  1. (1) Warm up before measuring speed (see the sketch below). (2) Different hardware and TensorRT versions support different acceleration strategies, so the speedup depends on the hardware platform.
  2. It is currently the simplest implementation. If there is time in the future, it will be converted to a full TensorRT implementation.
  3. The result in the table is TensorRT inference + Torch post-processing, which is meant to be compared against the time of the PyTorch forward pass alone. Therefore image pre-processing, NMS, etc. are not included.
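
For reference, a minimal warmed-up benchmark along the lines of point 1, reusing yolo_trt_model, img and device from the snippets above; the warm-up loop and the explicit division by the iteration count are the only additions:

    from time import time

    N_WARMUP, N_RUNS = 50, 1000

    # Warm up: the first runs include engine/context initialization and are slow.
    for _ in range(N_WARMUP):
        yolo_trt_model(img.cpu().numpy())

    start = time()
    for _ in range(N_RUNS):
        pred = yolo_trt_model(img.cpu().numpy())           # TensorRT inference
        pred = yolo_trt_model.after_process(pred, device)  # Torch post-processing
    ends = time()
    print('RT for one image: {:.2f} ms'.format((ends - start) / N_RUNS * 1000))
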
vtddggg commented 2 years ago

@bobo0810 Thank you for your reply. What hardware and TensorRT version do you use?

bobo0810 commented 2 years ago

  • RTX 2080 Ti
  • TensorRT 7.2.2-1
  • CUDA 11.1

vtddggg commented 2 years ago

Oh, it looks like I have a different version of TensorRT and CUDA. Which opset version did you use for the ONNX export?

bobo0810 commented 2 years ago

  • onnx 1.8.0
  • onnxruntime 1.7.0
  • torch.onnx.export(opset_version=12)
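
As an illustration, an export call matching these versions might look as follows; model, the 640x640 input size, and the tensor names are assumptions, not taken from the repo's export script:

    import torch

    # Hypothetical: model is the loaded Yolov5-face PyTorch model in eval mode.
    model.eval()
    dummy = torch.zeros(1, 3, 640, 640)  # assumed 640x640 input

    torch.onnx.export(
        model, dummy, 'yolov5-face.onnx',
        opset_version=12,          # matches the version listed above
        input_names=['input'],     # assumed name
        output_names=['output'],   # assumed name
    )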

vtddggg commented 2 years ago

👍, I am going to reproduce the results with the requirements you provided.

Tetsujinfr commented 2 years ago

@bobo0810 Did you use batch=1 for your TRT inference speed test, or a higher number?

bobo0810 commented 2 years ago

@bobo0810 Did you use batch=1 for your TRT inference speed test, or a higher number?

batch=1

Tetsujinfr commented 2 years ago

ok thanks. Did you use FP16 too?

I struggle to get any performance improvement from TensorRT on my 980 Ti with TRT 7.2.3. I appreciate that the hardware is different from yours, but I wonder why there is such a performance gap on my end. On my side the TRT performance is worse than the regular repo code... (but I can still hear my GPU fans, so it is being used a bit). Maybe I did not modify the code appropriately.

Any chance you can share the TRT code you used for your benchmark numbers?
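
For context, FP16 is selected when the TensorRT engine is built, not at inference time. A rough sketch using the pre-TensorRT-8 Python API, assuming the engine is built from an ONNX file; note that the 980 Ti is a Maxwell card without fast FP16 units, so the flag would bring little there:

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    def build_engine(onnx_path, fp16=True):
        builder = trt.Builder(TRT_LOGGER)
        flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = builder.create_network(flag)
        parser = trt.OnnxParser(network, TRT_LOGGER)
        with open(onnx_path, 'rb') as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GiB of build workspace
        # Enable FP16 only where the hardware actually accelerates it.
        if fp16 and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        return builder.build_engine(network, config)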

bobo0810 commented 2 years ago

Any chance you can share the TRT code you used for your benchmark numbers?

https://github.com/deepcam-cn/yolov5-face/issues/76#issue-1022279361 Similar to the second snippet above; remember to warm up before testing.

Tetsujinfr commented 2 years ago

What do you mean by warm up? If that refers to the slow initial inference, yes, I observe it: the code hangs for 10-15 seconds the first time, but even after that it takes about 250 ms per frame (that includes pre-processing, inference and post-processing, but most of the time is spent on inference). This is on par with, or even slower than, the standard implementation. There is something I am likely doing wrong, since I know TensorRT speeds up other repos I have worked with.

bobo0810 commented 2 years ago

What do you mean by warm up? If that refers to the slow initial inference, yes, I observe it: the code hangs for 10-15 seconds the first time, but even after that it takes about 250 ms per frame (that includes pre-processing, inference and post-processing, but most of the time is spent on inference). This is on par with, or even slower than, the standard implementation. There is something I am likely doing wrong, since I know TensorRT speeds up other repos I have worked with.

https://github.com/deepcam-cn/yolov5-face/issues/76#issuecomment-939733533 So, it may be due to the hardware.

Tetsujinfr commented 2 years ago

I will test on an RTX 3090 soon and will share how it goes.