deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

Getting an inference time of 31.227 ms for SCRFD_500M vs the claimed 3.6 ms. How? #1761

Open dexception opened 2 years ago

dexception commented 2 years ago

[screenshot: benchmark loop code]

As you can see, I ran the loop 10 times to confirm the result.

[screenshot: timing output, roughly 31.227 ms per run]

The results are way off from the time you reported. Am I missing something?

Note: the ONNX file was exported from the model.pth file given at the following link: https://1drv.ms/u/s!AswpsDO2toNKqyYWxScdiTITY4TQ?e=DjXof9

nttstar commented 2 years ago

Make sure you are using onnxruntime-gpu.
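
A quick way to verify which execution providers the installed build exposes (a minimal check; assumes a CUDA-capable machine):

python -c "import onnxruntime as ort; print(ort.get_available_providers())"

CUDAExecutionProvider should appear in the list; if only CPUExecutionProvider shows up, the CPU-only package is installed.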

joytsay commented 2 years ago

@nttstar I'm facing the same problem with scrfd_500m.onnx. My onnxruntime-gpu appears to be active, checked by:

python -c "import onnxruntime as ort; print(ort.get_device())"
>>> GPU

I performed 10 runs:

all cost: 15.78
all cost: 11.540999999999999
all cost: 9.548
all cost: 9.395000000000001
all cost: 11.373
all cost: 10.758000000000001
all cost: 10.706
all cost: 11.953
all cost: 12.467
all cost: 12.462000000000002

My GPU is an RTX 3080; you claim 3.6 ms on an AMD Ryzen 9 3950X.

joytsay commented 2 years ago

To answer my own question above: I came upon https://github.com/deepinsight/insightface/tree/master/python-package and used this instead:

pip install -U insightface

after doing:

pip install onnxruntime-gpu

I ran https://github.com/deepinsight/insightface/blob/master/python-package/insightface/model_zoo/scrfd.py and changed

https://github.com/deepinsight/insightface/blob/06897de50e327e01a33582955d5cb4222d0e67b5/python-package/insightface/model_zoo/scrfd.py#L321 to

detector = SCRFD(model_file='/root/.insightface/models/buffalo_m/det_2.5g.onnx')
detector.prepare(0) # original ctx_id -1 is for CPU, 0 is for GPU id 

and also changed https://github.com/deepinsight/insightface/blob/06897de50e327e01a33582955d5cb4222d0e67b5/python-package/insightface/model_zoo/scrfd.py#L330 to

bboxes, kpss = detector.detect(img, input_size = (640, 640))
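
Putting both changes together as a standalone script (a minimal sketch of my setup; the model path, test image name, and loop count are assumptions):

import time
import cv2
from insightface.model_zoo.scrfd import SCRFD

# model file downloaded by the insightface python-package (buffalo_m bundle)
detector = SCRFD(model_file='/root/.insightface/models/buffalo_m/det_2.5g.onnx')
detector.prepare(0)  # ctx_id 0 = first GPU; -1 would select CPU

img = cv2.imread('t1.jpg')  # hypothetical test image
detector.detect(img, input_size=(640, 640))  # warm-up, absorbs the GPU cold run

for _ in range(10):
    t0 = time.time()
    bboxes, kpss = detector.detect(img, input_size=(640, 640))
    print('all cost:', (time.time() - t0) * 1000)  # per-run time in ms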

The results were:

all cost: 733.7719999999999 (GPU cold run)
all cost: 5.401999999999999 (run 1)
all cost: 4.316000000000001 (run 2)

Here I used SCRFD_2.5G and got 4.31 ms, which is reasonable on my RTX 3080.

nttstar commented 2 years ago

@joytsay Yes, and this 4.31 ms includes post-processing.

QAQEthan commented 2 years ago

@joytsay Did you solve this problem? The inference time for scrfd_500m.onnx is way off from the paper.

joytsay commented 2 years ago

@Monkey-D-Luffy-star Yes, here are my inference times on the RTX 3080 (P.S. scrfd_500m.onnx is SCRFD_0.5GF):

Model         Backbone          Input      RTX 3080 (Linux)
CenterFace    MobileNetV2       800x800    8.55 ms
RetinaFace    MobileNet0.25     640x640    22.19 ms
SCRFD_0.5GF   Depth-wise Conv   640x640    3.625 ms
SCRFD_2.5GF   Basic Res         640x640    4.239 ms
SCRFD_10GF    Basic Res         640x640    5.875 ms

QAQEthan commented 2 years ago

@joytsay Thanks. How did you solve this problem?

joytsay commented 2 years ago

As mentioned above, I used the python-package with onnxruntime-gpu installed. My Docker environment is:

docker pull nvcr.io/nvidia/mxnet:21.09-py3

(since I wanted to benchmark RetinaFace in an MXNet environment)

and after starting the container with:

docker run --gpus all --shm-size=8g -it -v $PWD:/insight-dir nvcr.io/nvidia/mxnet:21.09-py3 bash

I repeated the steps from my earlier comment inside the MXNet container.

QAQEthan commented 2 years ago

@joytsay Thank you for your answer. I would like to know the cause of your earlier slow results, like these:

all cost: 15.78
all cost: 11.540999999999999
all cost: 9.548
all cost: 9.395000000000001
all cost: 11.373
all cost: 10.758000000000001
all cost: 10.706
all cost: 11.953
all cost: 12.467
all cost: 12.462000000000002

joytsay commented 2 years ago

This was due to using conda directly on Ubuntu. Somehow onnxruntime didn't use the GPU even when it said it did:


python -c "import onnxruntime as ort; print(ort.get_device())"
>>> GPU

I ended up using Docker instead.
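
On recent onnxruntime versions, a more reliable check than ort.get_device() is to ask a session which execution providers it actually ended up with (a minimal sketch, assuming onnxruntime-gpu and a local scrfd_500m.onnx):

import onnxruntime as ort

sess = ort.InferenceSession(
    'scrfd_500m.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
# If CUDA fails to initialize, onnxruntime silently falls back to CPU;
# the session's active provider list reveals what is really being used.
print(sess.get_providers())  # CUDAExecutionProvider should be listed first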

QAQEthan commented 2 years ago

Well, maybe I have the same problem as you; I'll try it. Thanks.

QAQEthan commented 2 years ago

@joytsay Hi bro, as you said, onnxruntime couldn't use the GPU, but I found this was due to an onnxruntime version issue. I had installed too high a version; when I lowered it (to onnxruntime 1.4), it ran successfully on the GPU. But the results are still a little different from yours and fluctuate greatly. Do you think this is caused by the version problem? Here are the scrfd_2.5g.onnx results on a 2080 Ti:

all cost: 9.543
all cost: 9.399
all cost: 9.32
all cost: 13.374
all cost: 25.564
all cost: 26.359
all cost: 27.944
all cost: 27.301
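
For reference, the downgrade described above would look something like this (version 1.4.0 per the comment; matching the build to your CUDA/cuDNN install is an assumption left to the reader):

pip uninstall -y onnxruntime onnxruntime-gpu
pip install onnxruntime-gpu==1.4.0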

jiangxiangchuan commented 2 years ago

@joytsay @nttstar My GPU is an RTX 3080, and my CPU is an Intel(R) Xeon(R) Silver 4310 @ 2.10GHz. The model file is scrfd_10g_bnkps.onnx. I ran https://github.com/deepinsight/insightface/blob/master/python-package/insightface/model_zoo/scrfd.py, and the total "all cost" time was about 10 ms, while the paper reports 5 ms. Then I timed only the session call in the forward function of the SCRFD class ("net_outs = self.session.run(self.output_names, {self.input_name: blob})") and got around 5 ms.
So my question is: does the inference time mentioned in the paper include the post-processing time? Or is my CPU just slower? Or is there some other reason?
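
For reference, isolating the session.run call from the rest of detect() might look like this (a minimal sketch; the random blob is a stand-in for the real pre-processed input):

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('scrfd_10g_bnkps.onnx',
                            providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
output_names = [o.name for o in sess.get_outputs()]

blob = np.random.randn(1, 3, 640, 640).astype(np.float32)  # stand-in for the pre-processed image
sess.run(output_names, {input_name: blob})  # warm-up

n = 100
t0 = time.time()
for _ in range(n):
    sess.run(output_names, {input_name: blob})
print('session.run mean: %.3f ms' % ((time.time() - t0) / n * 1000))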

wenzhengzeng commented 1 year ago

@jiangxiangchuan I obtained a similar result on a 3090. I think the reported time (i.e., 4.9 ms) covers only the session.run call; the time for data pre-processing is not included.