deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

Why is my retinaface_mnet_0.25 running at 80ms on an RTX 2080 Ti when the paper says it runs at 6ms on a P40? #1028

Open yxchng opened 4 years ago

yxchng commented 4 years ago

Isn't RTX2080Ti faster than P40?

xsacha commented 4 years ago

80ms is far too high. That's how fast it runs on an Intel 8-core CPU on Linux. Make sure it's not running on CPU.

yxchng commented 4 years ago

@xsacha What speed do you get? I am talking about an image of size 1920x1080, and the time is inclusive of preprocessing and postprocessing.

xsacha commented 4 years ago

I am getting 2ms inference when converted to PyTorch JIT on a Tesla T4 (which is like a slower RTX 2080), but I believe MXNet is a similar speed. On Windows, too.

I believe any times you see quoted are inference only and pre/post-processing depends on your own code. I get roughly 2ms pre + post processing time as well using NPPI.

MobileNet 0.25 is extremely fast but not the most accurate. The speeds you are getting definitely seem like CPU.

yxchng commented 4 years ago

@xsacha Comparing it with optimized code is pointless. I got my timing by using this official API (no JIT, no NPPI, no C++; only Python, NumPy, MXNet): http://insightface.ai/build/examples_face_detection/demo_retinaface.html. The corresponding code is here: https://github.com/deepinsight/insightface/blob/master/python-package/insightface/model_zoo/face_detection.py. I just want to know if getting 80ms with this code is normal.

And indeed the preprocessing and postprocessing of this official code is on CPU. Only the inference code is on GPU.

In particular, I time this line:

bbox, landmark = model.detect(img, threshold=0.5, scale=1.0)
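
Roughly, the harness around that call looks like this (a sketch based on the python-package README; the exact model name, the warm-up call and the ctx_id are assumptions to adapt):

```python
# Sketch of a timing harness around model.detect (insightface python-package,
# MXNet backend). Model name, warm-up call and ctx_id are assumptions.
import time
import cv2
import insightface

model = insightface.model_zoo.get_model('retinaface_mnet025_v2')
model.prepare(ctx_id=0, nms=0.4)             # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread('frame_1920x1080.jpg')      # hypothetical 1920x1080 test frame

model.detect(img, threshold=0.5, scale=1.0)  # warm-up: first call includes CUDA/cuDNN init

t0 = time.perf_counter()
bbox, landmark = model.detect(img, threshold=0.5, scale=1.0)
print('detect: %.1f ms' % ((time.perf_counter() - t0) * 1000))
```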

Since you got 2ms, does that mean your whole pipeline (inclusive of pre- and post-processing) is faster than this person's code (https://github.com/clancylian/retinaface)? The following is a snippet from his GitHub. He only got 2ms for a small image (448x448), while my test is on a large image (1920x1080).

[Screenshot from 2019-12-27 08-58-30: timing snippet from the clancylian/retinaface README]

xsacha commented 4 years ago

My input image is 1920x1080, but the image is scaled to meet the requirements of my facial recognition (> 20 pixels between eyes), which means the actual size of the image passed to the detector is much smaller. The scaling factor I use for 1920x1080 is 0.4x, which gives an inference resolution of 768x432. I get 2ms for this on a T4, but using a different framework (PyTorch JIT in C++).
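
For comparison, the same downscaling idea written against the insightface python-package would look roughly like this (a sketch; the 0.4 factor, the model name and the coordinate rescaling are assumptions):

```python
# Sketch: detect on a 0.4x-downscaled 1920x1080 frame, then map detections
# back to the original resolution. Names here are assumptions.
import cv2
import insightface

detector = insightface.model_zoo.get_model('retinaface_mnet025_v2')
detector.prepare(ctx_id=0, nms=0.4)

scale = 0.4                                        # keeps eye distance above ~20 px
img = cv2.imread('frame_1920x1080.jpg')            # 1920x1080 input
small = cv2.resize(img, None, fx=scale, fy=scale)  # -> 768x432 actually sent to the detector

bbox, landmark = detector.detect(small, threshold=0.5, scale=1.0)
bbox[:, 0:4] /= scale                              # boxes back to 1920x1080 coordinates
landmark /= scale                                  # landmarks likewise
```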

no JIT, no NPPI, no C++

If you look at the screenshot you took, it says "use nvidia npp library to speed up preprocess". The timings you are reading are from someone using an optimised TensorRT implementation in C++ with NPPI.

By the way, the link you gave uses CPU. It appears to do age and gender as well and it says it took 30 seconds to process a few faces.

Use CPU to do all the job. Please change ctx-id to a positive number if you have GPUs
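
A quick sanity check that the MXNet build can actually see a GPU (a sketch; assumes a CUDA build of MXNet is installed):

```python
# Verify that MXNet was built with CUDA and can create a GPU context;
# otherwise inference will be running on the CPU.
import mxnet as mx

print('visible GPUs:', mx.context.num_gpus())   # 0 -> CPU-only build or no GPU visible
if mx.context.num_gpus() > 0:
    x = mx.nd.ones((1, 3, 1080, 1920), ctx=mx.gpu(0))
    mx.nd.waitall()                             # raises here if the CUDA context fails
    print('GPU context OK')
```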

Cosin777 commented 4 years ago

@xsacha Hi. I can't get the inference latency to decrease when using PyTorch JIT in the Python interpreter. Did I do something wrong? I'd appreciate your answer.

xsacha commented 4 years ago

@Cosin777 Not sure why you can't get it to be faster. JIT is always faster than without, although for maximum speed you'll want to use TensorRT. The other frameworks are just easier to train with and use.

damvantai commented 4 years ago

Hi @xsacha, @Cosin777, @yingfeng, @nttstar. When I use MobileNet mnet025 on a 640x480 video at scale 1, the GPU (RTX 4000) uses 953MB/8GB and the CPU runs at 1600% (the desktop CPU has 40 cores). Why does the computation after the network forward pass (on the GPU) still use so much CPU?

xsacha commented 4 years ago

My CPU uses 0% for the same scenario. The only usage of CPU was the original loading of data. All your CPUs hitting 50% suggests it is using them for inferencing.

damvantai commented 4 years ago

My CPU uses 0% for the same scenario. The only usage of CPU was the original loading of data. All your CPUs hitting 50% suggests it is using them for inferencing.

Hi @xsacha, when I use gpuid = -1, my CPU usage is 3500% (35/40 cores), while GPU memory usage is 0/8GB. [Screenshot (25)]

damvantai commented 4 years ago

Hi @xsacha, when I use gpuid = 0, the result is: [Screenshot (24)]

xsacha commented 4 years ago

Could be due to loading image files or something. Did you eliminate these?

damvantai commented 4 years ago

Could be due to loading image files or something. Did you eliminate these?

Hi @xsacha, I think it is the preprocessing (loading and resizing the image, ~500% CPU) and the postprocessing (~1600% - 500% CPU) around the inference itself (900MB GPU).

SthPhoenix commented 4 years ago

Hi! I have noticed the same behaviour (high CPU utilization) when running retinaface mnet25; with mnet50 everything is normal. There is a workaround for Docker: pin the container to one core with --cpuset-cpus, like this: --cpuset-cpus 0-0. This means the Docker container will use only the first core. Its utilization during inference will be around 90%, but you gain about a 20-30% boost in inference time.

If you want to run multiple containers, just set their CPU affinity to different cores.

With the above workaround I was able to run up to 4 containers on a Quadro P5000, fully saturating its CUDA cores, and get around 60 fps overall for face detection + recognition with arcface-r100.
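
Outside Docker, the same pinning can be done from Python itself before loading the model (a sketch; this is a Linux-only API, and the core number is an assumption):

```python
# Pin this process to a single CPU core, mirroring --cpuset-cpus 0-0.
# Linux-only; call it before importing/initialising MXNet and the model.
import os

os.sched_setaffinity(0, {0})           # pid 0 = current process, {0} = first core
print(os.sched_getaffinity(0))         # confirm the new affinity mask
```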

xsacha commented 4 years ago

You should be able to get 160fps (mnet25 + arcface-r100) on a P5000, or higher with TensorRT conversion. If you are having trouble with CPU usage, it must be from loading the file from disk or calculating things you don't need to. Run a profiler to see what's using the CPU and then either pre-calculate or offload.
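
A minimal sketch of that profiling step (process_frames here is a hypothetical stand-in for your own decode/detect/recognise loop):

```python
# Profile the per-frame loop to find the CPU hotspots (decode, resize, NMS, ...).
import cProfile
import pstats

def process_frames():
    # hypothetical placeholder: read frames, run detection/recognition, post-process
    ...

cProfile.run('process_frames()', 'detect.prof')
pstats.Stats('detect.prof').sort_stats('cumulative').print_stats(20)
```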

SthPhoenix commented 4 years ago

You should be able to get 160fps (mnet25 + arcface-r100) on a P5000, or higher with TensorRT conversion. If you are having trouble with CPU usage, it must be from loading the file from disk or calculating things you don't need to. Run a profiler to see what's using the CPU and then either pre-calculate or offload.

You mean 160 fps with the MXNet framework? I was able to get at most 110-120 fps while running only detection, without recognition, with batch size 1 and image size 640x480. Any hints on how to reach such fps for the full recognition pipeline?

xsacha commented 4 years ago

Sorry, that was with the PyTorch framework (torch JIT). Just make sure everything stays on the GPU and you should get the fastest result.

JohannesTK commented 3 years ago

@xsacha would you be so kind as to share your PyTorch JIT script? Thanks!

xsacha commented 3 years ago

Use biubug's repo, ensure the PReLU operations are done in-place, and then convert to JIT and you will get the same times.
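
A rough sketch of that conversion (the class, config and checkpoint names follow the biubug6/Pytorch_Retinaface layout but are assumptions to adapt, not the exact script):

```python
# Trace the PyTorch RetinaFace port to TorchScript; the traced module can then
# be loaded from C++ with torch::jit::load. Paths and names are assumptions.
import torch
from models.retinaface import RetinaFace   # from the biubug6/Pytorch_Retinaface repo
from data import cfg_mnet                   # MobileNet0.25 config

net = RetinaFace(cfg=cfg_mnet, phase='test')
net.load_state_dict(torch.load('weights/mobilenet0.25_Final.pth', map_location='cpu'))
net = net.cuda().eval()

example = torch.randn(1, 3, 432, 768, device='cuda')   # hypothetical inference resolution
with torch.no_grad():
    traced = torch.jit.trace(net, example)
traced.save('retinaface_mnet025_jit.pt')                # reload with torch.jit.load(...)
```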