biubug6 / Pytorch_Retinaface

RetinaFace gets 80.99% on the WIDER FACE hard validation set using MobileNet0.25.
MIT License
2.63k stars 774 forks

FPS Problem #62

Open DingtianX opened 4 years ago

DingtianX commented 4 years ago

Why is the FPS below 10 at 1920 * 1080 resolution on the GPU (CUDA) when testing a video stream (with your pretrained model)?

DingtianX commented 4 years ago

GTX 1070; GPU utilization is low.

DingtianX commented 4 years ago
net = RetinaFace(cfg=cfg, phase='test')
net = load_model(net, args.trained_model, args.cpu)
net.eval()
print('Finished loading model!')
# print(net)
cudnn.benchmark = True
device = torch.device("cpu" if args.cpu else "cuda")

CPU usage is high. I don't think my GPU is being used, even though device shows that CUDA is in use.

brealisty commented 4 years ago

Why is the FPS only less than 10 frames in 1920 * 1080 resolution under GPU (CUDA) when testing video stream(with your pretrained model)?

I think it's the priorbox: the net forward pass needs only about 0.008s, but the priorbox decode needs about 0.14s. I don't know how to optimize it, though.

dufourpascal commented 4 years ago

Careful, CUDA kernel calls are asynchronous! That is, the computation may not complete until later in your Python code. I believe the time measurement in detect.py does not actually measure the model's execution time correctly.

If I time it like this:

t0 = time.time()
loc, conf, landms = net(img)  # forward pass

t1 = time.time()
conf.cpu() # blocks until result is computed
t2 = time.time()

print(t1 - t0, t2 - t1)

I get a timing of roughly 0.01s (for t1 - t0) and 0.34s (for t2 - t1).

You can disable the asynchronous kernel launches by setting the environment variable CUDA_LAUNCH_BLOCKING=1 to test this.
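A minimal sketch of a timing helper that accounts for this, in case it is useful. The helper name `timed` and its `sync` parameter are made up for illustration and are not part of detect.py; the only PyTorch call it relies on is `torch.cuda.synchronize`, which blocks until all queued CUDA work has finished.

```python
import time


def timed(fn, *args, sync=None, **kwargs):
    """Call fn and return (result, elapsed_seconds).

    `sync` is a hypothetical hook: pass a callable that blocks until the
    device is idle, so queued GPU work is included in the measurement.
    """
    if sync is not None:
        sync()  # drain work queued before the measurement starts
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()  # wait for work launched by fn to finish
    return result, time.perf_counter() - start


# With PyTorch on a CUDA device (net and img as in the snippet above):
#   import torch
#   (loc, conf, landms), seconds = timed(net, img, sync=torch.cuda.synchronize)
```

Without the `sync` argument this measures only how long it takes to queue the work, which would match the small t1 - t0 figure above.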

DingtianX commented 4 years ago

Why is the FPS only less than 10 frames in 1920 * 1080 resolution under GPU (CUDA) when testing video stream(with your pretrained model)?

I think it's priorbox, net forward need just about 0.008s, but the priorbox decode need about 0.14s. But I don't know how to optimize it

My net forward time is about 0.02s (on a GTX 1070 or GTX 970M). Which GPU are you using?

brealisty commented 4 years ago

If your resolution is fixed, you can compute the priorbox just one time. Set this value outside the per-frame loop and reuse it.
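A rough sketch of that idea in plain Python, caching the anchors per image size so they are built only once. The function name `cached_priors` and the `MIN_SIZES` / `STEPS` values are assumptions for illustration, not read from the repository config, and the anchor layout is a simplified stand-in for the repo's PriorBox.

```python
from functools import lru_cache
from itertools import product
from math import ceil

# Illustrative assumptions, not taken from the repository config.
MIN_SIZES = ((16, 32), (64, 128), (256, 512))
STEPS = (8, 16, 32)


@lru_cache(maxsize=8)
def cached_priors(height, width):
    """Build (cx, cy, w, h) anchors, normalized by image size, for one resolution."""
    anchors = []
    for sizes, step in zip(MIN_SIZES, STEPS):
        rows, cols = ceil(height / step), ceil(width / step)
        for i, j in product(range(rows), range(cols)):
            for size in sizes:
                anchors.append((
                    (j + 0.5) * step / width,   # center x
                    (i + 0.5) * step / height,  # center y
                    size / width,               # box width
                    size / height,              # box height
                ))
    return tuple(anchors)


# Per-frame loop: only the first call for a given size does any work.
for _ in range(3):
    priors = cached_priors(1080, 1920)
```

In a real pipeline the cached result would presumably be converted to a tensor and moved to the GPU once as well, so that neither step is repeated per frame.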
