HRNet / DEKR

This is an official implementation of our CVPR 2021 paper "Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression" (https://arxiv.org/abs/2104.02300)
MIT License

How's the speed for inference time? #1

Open Stephenfang51 opened 3 years ago

Stephenfang51 commented 3 years ago

Excellent work! Just wondering, is there any benchmark for inference time?

Thanks!

josemusso commented 3 years ago

Hi! On a 1660 Ti Max-Q it gives around 2.7 FPS, and on a Tesla T4 around 3 FPS. I'm figuring out how to reduce inference time.

Stephenfang51 commented 3 years ago

> Hi! On a 1660 Ti Max-Q it gives around 2.7 FPS, and on a Tesla T4 around 3 FPS. I'm figuring out how to reduce inference time.

That slow!? Thanks for your feedback :)

xsacha commented 3 years ago

Getting 10 FPS without flip and 7 FPS with flip on a Titan V.

Haven't been able to trace the model and preserve size. Also, the torchvision op (deform_conv2d) is an issue when trying to run it in C++.

I saved a lot of time by removing a lot of the post-processing of the posemodel and heatmap. There's a lot of work there, and all it is doing is adding the x, y coordinate to every value in the posemodel.
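The idea is roughly the following (a minimal sketch with hypothetical names, not the repo's exact code):

```python
import torch

# Rough sketch: the regression post-processing essentially combines a pixel-
# coordinate grid with the predicted offsets (DEKR subtracts the offsets from
# the grid) to obtain absolute joint positions, so it can be done in one
# vectorized step instead of the original reshape-heavy loop.
def decode_offsets(offset):
    # offset: (2 * num_joints, H, W), interleaved as (x, y) pairs per joint
    num_offset, h, w = offset.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=offset.device),
        torch.arange(w, dtype=torch.float32, device=offset.device),
    )
    # stack the grid as (x, y) and tile it once per joint
    grid = torch.stack((xs, ys), dim=0).repeat(num_offset // 2, 1, 1)
    return grid - offset  # absolute pose candidates, same shape as offset
```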

josemusso commented 3 years ago

> Getting 10 FPS without flip and 7 FPS with flip on a Titan V.
>
> Haven't been able to trace the model and preserve size. Also, the torchvision op (deform_conv2d) is an issue when trying to run it in C++.
>
> I saved a lot of time by removing a lot of the post-processing of the posemodel and heatmap. There's a lot of work there, and all it is doing is adding the x, y coordinate to every value in the posemodel.

Hi @xsacha! Could you give me some details regarding the removed post-processing, or could you share your code? I'm now starting tests on an RTX 3060 and could give you some performance numbers later on.

ZXin0305 commented 3 years ago

> Getting 10 FPS without flip and 7 FPS with flip on a Titan V.
>
> Haven't been able to trace the model and preserve size. Also, the torchvision op (deform_conv2d) is an issue when trying to run it in C++.
>
> I saved a lot of time by removing a lot of the post-processing of the posemodel and heatmap. There's a lot of work there, and all it is doing is adding the x, y coordinate to every value in the posemodel.

Hello! Have you managed to reduce the inference time further? I also get a low speed of about 10 FPS when applying the project in practice. Thanks!

xsacha commented 3 years ago

Yeah, sorry, I have given up on this one since the inference time was too high and the deformable conv makes it harder to use. Dropping the deform conv lowered the accuracy.

TEST:
  FLIP_TEST: True
  IMAGES_PER_GPU: 1
diff --git a/lib/core/inference.py b/lib/core/inference.py
index 3be74a5..2f3ce42 100644
--- a/lib/core/inference.py
+++ b/lib/core/inference.py
@@ -16,7 +16,7 @@ from dataset.transforms import FLIP_CONFIG
 from utils.transforms import up_interpolate

-def get_locations(output_h, output_w, device):
+def get_locations(output_h, output_w, device, num_joints):
     shifts_x = torch.arange(
         0, output_w, step=1,
         dtype=torch.float32, device=device
@@ -26,34 +26,21 @@ def get_locations(output_h, output_w, device):
         dtype=torch.float32, device=device
     )
     shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
-    shift_x = shift_x.reshape(-1)
-    shift_y = shift_y.reshape(-1)
-    locations = torch.stack((shift_x, shift_y), dim=1)
-
+    locations = torch.stack((shift_x, shift_y), dim=0).repeat(num_joints, 1, 1)
     return locations

-
-def get_reg_poses(offset, num_joints):
-    _, h, w = offset.shape
-    offset = offset.permute(1, 2, 0).reshape(h*w, num_joints, 2)
-    locations = get_locations(h, w, offset.device)
-    locations = locations[:, None, :].expand(-1, num_joints, -1)
-    poses = locations - offset
-
-    return poses
-
-
 def offset_to_pose(offset, flip=True, flip_index=None):
     num_offset, h, w = offset.shape[1:]
-    num_joints = int(num_offset/2)
-    reg_poses = get_reg_poses(offset[0], num_joints)
+    locations = get_locations(h, w, offset.device, num_offset // 2)
+    reg_poses = locations - offset[0]

     if flip:
-        reg_poses = reg_poses[:, flip_index, :]
-        reg_poses[:, :, 0] = w - reg_poses[:, :, 0] - 1
+        reg_poses = reg_poses.view(-1, 2, h, w)
+        reg_poses = reg_poses[flip_index, :, :]
+        reg_poses[:, 0, :] = w - reg_poses[:, 0, :] - 1
+        reg_poses = reg_poses.reshape(num_offset, h, w)

-    reg_poses = reg_poses.contiguous().view(h*w, 2*num_joints).permute(1,0)
-    reg_poses = reg_poses.contiguous().view(1,-1,h,w).contiguous()
+    reg_poses = reg_poses.unsqueeze(0)

     return reg_poses
 @@ -75,7 +65,7 @@ def get_multi_stage_outputs(cfg, model, image, with_flip=False):
                 for new dataset: %s.' % cfg.DATASET.DATASET)

         image = torch.flip(image, [3])
-        image[:, :, :, :-3] = image[:, :, :, 3:]
+        #image[:, :, :, :-3] = image[:, :, :, 3:]
         heatmap_flip, offset_flip = model(image)

         heatmap_flip = torch.flip(heatmap_flip, [3])
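
For reference, a quick shape check of the patched offset_to_pose might look like this (a rough sketch; the flip order is written from memory, so double-check it against FLIP_CONFIG in dataset/transforms, and the import assumes lib/ is on sys.path):

```python
import torch
from core.inference import offset_to_pose  # lib/core/inference.py in this repo

# COCO left/right swap order used for flip testing (from memory)
flip_index = [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15]

offset = torch.randn(1, 2 * 17, 128, 128)              # dummy offset head output
posemap = offset_to_pose(offset, flip=True, flip_index=flip_index)
print(posemap.shape)                                    # torch.Size([1, 34, 128, 128])
```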
Deep-learning999 commented 3 years ago

HRNet is very cumbersome; is there a lightweight model that can be applied to this algorithm?

wmcnally commented 2 years ago

Here is the speed on my TITAN Xp GPU (evaluated over the 5000 COCO validation images using a batch size of 1, including NMS, excluding the rescoring network and heatmap matching):

DEKR32 (without TTA, 62.4 AP): 10.26 FPS
DEKR48 (without TTA, 66.3 AP): 6.45 FPS
DEKR32 (with TTA, 70.1 AP): 1.50 FPS
DEKR48 (with TTA, 71.8 AP): 0.86 FPS

TTA: test-time augmentation (multi-scale testing at 0.5, 1, and 2, plus flip)
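
For anyone reproducing these numbers, the measurement boils down to a loop like this (a sketch only; the model call signature matches what the repo's inference code uses, everything else is a placeholder):

```python
import time
import torch

# Rough sketch of a batch-size-1 FPS measurement with optional flip test.
@torch.no_grad()
def measure_fps(model, images, flip=False):
    model.eval()
    torch.cuda.synchronize()
    start = time.time()
    for img in images:                 # each img: a (1, 3, H, W) tensor on the GPU
        heatmap, offset = model(img)
        if flip:                       # flip TTA roughly doubles the cost per image
            model(torch.flip(img, [3]))
    torch.cuda.synchronize()
    return len(images) / (time.time() - start)
```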

lucasjinreal commented 2 years ago

That's too slow... at this speed I could instead apply a top-down model with higher accuracy...

wmcnally commented 2 years ago

@jinfagang check out KAPAO, it’s pretty fast.

xsacha commented 2 years ago

@wmcnally Thanks. That looks quite fast. The videos look very inaccurate (stolen joints), but I guess they are showing the small model. That can probably be resolved, as it is usually a post-processing issue.