IDEA-Research / DWPose

"Effective Whole-body Pose Estimation with Two-stages Distillation" (ICCV 2023, CV4Metaverse Workshop)
Apache License 2.0

Hand post-processing? #16

Closed. skeletonNN closed this issue 1 year ago.

skeletonNN commented 1 year ago

Thank you for your project!

I ran the demo and the result is excellent! But when I run the opencv_onnx branch, I get a different result. Do you do any hand post-processing?

skeletonNN commented 1 year ago

For example, when I run a 10 s video, the demo gives a 20 s result, but opencv_onnx gives a 10 s result. Why does this happen?

yzd-v commented 1 year ago

If you run a video, every frame is used as input, and the processed frames are then combined back into a video, so the total number of frames should stay the same. Maybe the frame rate is different? For the same image input, are the results of onnx and opencv_onnx the same?
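The frame-rate explanation can be made concrete: a clip's duration is its frame count divided by its frame rate, so if every frame is kept but the output is written at half the input fps, a 10 s clip plays for 20 s. A minimal sketch; `output_duration` and `probe_video` are hypothetical helper names, while `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS` are standard OpenCV properties.

```python
def output_duration(num_frames: int, fps: float) -> float:
    """Duration in seconds of a clip with num_frames frames played at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def probe_video(path: str) -> tuple[int, float]:
    """Return (frame_count, fps) as reported by the container metadata."""
    import cv2  # imported lazily so the arithmetic helper works without OpenCV

    cap = cv2.VideoCapture(path)
    try:
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        fps = cap.get(cv2.CAP_PROP_FPS)
    finally:
        cap.release()
    return frames, fps


if __name__ == "__main__":
    # The same 300 frames, written at the input's 30 fps vs. at 15 fps
    print(output_duration(300, 30.0))  # 10.0
    print(output_duration(300, 15.0))  # 20.0
```

Running `probe_video` on both the input and the 20 s output shows whether frames were duplicated or only the playback rate changed.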

skeletonNN commented 1 year ago

I ran the same video through your demo at [https://openxlab.org.cn/apps/detail/mmpose/RTMPose]: the input is a 10 s video and the output is a 20 s video. When I run opencv_onnx, the result is bad; the .pth result may be good, and I will test it. Does the demo at [https://openxlab.org.cn/apps/detail/mmpose/RTMPose] do any post-processing? Its results are good.

skeletonNN commented 1 year ago

The opencv_onnx result is bad:

[image: opencv_onnx result]

The demo result is perfect, and the output is 20 s long:

[image: demo result]
skeletonNN commented 1 year ago

So I want to know: does the demo do anything else? Or could it be because I run opencv_onnx on the CPU rather than the GPU?
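One way to test both guesses (a different backend, or CPU vs. GPU) is to run each pipeline on the same single image and compare the keypoint arrays numerically instead of by eye. A minimal sketch, assuming both runs return (N, 133, 3) arrays of [x, y, score] with persons in the same order; `compare_keypoints` and the 1-pixel tolerance are illustrative choices, not part of DWPose.

```python
import numpy as np


def compare_keypoints(kpts_a: np.ndarray, kpts_b: np.ndarray, atol: float = 1.0) -> dict:
    """Compare two (N, K, 3) arrays of [x, y, score] from two inference runs."""
    if kpts_a.shape != kpts_b.shape:
        return {"same_shape": False, "max_xy_diff": None, "max_score_diff": None, "close": False}
    if kpts_a.size == 0:
        # Both runs found nobody: trivially identical
        return {"same_shape": True, "max_xy_diff": 0.0, "max_score_diff": 0.0, "close": True}
    xy_diff = np.abs(kpts_a[..., :2] - kpts_b[..., :2])
    score_diff = np.abs(kpts_a[..., 2] - kpts_b[..., 2])
    return {
        "same_shape": True,
        "max_xy_diff": float(xy_diff.max()),
        "max_score_diff": float(score_diff.max()),
        # Tiny float differences between backends are normal; multi-pixel jumps are not
        "close": bool(np.all(xy_diff <= atol)),
    }
```

If a single image gives `close=True` for both devices but the video still looks wrong, the difference more likely lies in the frames or boxes fed to the model than in the device.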

yzd-v commented 1 year ago

How do you use opencv_onnx for inference?

skeletonNN commented 1 year ago

I use the ControlNet script:

import cv2
import numpy as np
# inference_pose is the DWPose ControlNet annotator helper (annotator/dwpose/onnxpose.py);
# adjust the import path to your checkout
from onnxpose import inference_pose


class Wholebody:
    def __init__(self, onnx_pose, device='cpu'):
        # OpenCV DNN takes a backend/target pair instead of ONNX Runtime providers
        backend = cv2.dnn.DNN_BACKEND_OPENCV if device == 'cpu' else cv2.dnn.DNN_BACKEND_CUDA
        target = cv2.dnn.DNN_TARGET_CPU if device == 'cpu' else cv2.dnn.DNN_TARGET_CUDA

        self.session_pose = cv2.dnn.readNetFromONNX(onnx_pose)
        self.session_pose.setPreferableBackend(backend)
        self.session_pose.setPreferableTarget(target)

    def __call__(self, oriImg, det_result):
        # det_result: (N, 4) person boxes; returns (N, 133, 3) = x, y, score per keypoint
        keypoints, scores = inference_pose(self.session_pose, det_result, oriImg)
        keypoints_info = np.concatenate(
            (keypoints, scores[..., None]), axis=-1)
        return keypoints_info


class DWposeDetector:
    def __init__(self, onnx_pose, device='cpu'):
        self.pose_estimation = Wholebody(onnx_pose, device)

    def __call__(self, oriImg, det_results):
        # Wholebody returns one array, not a (candidate, subset) pair
        return self.pose_estimation(oriImg.copy(), det_results)


class HandPoseModel(object):
    def __init__(self, ckpt_path, device='cpu'):
        self.dwprocessor = DWposeDetector(ckpt_path, device)

    def predict_hand_pose(self, image, det_results):
        # Scores are already appended as the last channel, so no second concatenate
        return self.dwprocessor(image, det_results)


if __name__ == "__main__":
    cpm = HandPoseModel('dwpose/dw-ll_ucoco_384.onnx', device='cuda')
    detector = Detection()  # the ViTDet wrapper posted below
    cap = cv2.VideoCapture("./test.mp4")
    fps = cap.get(cv2.CAP_PROP_FPS)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    for frame_id in range(num_frames):
        ret, frame = cap.read()
        if not ret:  # the reported frame count can exceed the decodable frames
            break

        pred_bbox = detector.get_detections(frame)
        joint2d = cpm.predict_hand_pose(frame, pred_bbox)
    cap.release()
yzd-v commented 1 year ago

You need to detect and crop the person with yolox first.
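DWPose is a top-down model: it estimates keypoints inside person boxes, so each detector box is first turned into a crop region. A minimal sketch of that step, assuming RTMPose-style preprocessing (box center, 1.25x padding, and enlarging the shorter side to match the pose model's input aspect ratio); `bbox_to_center_scale` is a hypothetical name, and the exact values used inside DWPose's own `inference_pose` may differ.

```python
import numpy as np


def bbox_to_center_scale(bbox, input_w: int, input_h: int, padding: float = 1.25):
    """Turn an [x1, y1, x2, y2] box into a crop center and (w, h) size.

    The box is padded, then its shorter side is enlarged so that w / h matches
    the pose model's input aspect ratio, so the crop is not distorted on resize.
    """
    x1, y1, x2, y2 = np.asarray(bbox, dtype=np.float64)
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    w, h = (x2 - x1) * padding, (y2 - y1) * padding
    aspect = input_w / input_h
    if w > h * aspect:
        h = w / aspect
    else:
        w = h * aspect
    return center, np.array([w, h])
```

The crop is then resized to the pose model's input size, so a box that is too tight, too loose, or in the wrong coordinate format distorts every keypoint predicted inside it.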

skeletonNN commented 1 year ago

I use ViTDet for detection:

from pathlib import Path

from detectron2.config import LazyConfig
# Assumption: DefaultPredictor_Lazy is the helper from 4D-Humans (hmr2.utils.utils_detectron2);
# adjust the import to your project
from hmr2.utils.utils_detectron2 import DefaultPredictor_Lazy


class Detection(object):
    def __init__(self):
        print("Loading Detection model...")
        # Plain strings cannot be joined with '/', so build the path with pathlib
        cfg_path = Path('configs') / 'cascade_mask_rcnn_vitdet_h_75ep.py'
        self.detectron2_cfg = LazyConfig.load(str(cfg_path))
        self.detectron2_cfg.train.init_checkpoint = 'https://dl.fbaipublicfiles.com/detectron2/ViTDet/COCO/cascade_mask_rcnn_vitdet_h/f328730692/model_final_f05665.pkl'
        # Cascade R-CNN has three box-predictor stages; set the score threshold on each
        for i in range(3):
            self.detectron2_cfg.model.roi_heads.box_predictors[i].test_score_thresh = 0.5
        self.detector = DefaultPredictor_Lazy(self.detectron2_cfg)

    def get_detections(self, image):
        outputs = self.detector(image)
        instances = outputs['instances']
        instances = instances[instances.pred_classes == 0]  # COCO class 0 = person
        instances = instances[instances.scores > 0.5]

        # (N, 4) person boxes in [x1, y1, x2, y2] pixel coordinates
        pred_bbox = instances.pred_boxes.tensor.cpu().numpy()
        return pred_bbox

    # Allow detector(frame) as well as detector.get_detections(frame)
    __call__ = get_detections
yzd-v commented 1 year ago

That looks strange, because others have used opencv_onnx and got good results. Try using yolox_onnx with our code to debug first; I suspect the problem is the detection bboxes.
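Since the detection boxes are the suspect, a quick sanity check on what the detector returns can rule them out before comparing whole pipelines. A minimal sketch, assuming the pose step expects pixel-space boxes as an (N, 4) array of [x1, y1, x2, y2] (the layout of detectron2's `Boxes.tensor`); `check_boxes` is a hypothetical helper.

```python
import numpy as np


def check_boxes(boxes, img_w: int, img_h: int) -> list:
    """Return human-readable problems found in an (N, 4) array of xyxy boxes."""
    boxes = np.asarray(boxes, dtype=np.float64)
    if boxes.size == 0:
        return ["no boxes: the pose model has nothing to crop"]
    if boxes.ndim != 2 or boxes.shape[1] != 4:
        return [f"expected shape (N, 4), got {boxes.shape}"]
    problems = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # x2 <= x1 usually means a different box format (e.g. xywh) or an empty box
        if x2 <= x1 or y2 <= y1:
            problems.append(f"box {i}: not xyxy or empty ({x1}, {y1}, {x2}, {y2})")
        if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
            problems.append(f"box {i}: extends outside the {img_w}x{img_h} image")
    return problems
```

Printing `check_boxes(pred_bbox, frame.shape[1], frame.shape[0])` for a few frames is usually enough to spot an empty detection or a mismatched coordinate format.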

skeletonNN commented 1 year ago

Hello, is the ONNX model fp16, fp32, or integer-quantized?

yzd-v commented 1 year ago

fp32

skeletonNN commented 1 year ago

Boo-hoo, I don't know why my result is different from yours.

[image]

[image]

I used the YOLOX detector. Could the way I read the video be wrong?

import os

import cv2
import numpy as np
import onnxruntime as ort
# Helpers from the DWPose ControlNet annotator (onnxdet.py / onnxpose.py);
# adjust the import path to your checkout
from onnxdet import inference_detector
from onnxpose import inference_pose
# vis_keypoints is the user's own drawing helper and is not shown here


class Wholebody:
    def __init__(self, device='cuda:0'):
        providers = ['CPUExecutionProvider'] if device == 'cpu' else ['CUDAExecutionProvider']

        onnx_det = 'dwpose/yolox_l.onnx'
        onnx_pose = 'dwpose/dw-ll_ucoco_384.onnx'

        self.session_det = ort.InferenceSession(path_or_bytes=onnx_det, providers=providers)
        self.session_pose = ort.InferenceSession(path_or_bytes=onnx_pose, providers=providers)

    def __call__(self, oriImg):
        # YOLOX person boxes, then top-down whole-body pose inside each box
        det_result = inference_detector(self.session_det, oriImg)
        keypoints, scores = inference_pose(self.session_pose, det_result, oriImg)
        # (N, 133, 2) + (N, 133) -> (N, 133, 3)
        return np.concatenate((keypoints, scores[..., None]), axis=-1)


class DWposeDetector:
    def __init__(self):
        self.pose_estimation = Wholebody()

    def __call__(self, oriImg):
        return self.pose_estimation(oriImg.copy())


class HandPoseModel(object):
    def __init__(self):
        self.dwprocessor = DWposeDetector()

    def predict_hand_pose(self, image):
        return self.dwprocessor(image)


# COCO-WholeBody body + foot keypoints (indices 0-22) placed into OpenPose BODY_25 slots;
# slots 1 (neck) and 8 (mid-hip) have no COCO counterpart and stay zero
BODY_ORDER = [0, 16, 15, 18, 17, 5, 2, 6, 3, 7, 4, 12, 9, 13, 10, 14, 11, 19, 20, 21, 22, 23, 24]

cap = cv2.VideoCapture("./test_hand3.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
os.makedirs("./box_visual", exist_ok=True)
cpm = HandPoseModel()
for frame_id in range(num_frames):
    ret, frame = cap.read()
    if not ret:
        break
    joint2d = cpm.predict_hand_pose(frame)
    for dwpose_out in joint2d[:1]:  # first detected person only
        dwpose_2d = np.zeros([67, 3])  # 25 body/foot + 2 x 21 hand keypoints
        dwpose_2d[BODY_ORDER] = dwpose_out[:23]
        dwpose_2d[25:] = dwpose_out[91:133]  # left hand (91-111), then right hand (112-132)
        img = vis_keypoints(dwpose_2d, (frame.shape[1], frame.shape[0]), image=frame, dataset='openpose')
        cv2.imwrite(f"./box_visual/{frame_id + 1}.jpg", img[:, :, :3])
cap.release()
skeletonNN commented 1 year ago

OK, I found the reason. Boo-hoo, I'm such a noob.

yzd-v commented 1 year ago

So what was the reason? ( ゚皿゚)

skeletonNN commented 1 year ago

Convert each frame with frame[:, :, ::-1] (BGR to RGB) before running the model.
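The fix works because OpenCV's `VideoCapture.read()` returns frames in BGR channel order, and reversing the last axis turns them into RGB, which is evidently what this pipeline expected. A minimal sketch of the conversion as a helper; `bgr_to_rgb` is a hypothetical name, and `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` is the equivalent OpenCV call.

```python
import numpy as np


def bgr_to_rgb(frame: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an (H, W, 3) BGR image to get RGB.

    frame[:, :, ::-1] on its own is a view with a negative stride; copying it
    into a contiguous array avoids errors in consumers that reject such views
    (for example torch.from_numpy).
    """
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError(f"expected an (H, W, 3) image, got {frame.shape}")
    return np.ascontiguousarray(frame[:, :, ::-1])
```

In the video loop posted earlier, that means calling `cpm.predict_hand_pose(bgr_to_rgb(frame))` while still drawing on and saving the original BGR `frame` with `cv2.imwrite`.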