Nicholasli1995 / EvoSkeleton

Official project website for the CVPR 2020 paper (Oral Presentation) "Cascaded deep monocular 3D human pose estimation with evolutionary training data"
https://arxiv.org/abs/2006.07778
MIT License

Inconsistency between pretrained HRNet 2D detector and twoDPose_HRN.npy #73

Closed. kyang-06 closed this issue 2 years ago

kyang-06 commented 2 years ago

Hi, thanks for your excellent work and the comprehensive release of technical details! I believe this is a great contribution to the 3D HPE field.

One question. I encountered some issues when running inference with the HRNet model (i.e., the 2D detector) loaded from the pretrained weights on cropped H36M images.

  1. The accuracy in the first screenshot shows that the average 2D error is around 7 pixels, which is inconsistent with the reported 4.4.

  2. Meanwhile, I printed the 2D pose prediction for the frame (9, 'Directions', 'Directions 1.54138969.h5-sh') from both my model's inference results and the released twoDPose_HRN_test.npy. The inconsistency appears again, as shown in the second uploaded image.

Could you help me resolve this unexpected behavior? Did I miss something, or could you release another, higher-accuracy pretrained HRNet model?

Many thanks !

[screenshots: 2D error log and prediction comparison]

Nicholasli1995 commented 2 years ago

The released model was the one used to generate the 2D predictions. How did you pre-process the images before feeding them to the model? What about sequences other than (9, 'Directions', 'Directions 1.54138969.h5-sh')? I think the difference may come from the way you crop the input patch.

kyang-06 commented 2 years ago

Thank you for the quick reply! Yes, the crop strategy is a confusing point for me; sorry for forgetting to mention it.

I first crop the 1000x1002 (or 1000x1000) image to the person patch using the ground-truth bounding box provided officially by H36M, and then resize it to 384x288 with a black border. Did I do it the right way?
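For reference, a minimal sketch of the crop-and-letterbox pre-processing described above (the function name is made up, the bounding box is assumed to be given as [[x0, y0], [x1, y1]] as in the dataset code posted further down, and whether the black border is centred or bottom/right-aligned is a guess):

```python
import cv2
import numpy as np

def crop_and_letterbox(image, bbox, out_h=384, out_w=288):
    """Crop the person patch with a ground-truth bbox, then resize it to
    out_h x out_w keeping the aspect ratio, padding the rest with black."""
    (x0, y0), (x1, y1) = bbox
    patch = image[y0:y1 + 1, x0:x1 + 1]

    # scale so the patch fits entirely inside the target resolution
    scale = min(out_h / patch.shape[0], out_w / patch.shape[1])
    new_h = int(round(patch.shape[0] * scale))
    new_w = int(round(patch.shape[1] * scale))
    resized = cv2.resize(patch, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    # paste into a black canvas (the "black border")
    canvas = np.zeros((out_h, out_w, 3), dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas, scale
```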

As for other sequences, I did not check the consistency. But since the mean 2D error is around 7 pixels on the whole test set, I guess they are in a similar situation.

kyang-06 commented 2 years ago

Hi, I visualized the inference results (image attached). It seems that even easy poses are predicted with quite high error (e.g., a T-pose is predicted with about 4 pixels of error).

The crop border size in the figure is similar to the ones you provided in the instructions, so I guess the issue is not caused by the cropping. On the other hand, as for the normalization, I scale the image from [0, 255] to [0, 1] and then normalize it with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].
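For completeness, that is the standard ImageNet normalization; a minimal sketch of it (assuming a torchvision-style transform is applied to the RGB patch) would be:

```python
import torchvision.transforms as transforms

# uint8 HWC in [0, 255] -> float CHW in [0, 1], then ImageNet mean/std
normalize = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```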

Any hint would be helpful :) Appreciated.

Nicholasli1995 commented 2 years ago

Your results do not seem correct to me either. Maybe you can paste your code here so that I can check your data pre-processing, model inference, and error calculation. Did you use the original code that uses an affine transformation warping? Showing your code can help me understand your process of "crop 1000x1002 (or 1000x1000) image into person bounding patch by ground truth bounding box provided by h36m official, then resize it into 384x288 with black border."

kyang-06 commented 2 years ago

Many thanks for the continuous follow-up on this issue.

I found one bug: previously I forgot to run cv2.cvtColor(img, cv2.COLOR_BGR2RGB) when loading the image samples. After fixing it, I get a 2D error of 5.85 pixels, which is still worse than 4.4 but tolerable, as shown in the screenshot below. [screenshot]

The result visualization now looks like the figure below. Does it look normal to you? [visualization figure]

Data processing:

    def __getitem__(self, idx):
        image_file, db = tuple(self._db[idx])
        joints_3d = db['joint_label']
        bbox = db['bbox_2d'].astype(int)

        # cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION
        data_numpy = cv2.imread(image_file, 1 | 128)

        if data_numpy is None:
            logger.error('=> fail to read {}'.format(image_file))
            raise ValueError('Fail to read {}'.format(image_file))

        # OpenCV loads BGR; the pretrained HRNet expects RGB
        data_numpy = cv2.cvtColor(data_numpy, cv2.COLOR_BGR2RGB)

        #### Here: 1000x1000 -> cropped image with ground-truth bbox
        data_numpy = data_numpy[bbox[0,1]:bbox[1,1]+1, bbox[0,0]:bbox[1,0]+1]

        img_trans_mat = np.eye(3)
        img_trans_mat[:2, -1] = -bbox[0]

        joints_2d = db['joint_2d']
        joints_original = joints_2d.copy()
        joints_2d = joints_2d - bbox[0]
        joints_vis = np.ones(joints_2d.shape, dtype=np.float32)
        c, s = self._xywh2cs(0, 0, data_numpy.shape[1], data_numpy.shape[0])
        score = 1
        r = 0

        trans = get_affine_transform(c, s, r, self.image_size)
        input = cv2.warpAffine(
            data_numpy,
            trans,
            (int(self.image_size[0]), int(self.image_size[1])),
            flags=cv2.INTER_LINEAR)

        img_trans_mat = np.matmul(trans, img_trans_mat)
        img_trans_mat = np.concatenate([img_trans_mat, np.array([[0,0,1]])])

        if self.transform:
            input = self.transform(input)

        for i in range(self.num_joints):
            if joints_vis[i, 0] > 0.0:
                joints_2d[i, 0:2] = affine_transform(joints_2d[i, 0:2], trans)
                # mark joints as invisible if they fall outside the image
                if joints_2d[i, 0] >= self.image_width or joints_2d[i, 1] >= self.image_height:
                    joints_vis[i, 0] = 0.0

        target, target_weight = self.generate_target(joints_2d, joints_vis)

        target = torch.from_numpy(target)
        target_weight = torch.from_numpy(target_weight)

        meta = {
            'image': image_file,
            'joints_2d': joints_2d,
            'joints_vis': joints_vis,
            'j_original_2d': joints_original,  # original coordinates
            'joints_3d': joints_3d,
            'center': c,
            'scale': s,
            'rotation': r,
            'score': score,
            'trans': img_trans_mat,     # 3x3
            'trans_inv': np.linalg.inv(img_trans_mat),
            'bbox': bbox
        }

        return input, target, target_weight, meta
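One way to sanity-check the transform chain above is to round-trip the original 2D joints through meta['trans'] and meta['trans_inv']. A minimal sketch, assuming it runs on a single sample returned directly by __getitem__ (i.e., before DataLoader collation turns the entries into tensors):

```python
import numpy as np

def check_round_trip(meta, atol=1e-3):
    """Map original-image joints into the patch with meta['trans'],
    then back with meta['trans_inv'], and verify they round-trip."""
    joints = meta['j_original_2d'].astype(np.float64)  # (J, 2), original image coords
    homog = np.concatenate([joints, np.ones((joints.shape[0], 1))], axis=1)  # (J, 3)

    in_patch = (meta['trans'] @ homog.T).T             # (J, 3), patch coords (last col == 1)
    # in_patch[:, :2] should also match meta['joints_2d'] for visible joints
    back = (meta['trans_inv'] @ in_patch.T).T[:, :2]   # back to original coords
    assert np.allclose(back, joints, atol=atol), 'transform chain is inconsistent'
```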

Evaluation (I want to jointly train 2D+3D):

def validate(config, val_loader, val_dataset, model, criterion, output_dir,
                   tb_log_dir, writer_dict=None, total_batches=-1, save=False, split=None):
    batch_time = AverageMeter()
    losses = AverageMeter()
    acc = AverageMeter()
    error = AverageMeter()

    # switch to evaluate mode
    model.eval()

    num_iters = 0
    with torch.no_grad():
        end = time.time()
        for i, (input, target, target_weight, meta) in enumerate(val_loader):
            num_iters += 1
            if total_batches > 0 and num_iters > total_batches and not save:
                break
            batch_size = len(input)
            # compute output
            out_kpt_3d, out_kpt_2d, outputs, out_kpt_2d_orig = model(input, img_trans_mat_inv=meta['trans_inv'].float().cuda(), kpt_2d_gt=meta['j_original_2d'].float().cuda())

            if isinstance(outputs, list):
                output = outputs[-1]
            else:
                output = outputs

            target = target.cuda(non_blocking=True)
            target_weight = target_weight.cuda(non_blocking=True)
            target_3d = meta['joints_3d'].float().cuda() / 1.e3
            loss_3d = criterion['3d'](out_kpt_3d, target_3d)
            loss = loss_3d

            num_images = input.size(0)
            # measure accuracy and record loss
            losses.update(loss.item(), num_images)
            avg_acc = torch.norm(out_kpt_2d - meta['joints_2d'].cuda().float(), dim=-1).mean()
            acc.update(avg_acc, batch_size)

            err_cur = torch.norm((out_kpt_3d - target_3d) * 1e3, dim=-1).mean()
            error.update(err_cur, batch_size)

            batch_time.update(time.time() - end)
            end = time.time()
            if i % config.PRINT_FREQ == 0 or (i+1) == len(val_loader):
                msg = 'Test: [{0}/{1}]\t' \
                      'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t' \
                      'Loss {loss.val:.4f} ({loss.avg:.4f})\t' \
                      'Accuracy {acc.val:.3f} ({acc.avg:.3f})\t Error ({error.avg:.3f})'.format(
                    i, len(val_loader), batch_time=batch_time, acc=acc,
                    loss=losses, error=error)
                logger.info(msg)

                prefix = '{}_{}'.format(
                    os.path.join(output_dir, 'val'), i
                )

Model inference:


class JointTrainingModel(nn.Module):
    def __init__(self, cfg, is_train, **kwargs):
        super(JointTrainingModel, self).__init__()
        self.model_2d = PoseHighResolutionNet(cfg, **kwargs)
        self.model_3d = MyLiftingModel(cfg, **kwargs)
        self.re_order = [3, 12, 14, 16, 11, 13, 15, 1, 2, 0, 4, 5, 7, 9, 6, 8, 10]

        if is_train and cfg.MODEL.INIT_WEIGHTS:
            self.model_2d.init_weights(cfg.MODEL.PRETRAINED_HRNET)
            self.model_3d.init_weights(cfg.MODEL.PRETRAINED_LIFTING)

    def forward(self, x, img_trans_mat_inv, kpt_2d_gt=None):  # kpt_2d_gt accepted to match the call in validate(), unused here
        output_heatmaps = self.model_2d(x)
        kpt_2d, maxvals = get_max_preds_soft_pt(output_heatmaps)
        kpt_2d = kpt_2d[:, self.re_order]
        ### Here: 2D coordinate 384x288 -> 1000x1000
        kpt_2d_original = torch.bmm(img_trans_mat_inv, torch.nn.functional.pad(kpt_2d, (0,1), mode='constant', value=1.).transpose(-2, -1)).transpose(-2, -1)[:, :, :2]
        kpt_2d_normalized = (kpt_2d_original - 500.) / 500.
        out_kpt_3d = self.model_3d(kpt_2d_normalized)

        return out_kpt_3d, kpt_2d, output_heatmaps, kpt_2d_original
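As a small aside on the (kpt_2d_original - 500.) / 500. line: it maps pixel coordinates of a 1000x1000 frame into roughly [-1, 1], e.g.:

```python
import torch

coords = torch.tensor([[0., 0.], [500., 500.], [1000., 1000.]])
print((coords - 500.) / 500.)
# tensor([[-1., -1.],
#         [ 0.,  0.],
#         [ 1.,  1.]])
```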

Thanks in advance for any possible reason that comes to mind. :)

Nicholasli1995 commented 2 years ago

Hi, I notice you are computing joint distance in the local patch: avg_acc = torch.norm(out_kpt_2d - meta['joints_2d'].cuda().float(), dim=-1).mean()

In contrast, I compute such distances in the original image before affine transformation: https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/libs/hhr/core/evaluate.py#L120

Please use consistent code for evaluation.
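A minimal sketch of that kind of evaluation (measuring the per-joint pixel distance in the original image rather than in the 384x288 patch; the helper name and argument shapes here are assumptions, not the repository's API):

```python
import torch

def pixel_error_in_original_image(kpt_2d_patch, trans_inv, joints_original):
    """kpt_2d_patch:    (B, J, 2) predictions in patch coordinates
    trans_inv:          (B, 3, 3) inverse crop/affine transforms
    joints_original:    (B, J, 2) ground truth in original image coordinates"""
    ones = torch.ones_like(kpt_2d_patch[..., :1])
    homog = torch.cat([kpt_2d_patch, ones], dim=-1)            # (B, J, 3)
    kpt_2d_orig = torch.bmm(trans_inv, homog.transpose(1, 2))  # (B, 3, J)
    kpt_2d_orig = kpt_2d_orig.transpose(1, 2)[..., :2]         # (B, J, 2)
    return torch.norm(kpt_2d_orig - joints_original, dim=-1).mean()
```

Since the forward pass posted above already returns kpt_2d_original, an equivalent shortcut would be to compare out_kpt_2d_orig directly against meta['j_original_2d'].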

kyang-06 commented 2 years ago

Thank you for the patient help! I will give it a try. This issue is mostly solved, so I will close it.