kyang-06 closed this issue 2 years ago
Hi, thank you for the excellent work and the comprehensive release of technical details! I believe this is a great contribution to the 3D HPE field.
One question: I ran into issues when running inference with the HRNet model (the 2D detector) loaded with the pretrained weights, given cropped H36M images.
- The accuracy in the first screenshot shows that the average 2D error is around 7 pixels, which is inconsistent with the reported 4.4.
- I also printed the 2D pose prediction for the frame (9, 'Directions', 'Directions 1.54138969.h5-sh') from both the model inference result and the released twoDPose_HRN_test.npy. The two are again inconsistent, as shown in the second uploaded image.
Could you help me resolve this discrepancy? Did I miss something, or could you release another, higher-accuracy pretrained HRNet model?
Many thanks!
The released model was the one used to generate the 2D predictions. How did you pre-process the images before feeding them to the model? What about sequences other than (9, 'Directions', 'Directions 1.54138969.h5-sh')? I think the difference may come from the way you crop the input patch.
Thank you for the quick reply! Yes, the cropping strategy is a point of confusion for me. Sorry for forgetting to mention it.
I first crop the 1000x1002 (or 1000x1000) image to the person bounding box, using the ground-truth bounding box officially provided by H36M, then resize the patch to 384x288 with a black border. Is this the right way?
As for other sequences, I did not check their consistency. But since the mean 2D error is around 7 pixels over the whole test set, I guess they are in a similar situation.
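For readers following the cropping question, here is a minimal sketch of one common way to get a border instead of a distorted person: grow the ground-truth box around its center until it matches the aspect ratio of the network input (288:384, width:height) before resizing. The helper name `pad_bbox_to_aspect` is hypothetical, and this is an illustration under those assumptions, not necessarily what the released code does.

```python
def pad_bbox_to_aspect(x, y, w, h, aspect):
    """Grow an (x, y, w, h) box around its center until w / h == aspect.

    Hypothetical helper for illustration; `aspect` is width / height of the
    network input, e.g. 288 / 384 for a 384x288 (height x width) patch.
    """
    cx, cy = x + 0.5 * w, y + 0.5 * h
    if w > aspect * h:      # box too wide: grow the height
        h = w / aspect
    elif w < aspect * h:    # box too tall: grow the width
        w = aspect * h
    return cx - 0.5 * w, cy - 0.5 * h, w, h


# A tall 100x400 box is widened to 300x400 around the same center.
box = pad_bbox_to_aspect(0.0, 0.0, 100.0, 400.0, 288 / 384)
print(box)
```

The grown box can extend past the image boundary (note the negative x here); those out-of-image pixels are what end up as the black border after warping.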
Hi, I visualized the inference results.
It seems that even easy poses are predicted with quite a high error (e.g. a T-pose prediction with a 4-pixel error).
The crop border size in the figure is similar to the ones you provide in the instructions, so I guess the issue is not caused by cropping. As for the normalization, I scale the image from [0, 255] to [0, 1] and then normalize it with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].
Any hint would be helpful :) Much appreciated.
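The normalization described above can be sketched in NumPy as follows. This is an illustrative stand-in, assuming an HxWx3 uint8 image already in RGB order; the function name `normalize_image` is hypothetical and the poster's actual pipeline may use a different transform implementation.

```python
import numpy as np

# Mean and std quoted in the comment above, in RGB channel order.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(img_rgb_uint8):
    """Scale an HxWx3 uint8 RGB image to [0, 1], then standardize per channel."""
    img = img_rgb_uint8.astype(np.float32) / 255.0
    return (img - MEAN) / STD


patch = np.full((384, 288, 3), 255, dtype=np.uint8)  # dummy all-white patch
out = normalize_image(patch)
print(out.shape, out[0, 0])
```

Because the mean and std are per channel, feeding a BGR image into this step silently applies the wrong statistics to the red and blue channels.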
Your results do not seem correct to me either. Maybe you can paste your code here so that I can check your data pre-processing, model inference, and error calculation. Did you use the original code, which warps the image with an affine transformation? Showing your code would help me understand how you crop the image to the person bounding box and resize it to 384x288 with a black border.
Many thanks for continuing to follow up on this issue.
I found one bug: I had previously forgotten to run cv2.cvtColor(img, cv2.COLOR_BGR2RGB) when loading image samples.
After fixing the bug, I get a 2D error of 5.85, still worse than 4.4 but tolerable, as shown in the screenshot below:
The visualized results now look like the figure below. Do they look normal to you?
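On the channel-order bug mentioned above: `cv2.imread` returns pixels in BGR order, so an RGB-trained model sees swapped red and blue channels unless the image is converted. The sketch below illustrates the swap on a synthetic pixel using plain NumPy; for a 3-channel image, reversing the last axis gives the same result as `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)`.

```python
import numpy as np

# A single pure-blue pixel as OpenCV would load it (B, G, R order).
img_bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reverse the channel axis to get RGB; copy() makes the array contiguous.
img_rgb = img_bgr[..., ::-1].copy()

print(img_rgb[0, 0])  # blue now sits in the last channel
```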
Data processing:
    def __getitem__(self, idx):
        image_file, db = tuple(self._db[idx])
        joints_3d = db['joint_label']
        bbox = db['bbox_2d'].astype(int)
        # Same flags as the original "1 | 128".
        data_numpy = cv2.imread(
            image_file, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
        # Check for a failed read before converting; cv2.cvtColor raises on None.
        if data_numpy is None:
            logger.error('=> fail to read {}'.format(image_file))
            raise ValueError('Fail to read {}'.format(image_file))
        data_numpy = cv2.cvtColor(data_numpy, cv2.COLOR_BGR2RGB)
        #### Here: 1000x1000 -> cropped image with ground-truth bbox
        data_numpy = data_numpy[bbox[0,1]:bbox[1,1]+1, bbox[0,0]:bbox[1,0]+1]
        # Accumulate the original -> patch transform, starting with the crop offset.
        img_trans_mat = np.eye(3)
        img_trans_mat[:2, -1] = -bbox[0]
        joints_2d = db['joint_2d']
        joints_original = joints_2d.copy()
        joints_2d = joints_2d - bbox[0]
        joints_vis = np.ones(joints_2d.shape, dtype=np.float32)
        c, s = self._xywh2cs(0, 0, data_numpy.shape[1], data_numpy.shape[0])
        score = 1
        r = 0
        trans = get_affine_transform(c, s, r, self.image_size)
        input = cv2.warpAffine(
            data_numpy,
            trans,
            (int(self.image_size[0]), int(self.image_size[1])),
            flags=cv2.INTER_LINEAR)
        img_trans_mat = np.matmul(trans, img_trans_mat)
        img_trans_mat = np.concatenate([img_trans_mat, np.array([[0,0,1]])])
        if self.transform:
            input = self.transform(input)
        for i in range(self.num_joints):
            if joints_vis[i, 0] > 0.0:
                joints_2d[i, 0:2] = affine_transform(joints_2d[i, 0:2], trans)
                # set joints to in-visible if they are out-side of the image
                if joints_2d[i, 0] >= self.image_width or joints_2d[i, 1] >= self.image_height:
                    joints_vis[i, 0] = 0.0
        target, target_weight = self.generate_target(joints_2d, joints_vis)
        target = torch.from_numpy(target)
        target_weight = torch.from_numpy(target_weight)
        meta = {
            'image': image_file,
            'joints_2d': joints_2d,
            'joints_vis': joints_vis,
            'j_original_2d': joints_original, # original coordinates
            'joints_3d': joints_3d,
            'center': c,
            'scale': s,
            'rotation': r,
            'score': score,
            'trans': img_trans_mat, # 3x3
            'trans_inv': np.linalg.inv(img_trans_mat),
            'bbox': bbox
        }
        return input, target, target_weight, meta
Evaluation (I want to jointly train 2D+3D):
    def validate(config, val_loader, val_dataset, model, criterion, output_dir,
                 tb_log_dir, writer_dict=None, total_batches=-1, save=False, split=None):
        batch_time = AverageMeter()
        losses = AverageMeter()
        acc = AverageMeter()
        error = AverageMeter()
        # switch to evaluate mode
        model.eval()
        num_iters = 0
        with torch.no_grad():
            end = time.time()
            for i, (input, target, target_weight, meta) in enumerate(val_loader):
                num_iters += 1
                if total_batches > 0 and num_iters > total_batches and not save:
                    break
                batch_size = len(input)
                # compute output
                out_kpt_3d, out_kpt_2d, outputs, out_kpt_2d_orig = model(
                    input,
                    img_trans_mat_inv=meta['trans_inv'].float().cuda(),
                    kpt_2d_gt=meta['j_original_2d'].float().cuda())
                if isinstance(outputs, list):
                    output = outputs[-1]
                else:
                    output = outputs
                target = target.cuda(non_blocking=True)
                target_weight = target_weight.cuda(non_blocking=True)
                target_3d = meta['joints_3d'].float().cuda() / 1.e3
                loss_3d = criterion['3d'](out_kpt_3d, target_3d)
                loss = loss_3d
                num_images = input.size(0)
                # measure accuracy and record loss
                losses.update(loss.item(), num_images)
                avg_acc = torch.norm(out_kpt_2d - meta['joints_2d'].cuda().float(), dim=-1).mean()
                acc.update(avg_acc, batch_size)
                err_cur = torch.norm((out_kpt_3d - target_3d) * 1e3, dim=-1).mean()
                error.update(err_cur, batch_size)
                batch_time.update(time.time() - end)
                end = time.time()
                if i % config.PRINT_FREQ == 0 or (i+1) == len(val_loader):
                    msg = 'Test: [{0}/{1}]\t' \
                          'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t' \
                          'Loss {loss.val:.4f} ({loss.avg:.4f})\t' \
                          'Accuracy {acc.val:.3f} ({acc.avg:.3f})\t Error ({error.avg:.3f})'.format(
                              i, len(val_loader), batch_time=batch_time, acc=acc,
                              loss=losses, error=error)
                    logger.info(msg)
                    prefix = '{}_{}'.format(
                        os.path.join(output_dir, 'val'), i
                    )
Model inference:
    class JointTrainingModel(nn.Module):
        def __init__(self, cfg, is_train, **kwargs):
            super(JointTrainingModel, self).__init__()
            self.model_2d = PoseHighResolutionNet(cfg, **kwargs)
            self.model_3d = MyLiftingModel(cfg, **kwargs)
            self.re_order = [3, 12, 14, 16, 11, 13, 15, 1, 2, 0, 4, 5, 7, 9, 6, 8, 10]
            if is_train and cfg.MODEL.INIT_WEIGHTS:
                self.model_2d.init_weights(cfg.MODEL.PRETRAINED_HRNET)
                self.model_3d.init_weights(cfg.MODEL.PRETRAINED_LIFTING)

        # kpt_2d_gt is accepted because validate() passes it; it is unused here.
        def forward(self, x, img_trans_mat_inv, kpt_2d_gt=None):
            output_heatmaps = self.model_2d(x)
            kpt_2d, maxvals = get_max_preds_soft_pt(output_heatmaps)
            kpt_2d = kpt_2d[:, self.re_order]
            ### Here: 2D coordinate 384x288 -> 1000x1000
            # Append a homogeneous 1 to each joint, then apply the inverse transform.
            kpt_2d_homo = torch.nn.functional.pad(kpt_2d, (0,1), mode='constant', value=1.)
            kpt_2d_original = torch.bmm(img_trans_mat_inv, kpt_2d_homo.transpose(-2, -1)).transpose(-2, -1)[:, :, :2]
            kpt_2d_normalized = (kpt_2d_original - 500.) / 500.
            out_kpt_3d = self.model_3d(kpt_2d_normalized)
            return out_kpt_3d, kpt_2d, output_heatmaps, kpt_2d_original
Thanks in advance for any possible cause that comes to mind. :)
Hi, I notice you are computing joint distance in the local patch: avg_acc = torch.norm(out_kpt_2d - meta['joints_2d'].cuda().float(), dim=-1).mean()
In contrast, I compute such distances in the original image before affine transformation: https://github.com/Nicholasli1995/EvoSkeleton/blob/b2b355f4c1fa842709f100d931189ce80008f6ef/libs/hhr/core/evaluate.py#L120
Please use consistent code for evaluation.
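To make the distinction concrete, here is a minimal NumPy sketch (not the linked evaluation code) showing that the same prediction gives a different pixel error in the patch than in the original image, because the crop's resize factor rescales all distances. The helper names `to_original` and `mean_joint_error` and the transform values are hypothetical.

```python
import numpy as np


def to_original(kpts_patch, trans):
    """Map Nx2 patch coordinates back through the inverse of a 2x3 affine."""
    full = np.vstack([trans, [0.0, 0.0, 1.0]])        # 3x3 homogeneous matrix
    inv = np.linalg.inv(full)
    homo = np.hstack([kpts_patch, np.ones((len(kpts_patch), 1))])
    return (homo @ inv.T)[:, :2]


def mean_joint_error(pred, gt):
    """Mean Euclidean distance over joints, in the units of the inputs."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Hypothetical crop: patch = 0.5 * original - (100, 50), i.e. a 2x downscale.
trans = np.array([[0.5, 0.0, -100.0],
                  [0.0, 0.5, -50.0]])
gt_patch = np.array([[10.0, 20.0], [30.0, 40.0]])
pred_patch = gt_patch + np.array([3.0, 4.0])          # 5 px off in the patch

err_patch = mean_joint_error(pred_patch, gt_patch)
err_orig = mean_joint_error(to_original(pred_patch, trans),
                            to_original(gt_patch, trans))
print(err_patch, err_orig)
```

In this toy case the patch error is 5 pixels while the error in the original image is 10 pixels, so numbers measured in the two frames are not directly comparable.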
Thank you for the patient help! I will give it a try. The issue is mostly solved, so I am closing it.