nitba closed this issue 2 years ago.
Hi @zc-alexfan
Do you have any idea about my question?
https://github.com/NVlabs/dex-ycb-toolkit/issues/3#issue-897096306
Sorry, I did not use it.
Hi @umariqb,
I would appreciate your comments on my issue.
Answers to your questions:
Yes, we followed [31] for the baselines reported in Tab. 7. Therefore, yes, the input to the network is a 128 x 128 RGB image cropped around the bounding box.
We report absolute error by computing the 3D distance between the ground-truth and predicted joint positions.
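For reference, a minimal sketch of this error computation, assuming `pred` and `gt` hold the 21 predicted and ground-truth 3D joint positions in the same metric unit (the names are illustrative, not from the toolkit):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean per-joint 3D distance between predicted and ground-truth joints.

    pred, gt: (21, 3) arrays of joint positions in the same unit (e.g. mm).
    """
    # Euclidean distance per joint, averaged over the 21 joints.
    return np.linalg.norm(pred - gt, axis=1).mean()
```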
Two additional comments regarding [31] and absolute error:
[31] predicts a special "2.5D representation" (see [18]) for the hand pose, and then uses the bounding box coordinates and the camera intrinsics to convert this "2.5D representation" to the 3D pose. That said, the task of the network is only to predict (1) the 2D locations of keypoints within the input image and (2) the root-relative depths of keypoints (see [18]), which is reasonable for a cropped image input. With this "2.5D representation", converting to 3D pose is well-posed (see [18]).
With that said, the input for this benchmark (for RGB-only) should really be (1) the full RGB image, (2) the bounding box coordinates, and (3) the camera intrinsics---not just the cropped image itself. You can see [31] as using (1) and (2) to get the input of their network (i.e., the cropped image).
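To make the conversion concrete, here is a rough sketch of back-projecting keypoints predicted in crop coordinates to 3D camera coordinates, assuming the absolute depth of each joint is already available (the actual procedure in [18]/[31] first recovers the root depth from the scale-normalized 2.5D representation); the function and variable names below are illustrative, not from the toolkit:

```python
import numpy as np

def backproject_keypoints(uv_crop, z, bbox_xy, crop_scale, K):
    """Back-project 2D keypoints predicted in a cropped image to 3D camera coordinates.

    uv_crop:    (21, 2) keypoint pixel coordinates in the 128 x 128 crop.
    z:          (21,) absolute depth per joint (e.g. root depth + root-relative depths).
    bbox_xy:    (2,) top-left corner of the square crop in the full image.
    crop_scale: ratio between the crop's side length in the full image and 128.
    K:          (3, 3) camera intrinsic matrix.
    """
    # Map keypoints from crop coordinates back to full-image pixel coordinates.
    uv_full = uv_crop * crop_scale + bbox_xy
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (uv_full[:, 0] - cx) * z / fx
    y = (uv_full[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```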
Thanks for your comments, @ychao-nvidia.
The paper did not mention that, for each test sample, you assume access to the GT bounding box coordinates and camera intrinsics when reporting the absolute errors.
Regarding the 100 mm absolute error: I ran my experiments assuming that I do not have bounding boxes for the test samples!
Hi @ychao-nvidia, I'd like to compare my results on the DexYCB dataset to your results (Table 7, 3D hand pose estimation).
It seems the ground-truth bounding box for the hand is not available at test time,
so what bounding box did you use to crop the hand image at test time (e.g., one derived from the 2D joint coordinates, or a detected bounding box)?
As mentioned above, for hand pose estimation we assume the bounding box is given at test time.
We calculated a tight bounding box as [min(X), min(Y), max(X), max(Y)], where X (Y) are the 2D x (y) coordinates of the 21 hand joints provided in the ground truths. We then cropped a square image region (1) whose center coincides with the center of this tight bounding box and (2) whose side length is l, where 0.7*l = max(w, h) and w and h are the width and height of the tight bounding box. We used this cropped image region as the input to our network.
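For concreteness, a minimal sketch of this cropping rule, assuming `joints_2d` holds the ground-truth 2D joint coordinates (the function name is illustrative, not from the toolkit):

```python
import numpy as np

def hand_crop_box(joints_2d):
    """Square crop around the hand following the rule described above.

    joints_2d: (21, 2) array of ground-truth 2D joint coordinates (x, y) in pixels.
    Returns the square crop as (x_min, y_min, x_max, y_max).
    """
    # Tight bounding box from the 21 ground-truth 2D joints.
    x_min, y_min = joints_2d.min(axis=0)
    x_max, y_max = joints_2d.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    # Side length l such that 0.7 * l = max(w, h).
    l = max(w, h) / 0.7
    # Square region sharing the tight box's center.
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return (cx - l / 2.0, cy - l / 2.0, cx + l / 2.0, cy + l / 2.0)
```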
Thanks for the clear comments!
Thanks for your great work, @noirmist @ychao-nvidia. Using your dataset, I could not reproduce the results of your baseline (a supervised version of Spurr et al. [31]) on DexYCB. I would be thankful for your clarification.
1) In the paper of [31], it is mentioned that the input to the baseline [31] is a 128 × 128 RGB image cropped around the bounding box. Did you do it the same way? (i.e., for Table 7, are cropped images the input to the baseline?)
2) If the answer to Q1 is "yes", how did you report the "absolute error"? A reported absolute error is only meaningful when the input image is not cropped!
3) My absolute error is around 100 mm (input images are not cropped around the bounding box), and I have no idea how you reached 50 mm.