bardiadoosti / HOPE

Source code of CVPR 2020 paper, "HOPE-Net: A Graph-based Model for Hand-Object Pose Estimation"

Intrinsic/Extrinsic parameters between different datasets #22

Open · hedjm opened this issue 4 years ago

hedjm commented 4 years ago

Thank you for this great work. In your comment here (https://github.com/bardiadoosti/HOPE/issues/15#issuecomment-653090891) you said, "Here the Adaptive Graph U-Net is exactly learning this transformation for a very specific camera and angle condition." However, in the last paragraph of the introduction of the paper, you mention that you pre-trained the 2D-to-3D GraphUNet on synthetic data (ObMan), which has totally different intrinsic/extrinsic parameters. Would you please clarify this?
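To make the concern concrete, here is a minimal pinhole-projection sketch showing why a learned 2D-to-3D lifting is camera-specific: the same 3D point lands on different pixels under different intrinsics. The two intrinsic matrices below are made-up illustrative values, not the actual ObMan or FPAD parameters.

```python
import numpy as np

# Hypothetical intrinsics for two cameras (illustrative values only).
K_a = np.array([[1400.0, 0.0, 960.0],
                [0.0, 1400.0, 540.0],
                [0.0, 0.0, 1.0]])
K_b = np.array([[480.0, 0.0, 128.0],
                [0.0, 480.0, 128.0],
                [0.0, 0.0, 1.0]])

def project(points_3d, K):
    """Pinhole projection: 3D camera-space points (N, 3) -> 2D pixels (N, 2)."""
    uvw = (K @ points_3d.T).T          # (N, 3) homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth

p = np.array([[0.05, -0.02, 0.60]])    # one 3D point, 60 cm in front of the camera
print(project(p, K_a))                 # different pixel coordinates for the
print(project(p, K_b))                 # same 3D point under each camera
```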

Thank you again for your work.

bardiadoosti commented 4 years ago

Hi Mohamed, yes, GraphUNet's performance is conditioned on the camera parameters of a particular dataset, and a model trained on one dataset will not work on another dataset with different conditions. But note that pre-training does not have to use a dataset with exactly the same conditions. First, ObMan can be helpful because of transfer learning: in transfer learning, the model is trained with a different objective that does not directly relate to the desired task (e.g. using a model pre-trained on ImageNet for a completely different task). Also, in this case ObMan mostly helped to pre-train the image encoder and gave us better initial 2D estimations. Of course, it may be helpful in the graph parts as well.
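As a rough illustration of that kind of pre-training (a sketch only; ResNet-18 and the checkpoint filename are stand-ins, not the repository's actual encoder or files):

```python
import torch
import torchvision.models as models

# Stand-in image encoder; HOPE's actual encoder may differ.
encoder = models.resnet18(weights=None)

# Load weights produced by pre-training on synthetic data.
# 'obman_pretrained.pth' is a placeholder checkpoint name.
state = torch.load("obman_pretrained.pth", map_location="cpu")
encoder.load_state_dict(state, strict=False)  # tolerate head/shape mismatches

# ...then continue training the full model on the target dataset, so the
# encoder starts from features learned on ObMan rather than from random
# initialization.
```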

hedjm commented 4 years ago

@bardiadoosti Thank you for replying to my comment.

I was referring to this paragraph in the introduction: "we are not limited to training on only annotated real images, but can instead pretrain the 2D to 3D network separately with synthetic images rendered from 3D meshes of hands interacting with objects (e.g. ObMan dataset [10])." What I disagree with in this sentence is that it does not mention anything about pre-training the image encoder; what I understand is that you are only pre-training the 2D-to-3D network (GraphUNet). Also, when we fine-tune we need to be careful about choosing the learning rate, freezing some layers' weights, etc. (see the sketch below). Furthermore, as far as I know, the 2D keypoints in the ObMan dataset are located in images of size 256x256, which is much smaller than the images in the FPAD dataset (and you are not rescaling the 2D keypoints to 224x224 in your case). Anyway, I am very interested in your amazing work, and I think it opens many opportunities for further improvements.
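For illustration, a hedged sketch of the fine-tuning care mentioned above (freezing early layers, per-group learning rates) plus the coordinate rescaling between image sizes. ResNet-18, the layer choices, the learning rates, and the keypoint count are assumptions for the sketch, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in backbone, not necessarily the one used in HOPE.
backbone = models.resnet18(weights=None)

# Freeze the first two residual stages; low-level features usually
# transfer well across datasets.
for module in (backbone.conv1, backbone.bn1, backbone.layer1, backbone.layer2):
    for p in module.parameters():
        p.requires_grad = False

# New task head (e.g. 21 hand keypoints in 2D; the count is an assumption).
head = nn.Linear(backbone.fc.in_features, 21 * 2)
backbone.fc = head

# Small learning rate for pre-trained stages, larger for the fresh head.
optimizer = torch.optim.Adam([
    {"params": backbone.layer3.parameters(), "lr": 1e-5},
    {"params": backbone.layer4.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

def rescale_keypoints(kp, src_size=256, dst_size=224):
    """Map pixel coordinates between image resolutions, e.g. from a
    256x256 ObMan crop to a 224x224 network input."""
    return kp * (dst_size / src_size)
```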

Thank you.