akanazawa / hmr

Project page for End-to-end Recovery of Human Shape and Pose

How is the paired/unpaired setting defined? #132

Closed russoale closed 4 years ago

russoale commented 4 years ago

Hi @akanazawa,

thanks for the great work. I'm currently working on a TF 2.0 implementation of your work as a Keras model. While evaluating, I am not sure which training setting I actually have to use in order to compare performance with the results published in the paper.

In Section 3 (Model) of the paper, the unpaired setting is defined as:

Additionally we assume that there is a pool of 3D meshes of human bodies of varying shape and pose. Since these meshes do not necessarily have a corresponding image, we refer to this data as unpaired [55].

But later, in Section 4.3 (Without Paired 3D Supervision), it says:

So far we have used paired 2D-to-3D supervision, i.e. L3D whenever available. Here we evaluate a model trained without any paired 3D supervision. We refer to this setting as HMR unpaired and report numerical results in all the tables.

Could you please clarify this?

Thanks

akanazawa commented 4 years ago

Hi,

Hope this clears it up:

Unpaired: no ground-truth 3D is used as direct supervision. The pool of 3D meshes enters only through the adversarial prior (the discriminator), so the direct 3D loss L3D is never applied.

Paired: the direct 3D loss L3D is applied whenever a sample has ground-truth 3D annotations, in addition to the 2D keypoint loss and the adversarial prior.

For COCO or any in-the-wild human dataset without ground truth 3D, the only available option is unpaired. If you train on both COCO and Human3.6M, that is technically paired bc 3D is available for Human3.6M.
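The paired/unpaired split described above can be sketched as follows. This is a minimal, framework-agnostic sketch, not the actual HMR code; the function name, weights, and the per-sample `has_gt3d` mask are all hypothetical:

```python
def total_loss(kp2d_loss, kp3d_loss, adv_loss, has_gt3d, w3d=1.0, w_adv=1.0):
    """Hypothetical sketch of the paired vs. unpaired objective.

    kp2d_loss, kp3d_loss: per-sample 2D/3D losses (lists of floats).
    has_gt3d: per-sample bool mask; samples from datasets with ground-truth 3D
    (e.g. Human3.6M) are True. In the unpaired setting every entry is False,
    so the 3D term drops out and only the 2D reprojection loss plus the
    adversarial prior remain.
    """
    per_sample = [
        k2 + (w3d * k3 if gt else 0.0)
        for k2, k3, gt in zip(kp2d_loss, kp3d_loss, has_gt3d)
    ]
    return sum(per_sample) / len(per_sample) + w_adv * adv_loss
```

With this shape, training on COCO + Human3.6M is "paired" simply because some samples carry `has_gt3d=True`; setting the mask to all-False everywhere reproduces the unpaired setting without changing the datasets.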

Best,

Angjoo

russoale commented 4 years ago

Thanks for the quick reply.

So in the unpaired setting, the encoder's loss will be calculated as `encoder_loss = kp2d_loss + encoder_disc_loss`, where `encoder_disc_loss` is still computed over the combined `encoder_theta` (predictions from all IEF loop iterations) + `gt_theta` (from CMU or jointLim)?
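The encoder objective asked about above can be sketched as follows. This is a hypothetical sketch, not the HMR code: the function name is made up, and an LSGAN-style adversarial term (the paper's formulation) is assumed for the discriminator scores:

```python
def encoder_loss(kp2d_loss, disc_scores_per_iter, w_adv=1.0):
    """Hypothetical sketch of the unpaired encoder objective.

    kp2d_loss: scalar 2D keypoint reprojection loss.
    disc_scores_per_iter: discriminator outputs D(theta) for the prediction
    from each IEF iteration. The encoder wants these pushed toward 1
    (LSGAN-style); the discriminator itself is trained separately, seeing
    encoder thetas as "fake" and mesh-pool thetas (CMU / jointLim) as "real".
    """
    adv = sum((s - 1.0) ** 2 for s in disc_scores_per_iter) / len(disc_scores_per_iter)
    return kp2d_loss + w_adv * adv
```

Note that `gt_theta` from the mesh pool only enters the discriminator's own loss; the encoder never sees it directly, only through the discriminator's scores.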

russoale commented 4 years ago

Hi @akanazawa,

I think I might have a pretty close implementation to yours, based on TF 2.1 with Keras. I just have two questions:

  1. For the unpaired setting, did you train on the same datasets and simply disable the 3D ground-truth supervision via a flag?

  2. Why are the Phoning 2 and 3 sequences of Human3.6M excluded from evaluation?

Best regards!

akanazawa commented 4 years ago

Great!

  1. Yes, I believe I experimented with the same dataset but used a flag that disables all 3D ground truth for the unpaired setting.

  2. I forget about this pre-processing... Why can't Phoning 2 and 3 be used for evaluation? I recall there may have been a known issue with one of the videos being corrupt, but I'm not sure if this is related. Anyhow, the processing code for this dataset is available here, and that is probably more insightful than my memory :).

Best,

Angjoo

russoale commented 4 years ago
  1. Great, I will train again and then evaluate.

  2. Good question. Phoning 2 and 3 should be included, but as far as I can tell (without having looked at read_human36m.py), the trial_ids specify the indices of the given sequences. Unfortunately, the link you provided results in a 404. I assume it's a private repo. Could you maybe create a public gist?

Thanks for the support!

akanazawa commented 4 years ago

Oops I meant to link to this one

Thanks!

A