ViTAE-Transformer / ViTPose

The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
Apache License 2.0

Inconsistency in Reported Experimental Results for ViTPose and ViTPose++ Across Papers #129

Open Janus-Shiau opened 6 months ago

Janus-Shiau commented 6 months ago

Hello ViTPose Team,

Firstly, I'd like to express my admiration for your work on ViTPose and ViTPose++. These are significant achievements in the field of human pose estimation, with robust and impressive performance.

However, I am reaching out to ask about an inconsistency in the experimental results reported across two of your papers: ViTPose and ViTPose++.

In the initial release of the ViTPose++ paper, I noticed that ViTPose appeared to perform better on the OCHuman dataset compared to ViTPose++. However, in the updated ViTPose++ paper from December 2023, new data was included for ViTPose in the OCHuman results, which differed from the original ViTPose data.

Specifically, when comparing Table 11 from the ViTPose paper with Table 15 from the ViTPose++ paper, the ViTPose results on OCHuman differ by more than 25 AP, even though the results for the other SOTA methods remain consistent.

Clarifying these differences is important for others in the field and would greatly enhance the understanding and application of your valuable work. Could you please provide some insight into this discrepancy?

Thank you for your time and effort.

Annbless commented 6 months ago

Hi, thanks for pointing this out.

In the ViTPose paper, the results are reported using the multi-task training setting. In our ViTPose++ paper (Table 15), the ViTPose results are for the model trained on a single dataset (MS COCO in this case).

Janus-Shiau commented 6 months ago

Hi @Annbless,

Thank you for your prompt and clear explanation. I realize now that I overlooked this detail when comparing the data. This clarification will surely be beneficial for anyone delving into the nuances of your impactful work.

Thanks again for your guidance and contributions to the field.

Reference

To assist others who might have similar questions in the future, I've extracted the relevant text from both papers.

From the ViTPose paper, section A:

"Please note that the ViTPose variants are trained under the multi-dataset training setting and tested directly without further finetuning on the specific training dataset, to keep the whole pipeline as simple as possible."

From the ViTPose++ paper, section 4.5.2:

"Note that the ViTPose++ models are trained with the combination of all the datasets and directly tested on the target dataset without further fine-tuning, which keeps the whole pipeline as simple as possible. For each dataset, we use the corresponding FFN, decoder, and prediction head in ViTPose for prediction. We also provide the ViTPose baseline results. It’s worth highlighting that, despite using the same number of parameters for inference, ViTPose++ utilizes much fewer parameters during training compared with training individual ViTPose models for each dataset."
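The parameter comparison in the last quote can be illustrated with some simple counting. This is only a rough sketch: it assumes each model splits into a part shared across datasets and a dataset-specific part (FFN, decoder, and prediction head), and the function name and all sizes are hypothetical, not figures from either paper.

```python
def training_param_counts(shared: int, task_specific: int, num_datasets: int) -> dict:
    """Compare parameter totals for two training setups (illustrative only)."""
    # Separate models: every dataset gets its own full copy of the network.
    separate = num_datasets * (shared + task_specific)
    # Joint model: one shared part plus one dataset-specific part per dataset.
    joint = shared + num_datasets * task_specific
    # At inference, either setup runs one shared part and one dataset-specific part.
    inference = shared + task_specific
    return {"separate": separate, "joint": joint, "inference": inference}


# Hypothetical sizes in millions of parameters, NOT taken from either paper.
counts = training_param_counts(shared=80, task_specific=10, num_datasets=6)
print(counts)
```

Under these made-up sizes, the per-dataset inference cost is the same in both setups, while the joint setup trains far fewer parameters in total, which matches the claim in the quoted passage.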