Pose Estimation Model Metrics: Inconsistencies between PeekingDuck Docs and TensorFlow Website

saifkhichi96 commented 1 year ago

Hello PeekingDuck team,

I have recently explored the PeekingDuck framework and noticed some inconsistencies in the reported metrics for pose estimation models, particularly MoveNet and PoseNet, when comparing them with the results on the TensorFlow website.

In the PeekingDuck documentation (link), the Average Precision (AP) for MoveNet is stated as 7.3. However, the TensorFlow website (link) indicates an AP of 57.4 even for the quantized version of the same model. This difference suggests that the model's average precision in PeekingDuck is significantly lower than expected, yet your docs state that "The evaluation metrics have been compared with the original repository of the respective pose estimation models for consistency." Which "original repository" were the PoseNet and MoveNet metrics compared with for consistency?

Could you kindly provide some insights into this discrepancy? Are there any variations in the evaluation setup or methods that might account for this substantial difference in reported metrics? Understanding the reasons behind these contrasting results is crucial for accurately assessing the performance of the models implemented in PeekingDuck.

Thank you for your help and support!

ongtw commented 1 year ago

Hi, the metrics are different because the TensorFlow website states that "Accuracy (mAP) numbers are measured on a subset of the COCO dataset in which we filter and crop each image to contain only one person".

PeekingDuck's reported AP, by contrast, is computed across the entire COCO dataset (not just a subset), including images with multiple persons (not just one person). So the AP numbers from PeekingDuck's docs and the mAP numbers from the TensorFlow website cannot be directly compared.

However, if you look at the relative numbers, the MoveNet model is indeed better than the PoseNet model.
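To make the difference between the two evaluation sets concrete, here is a minimal sketch of how a single-person subset could be selected from COCO-style annotations. It assumes the standard COCO annotation fields `image_id` and `category_id`, with category id 1 for "person". The helper `single_person_image_ids` and the toy data are hypothetical, and the exact filtering and cropping procedure TensorFlow used is not described in this thread.

```python
from collections import Counter

PERSON_CATEGORY_ID = 1  # "person" in the COCO category list


def single_person_image_ids(annotations):
    """Return the ids of images that contain exactly one annotated person."""
    counts = Counter(
        ann["image_id"]
        for ann in annotations
        if ann["category_id"] == PERSON_CATEGORY_ID
    )
    return {image_id for image_id, n in counts.items() if n == 1}


# Toy annotations in the COCO layout (hypothetical data, not real COCO entries).
toy_annotations = [
    {"image_id": 1, "category_id": 1},
    {"image_id": 2, "category_id": 1},
    {"image_id": 2, "category_id": 1},  # image 2 has two persons
    {"image_id": 3, "category_id": 1},
]

subset = single_person_image_ids(toy_annotations)
print(sorted(subset))  # [1, 3]
```

Evaluating only on such a subset removes the multi-person images, which is one reason a number measured this way is not comparable with one measured on the full dataset.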

saifkhichi96 commented 1 year ago

But wouldn't it be better if the results were reported in a similar way to the TensorFlow website? It is my understanding that it is common practice for most human pose estimation models to use a detector first to crop out the people and then compute the accuracy.
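The crop-then-evaluate step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the box is given in COCO `[x, y, width, height]` format, keypoints are rows of `(x, y, visibility)`, and the helper `crop_person` is hypothetical. It is not how PeekingDuck or TensorFlow implement their evaluation, and in practice the box would come from a person detector.

```python
import numpy as np


def crop_person(image, keypoints, bbox):
    """Crop one person out of an image and shift keypoints into crop coordinates.

    image:     H x W x C array
    keypoints: K x 3 array of (x, y, visibility)
    bbox:      [x, y, width, height], as in COCO annotations
    """
    x, y, w, h = (int(round(v)) for v in bbox)
    crop = image[y:y + h, x:x + w]
    shifted = np.asarray(keypoints, dtype=float).copy()
    shifted[:, 0] -= x  # x relative to the crop's left edge
    shifted[:, 1] -= y  # y relative to the crop's top edge
    return crop, shifted


# Hypothetical example: a blank 100 x 200 image with one keypoint.
image = np.zeros((100, 200, 3), dtype=np.uint8)
keypoints = np.array([[60.0, 30.0, 2.0]])
crop, shifted = crop_person(image, keypoints, bbox=[50, 20, 40, 60])
print(crop.shape)  # (60, 40, 3)
```

The pose model would then be run on `crop`, and its predictions compared against `shifted`, so each person is scored in isolation.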

aisingapore / PeekingDuck #750