facebookresearch / PoseWarper

Learning Temporal Pose Estimation from Sparsely Labeled Videos
Apache License 2.0

I have some questions about your results on Table 2 of your paper #21

Closed: ArchNew closed this issue 4 years ago

ArchNew commented 4 years ago

According to your paper, your detector comes from the 3D Mask R-CNN paper. In that paper, the detector is pretrained on COCO and fine-tuned on PoseTrack 2017. The detector used by HRNet and Simple Baselines (Bin Xiao et al.) is not fine-tuned on PoseTrack 2017. I ran an experiment: I fine-tuned Faster R-CNN on PoseTrack 2017 as the human detector and then ran Simple Baselines on that detector's outputs. I got similar results (81.1 mean mAP). I doubt your "pose aggregation" helped much.

gberta commented 4 years ago

First of all, we don't use a 3D Mask R-CNN. Second, I tried using a human detector fine-tuned on PoseTrack, but it actually worked worse than the COCO-trained detector. Third, temporal pose aggregation is just one of the applications we show in the paper. It doesn't bring huge accuracy gains, but in our experiments the gains of about 0.5-1.0% were very consistent. Using this codebase you should actually get better results than the ones reported in Table 2 (I made some improvements after the paper submission), so you should be getting >82 mAP with temporal pose aggregation.
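For anyone trying to reason about where the 0.5-1.0% could come from: the idea of pooling per-frame joint heatmaps can be sketched as below. This is only an illustration using a plain weighted average; the paper's actual method learns to warp heatmaps between frames before aggregating, which this sketch does not reproduce, and the function name is my own:

```python
import numpy as np

def aggregate_heatmaps(heatmaps, weights=None):
    """Combine per-frame joint heatmaps of shape (F, J, H, W) into a
    single (J, H, W) estimate via a weighted average over frames.

    A uniform average is used by default; this stands in for the
    learned warping + aggregation in the paper (an assumption, not
    the repo's implementation)."""
    heatmaps = np.asarray(heatmaps, dtype=np.float64)
    num_frames = heatmaps.shape[0]
    if weights is None:
        weights = np.full(num_frames, 1.0 / num_frames)
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize so the output stays a heatmap
    # Contract the frame axis: sum_f weights[f] * heatmaps[f]
    return np.tensordot(weights, heatmaps, axes=1)
```

The intuition is that averaging independent per-frame estimates suppresses frame-specific noise, which is consistent with small but steady gains rather than large jumps.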

ArchNew commented 4 years ago

My apologies for any confusion. By "3D Mask R-CNN" I meant the paper "Detect-and-Track: Efficient Pose Estimation in Videos". Quoting your paper: "During testing, we follow the same two-stage framework used in [27,23]: first, we detect the bounding boxes for each person in the image using the detector in [48], and then feed the cropped images to our pose estimation model." Your reference [48] is "Detect-and-Track: Efficient Pose Estimation in Videos", so I concluded that your paper uses its 3D Mask R-CNN.

You say you don't use a 3D Mask R-CNN, so I'm curious how you prepare your human detector. Do you just initialize it from COCO? Could you kindly tell us which detector you use? Sorry to bother you again.

gberta commented 4 years ago

I think the paper in [48] is written in a very confusing manner. They claim to use a 3D Mask R-CNN throughout the paper, but if you read carefully they actually use a 2D Mask R-CNN for most of their experiments, because the 3D Mask R-CNN doesn't fit in memory. They only run very small-scale experiments with the 3D Mask R-CNN at the end. I don't know why the authors presented it in such a misleading fashion (probably to convince the reviewers that there's novelty in their method). We use a standard 2D Mask R-CNN with a ResNet-101 backbone as our human detector.
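The two-stage framework quoted earlier (detect people, then crop and feed the patches to the pose model) amounts to simple glue code between the detector's outputs and the pose network's inputs. A minimal sketch, assuming COCO-style class labels and an illustrative score threshold (neither taken from the PoseWarper codebase):

```python
import numpy as np

PERSON_CLASS = 1    # COCO label id for "person" (assumption: COCO-style labels)
SCORE_THRESH = 0.5  # illustrative confidence cutoff, not the repo's value

def person_crops(image, boxes, labels, scores,
                 person_class=PERSON_CLASS, score_thresh=SCORE_THRESH):
    """Stage-1 -> stage-2 glue: keep confident person detections and
    return the cropped image patches that would go to the pose model.

    `image` is an (H, W, C) array; `boxes` are (x1, y1, x2, y2)."""
    crops = []
    h, w = image.shape[:2]
    for (x1, y1, x2, y2), label, score in zip(boxes, labels, scores):
        if label != person_class or score < score_thresh:
            continue
        # Clamp the box to the image bounds before cropping. A real pose
        # pipeline would also resize and normalize each patch; omitted here.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            crops.append(image[y1:y2, x1:x2])
    return crops
```

The filtering step matters in practice: Mask R-CNN emits boxes for all COCO classes, so non-person detections and low-confidence boxes must be dropped before cropping, or the pose model wastes capacity on background patches.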

ArchNew commented 4 years ago

Thanks for the details! I'm sorry for holding you responsible for the paper; it seems you are one of the developers of the code but not an author of the paper.