facebookresearch / VideoPose3D

Efficient 3D human pose estimation in video using 2D keypoint trajectories

Further data augmentation in training? #103

Closed ahmetnarman closed 4 years ago

ahmetnarman commented 4 years ago

Hello,

First of all, thanks a lot for creating this work and making it open source!

During training, you apply a horizontal flip to improve generalization performance. Why not apply other types of augmentation? I was wondering whether techniques such as translation and scaling in the normalized image space could be useful.

For the H3.6M dataset, this may not improve accuracy much, because all subjects perform their actions in the same space with relatively similar body scales, i.e. the training and testing data already have similar distributions in image space. However, for videos in the wild, where the subject's scale and position in the frame are less regular, other augmentation techniques could make the network invariant to such changes in data distribution.

dariopavllo commented 4 years ago

Hi,

Doing data augmentation in the way you describe may actually degrade the performance of the model. Since people are relatively close to the camera, there is some degree of perspective distortion (foreshortening) that depends on the position of the person on the screen. The same applies to scaling (far keypoints are scaled towards the center of the screen). This information is actually exploited by the model to make a more accurate prediction, and was one of our arguments for using a coordinate-based model instead of a convolutional model (which is shift-invariant and therefore cannot easily deal with perspective distortion). Horizontal flipping is a safe transformation. Other safe transformations include varying non-linear lens distortion (which can be applied as a post-processing step on the image).
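To make the perspective argument concrete, here is a minimal sketch using a toy pinhole camera. It is not code from the repository: the `project`, `limb_2d`, `shift_x` and `mirror_x` helpers, the focal length of 1, the principal point at the image center and the joint coordinates are all assumptions chosen for illustration. It compares the 2D offset between two joints before and after the person moves sideways in 3D, and after the scene is mirrored.

```python
def project(point, f=1.0):
    """Pinhole projection of a 3D point (X, Y, Z), principal point at (0, 0)."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)


def limb_2d(joint_a, joint_b):
    """2D vector from the projection of joint_a to the projection of joint_b."""
    ax, ay = project(joint_a)
    bx, by = project(joint_b)
    return (bx - ax, by - ay)


def shift_x(point, dx):
    """Move a 3D point sideways by dx (hypothetical helper)."""
    X, Y, Z = point
    return (X + dx, Y, Z)


def mirror_x(point):
    """Mirror a 3D point about the camera's vertical plane (hypothetical helper)."""
    X, Y, Z = point
    return (-X, Y, Z)


# Toy limb: hip 3 m from the camera, hand 0.2 m to the side and 0.5 m closer.
hip = (0.0, 0.0, 3.0)
hand = (0.2, 0.0, 2.5)

# Person centered in the frame.
limb_centered = limb_2d(hip, hand)

# Same pose, person moved 1 m sideways in 3D.
limb_shifted = limb_2d(shift_x(hip, 1.0), shift_x(hand, 1.0))

# Same shifted pose, mirrored in 3D.
limb_mirrored = limb_2d(mirror_x(shift_x(hip, 1.0)), mirror_x(shift_x(hand, 1.0)))

print("centered:", limb_centered)
print("shifted: ", limb_shifted)
print("mirrored:", limb_mirrored)
```

In this toy setup the projected limb is noticeably longer once the person moves off-center, so translating the 2D keypoints (which would keep the limb vector unchanged) produces an input that no real 3D pose at that screen position would generate. Mirroring the 3D scene, by contrast, simply negates the 2D x-coordinates, which is why a horizontal flip stays consistent with the camera model. A flip on real skeleton data would also need to swap left and right joint labels, which this sketch omits.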

If you assume that people are always far away from the camera, you can also assume an orthographic camera model, and then it makes sense to perform scaling/translation, but at that point you may as well normalize the poses to be always centered.
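The orthographic case can be sketched the same way. Again, this is only an illustration under assumptions: `project_orthographic` and `center_pose` are hypothetical helpers, the root joint is taken to be index 0, and the coordinates are made up. Under an orthographic projection depth is ignored, so a sideways move in 3D becomes a pure 2D translation, and subtracting the root joint removes it.

```python
def project_orthographic(point, scale=1.0):
    """Orthographic projection: drop depth, apply a uniform scale."""
    X, Y, _Z = point
    return (scale * X, scale * Y)


def center_pose(keypoints_2d, root_index=0):
    """Express all 2D keypoints relative to the root joint (assumed index 0)."""
    rx, ry = keypoints_2d[root_index]
    return [(x - rx, y - ry) for x, y in keypoints_2d]


# Toy three-joint pose and the same pose moved sideways and upward in 3D.
pose_3d = [(0.0, 0.0, 3.0), (0.2, 0.0, 2.5), (-0.1, 0.4, 3.2)]
shifted_3d = [(X + 1.0, Y - 0.5, Z) for X, Y, Z in pose_3d]

pose_2d = [project_orthographic(p) for p in pose_3d]
shifted_2d = [project_orthographic(p) for p in shifted_3d]

# After centering, both versions agree up to floating-point rounding.
centered_a = center_pose(pose_2d)
centered_b = center_pose(shifted_2d)

for a, b in zip(centered_a, centered_b):
    print(a, b)
```

Since centering already maps every translated copy to the same input, translation augmentation would add nothing the normalization does not already provide under this camera assumption.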

Can't say much about performance in the wild. If the perturbations are small, it may work just fine (empirically speaking).

ahmetnarman commented 4 years ago

Now it makes more sense, thank you for the reply.