fabro66 / GAST-Net-3DPoseEstimation

A Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video (GAST-Net)
MIT License

Questions about downsample #26

Closed Vegetebird closed 3 years ago

Vegetebird commented 3 years ago

Hi~

The paper trained the network with downsample set to 5, but VideoPose uses downsample 1. When I train the network with downsample set to 1, the performance is worse than VideoPose.

So I want to know: what performance did you get when training with downsample set to 1?

fabro66 commented 3 years ago

Hi~ With downsample set to 1, the performance is worse than VideoPose. This is because the Human3.6M dataset is captured at 50 Hz, so adjacent frames contain a lot of redundant information, which is not conducive to extracting spatial features. Setting downsample to 5 helps the network learn the diversity of 3D pose spatial structures. For our results on custom in-the-wild videos, we run inference without downsampling the videos.
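
To make the frame-rate point concrete, here is a minimal sketch of what temporal downsampling does to a 50 Hz sequence. It is only an illustration, not the repository's data loader: the helper names `downsample_sequence` and `mean_step`, the 17-joint layout, and the synthetic sine motion are all assumptions made for this example.

```python
import numpy as np


def downsample_sequence(poses, stride):
    """Keep every `stride`-th frame along the time axis (axis 0)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return poses[::stride]


def mean_step(poses):
    """Mean absolute change between consecutive frames."""
    return float(np.mean(np.abs(np.diff(poses, axis=0))))


source_fps = 50          # capture rate stated in the thread
num_frames = 500         # 10 seconds of synthetic data
num_joints = 17          # assumed joint count, for illustration only

# Synthetic smooth motion: every coordinate follows a 1 Hz sine wave.
t = np.arange(num_frames) / source_fps
poses = np.sin(2 * np.pi * t)[:, None, None] * np.ones((1, num_joints, 3))

full = downsample_sequence(poses, 1)     # downsample = 1
sparse = downsample_sequence(poses, 5)   # downsample = 5
effective_fps = source_fps / 5

print(full.shape, sparse.shape, effective_fps)
print(mean_step(full), mean_step(sparse))
```

With a stride of 5 the sequence keeps one frame in five, so the effective rate drops from 50 Hz to 10 Hz and consecutive training frames differ more from each other, which is the reduced redundancy described above.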