Vegetebird closed this issue 3 years ago
Hi~ With downsample set to 1, the performance is worse than VideoPose. This is because the Human3.6M dataset is captured at 50 Hz, so adjacent frames contain a lot of redundant information, which does not help the extraction of spatial features. Setting downsample to 5 helps the network learn the diversity of the 3D pose spatial structure. For our results on custom in-the-wild videos, we run inference without downsampling the videos.
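To make the downsampling point concrete, here is a minimal sketch of what a temporal stride does to a 50 Hz sequence. It is only an illustration and not the repository's actual data loader: the function name `downsample_frames`, the 100-frame dummy sequence, and the variable names are all assumptions.

```python
def downsample_frames(frames, stride):
    """Keep every `stride`-th frame of a sequence (hypothetical helper)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


CAPTURE_HZ = 50  # Human3.6M capture rate, as stated in the reply above

# Dummy sequence of frame indices standing in for per-frame poses (assumption).
frames = list(range(100))

kept_stride_1 = downsample_frames(frames, 1)  # no downsampling
kept_stride_5 = downsample_frames(frames, 5)  # downsample set to 5

# Effective frame rate and time gap between consecutive kept frames.
effective_hz = CAPTURE_HZ / 5
gap_seconds = 5 / CAPTURE_HZ

print(len(kept_stride_1), len(kept_stride_5), effective_hz, gap_seconds)
```

With a stride of 5, consecutive training frames are 0.1 s apart instead of 0.02 s, so neighbouring samples differ more in pose, which is the redundancy argument made above.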
Hi~
The paper trains the network with downsample set to 5, but VideoPose uses a downsample of 1. When I train the network with downsample set to 1, the performance is worse than VideoPose.
So I would like to know: what performance did you get when training with downsample set to 1?