A Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video (GAST-Net)

News

[2021/01/28] We updated GAST-Net so that it can generate 19-joint human poses, including body and foot joints. [DEMO]
[2020/11/17] We provide a tutorial on how to generate 3D poses/animation from a custom video. [INFERENCE_EN.md]
[2020/10/15] We achieve online 3D skeleton-based action recognition with a single RGB camera. [video][code]
[2020/08/14] We achieve real-time 3D pose estimation. [video]

Introduction

Spatio-temporal information is key to resolving occlusion and depth ambiguity in 3D pose estimation. Previous methods have focused on either temporal contexts or local-to-global architectures that embed fixed-length spatio-temporal information. To date, there have been no effective proposals that simultaneously and flexibly capture varying spatio-temporal sequences and achieve real-time 3D pose estimation. In this work, we improve the learning of kinematic constraints in the human skeleton (posture, local kinematic connections, and symmetry) by modeling local and global spatial information via attention mechanisms. To adapt to single- and multi-frame estimation, a dilated temporal model is employed to process varying skeleton sequences. Importantly, we also carefully design the interleaving of spatial semantics with temporal dependencies to achieve a synergistic effect. To this end, we propose a simple yet effective graph attention spatio-temporal convolutional network (GAST-Net) that comprises interleaved temporal convolutional and graph attention blocks. Combined with the proposed method, we introduce a real-time strategy for online 3D skeleton-based action recognition with a simple RGB camera. Experiments on two challenging benchmark datasets (Human3.6M and HumanEva-I) and on YouTube videos demonstrate that our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to upper-half-body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
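
To make the dilated temporal model more concrete, the sketch below computes how many input frames a single output pose can see when dilated 1D convolutions are stacked. The function name `temporal_receptive_field` and the example configurations (kernel size 3, dilations growing by a factor of 3) are illustrative assumptions, not settings read from this repository.

```python
def temporal_receptive_field(kernel_sizes, dilations):
    """Frames seen by one output frame of a stack of dilated 1D
    convolutions (stride 1, no padding)."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("need exactly one dilation per layer")
    frames = 1  # with no temporal layers, the model sees a single frame
    for kernel, dilation in zip(kernel_sizes, dilations):
        # each layer widens the window by (kernel - 1) * dilation frames
        frames += (kernel - 1) * dilation
    return frames


# Hypothetical configurations, chosen only to illustrate the growth.
single_frame = temporal_receptive_field([], [])
three_layers = temporal_receptive_field([3, 3, 3], [1, 3, 9])
four_layers = temporal_receptive_field([3, 3, 3, 3], [1, 3, 9, 27])
print(single_frame, three_layers, four_layers)  # 1 27 81
```

Under these assumptions the window grows exponentially with depth, which is why the same design can serve both single-frame and multi-frame estimation by changing only the number of temporal layers.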
Project Website: http://www.juanrojas.net/gast/

Framework

Dependencies

Data preparation

Training & Testing

Download our pretrained models from the model zoo (GoogleDrive or BaiduDrive (ietc)).

Reconstruct 3D poses from 2D keypoints

Reconstruct 3D poses from 2D keypoints estimated by a 2D detector (Mask RCNN, HRNet, OpenPose, etc.), and visualize them.
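
2D-to-3D lifting models usually take keypoints in normalized image coordinates rather than raw pixels, so that detections from videos of different resolutions are comparable. The sketch below shows one common convention: x is mapped to [-1, 1] and y is scaled by the same factor, which preserves the aspect ratio. The function name and the convention itself are assumptions for illustration; check the repository's own preprocessing code before relying on it.

```python
def normalize_screen_coordinates(keypoints, width, height):
    """Map pixel (x, y) pairs so that x spans [-1, 1] and y is scaled
    by the same factor, keeping the image aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return [
        (x / width * 2.0 - 1.0, y / width * 2.0 - height / width)
        for x, y in keypoints
    ]


# Three hypothetical detections in a 1920x1080 frame:
# top-left corner, image centre, bottom-right corner.
frame = [(0.0, 0.0), (960.0, 540.0), (1920.0, 1080.0)]
print(normalize_screen_coordinates(frame, 1920, 1080))
# [(-1.0, -0.5625), (0.0, 0.0), (1.0, 0.5625)]
```

In a real pipeline the same mapping would be applied to every joint in every frame before the sequence is passed to the 3D model.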

If you want to reproduce the baseball example (17 joints, body joints only), please run the following code:

If you want to reproduce the baseball example (19 joints, including body and toe joints), please run the following code:

How to generate 3D human poses from a custom video

We provide a tutorial on how to run our model on custom videos. See INFERENCE.md for more details.
