Walter0807 / MotionBERT

[ICCV 2023] PyTorch Implementation of "MotionBERT: A Unified Perspective on Learning Human Motion Representations"
Apache License 2.0

how to get the action recognition result for custom videos #23

Closed Xinxinatg closed 1 year ago

Xinxinatg commented 1 year ago

Hey, thanks for this wonderful work; the performance of the 2D-to-3D reconstruction is just eye-opening. I am wondering whether the action recognition inference code for custom videos has been released yet. I can only find the evaluation code for action recognition, which is meant for the NTU-RGB+D dataset.

Walter0807 commented 1 year ago

Hi, thanks for your interest in our work! We did not release that because the action categories are limited by the definition of NTU-RGB+D, so it would not be generally applicable. Nonetheless, it should be easy to implement given the WildDetDataset class. You can combine the data part in infer_wild.py and the inference part in train_action.py for that.
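A rough sketch of that combination might look like the following. The model construction, checkpoint loading, file paths, and tensor layout below are placeholders to verify against `train_action.py` and `infer_wild.py`; `build_action_model()` is a hypothetical stand-in for the backbone + action-head setup in `train_action.py`:

```python
import torch
from torch.utils.data import DataLoader
from lib.data.dataset_wild import WildDetDataset  # data part, as in infer_wild.py

# Build the backbone + action head and load an NTU-RGB+D checkpoint exactly as
# train_action.py does; build_action_model() is a hypothetical stand-in for that code.
model = build_action_model()
model.eval()

# 2D keypoints for the custom video, e.g. an AlphaPose result JSON (placeholder path).
dataset = WildDetDataset('alphapose-results.json', clip_len=243)
loader = DataLoader(dataset, batch_size=1, shuffle=False)

with torch.no_grad():
    for kpts in loader:              # (1, T, 17, 3): x, y, confidence per joint
        kpts = kpts.unsqueeze(1)     # add a person axis if the head expects (N, M, T, 17, 3)
        logits = model(kpts)         # scores over the NTU-RGB+D action classes
        print('predicted class id:', logits.argmax(dim=-1).item())
```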

Xinxinatg commented 1 year ago

Thanks for your prompt reply! I have another question: the representation is obtained in this way:

```
x: 2D skeletons
    type = <class 'torch.Tensor'>
    shape = [batch size * frames * joints(17) * channels(3)]

MotionBERT: pretrained human motion encoder
    type = <class 'lib.model.DSTformer.DSTformer'>

E: encoded motion representation
    type = <class 'torch.Tensor'>
    shape = [batch size * frames * joints(17) * channels(512)]
```

As I see from this code, only 2D coordinate information is used to extract the representation. But from what I understand, the action recognition is based on the reconstructed 3D skeleton, which is obtained from the 2D skeleton and the video. I am confused why the video information is not needed to extract the representation.

Walter0807 commented 1 year ago

The motion representation is based on the 2D skeletons; there is no need to estimate the 3D skeletons explicitly. The video information is only used when extracting the 2D skeletons.
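In other words, the pipeline is video → 2D pose estimator → MotionBERT. A minimal sketch of the encoding step, assuming a loaded DSTformer instance named `MotionBERT` as in the snippet quoted above (the random tensor stands in for real 2D keypoints):

```python
import torch

# Stage 1 (uses the video): any 2D pose estimator, e.g. AlphaPose, produces
# per-frame keypoints with confidence scores; a random tensor stands in here.
x = torch.randn(1, 243, 17, 3)  # (batch, frames, joints, [x, y, conf])

# Stage 2 (video no longer needed): encode the 2D skeleton sequence.
# MotionBERT is assumed to be a pretrained lib.model.DSTformer.DSTformer instance.
E = MotionBERT.get_representation(x)  # (1, 243, 17, 512) motion representation

# The action head classifies E directly; no explicit 3D lifting is involved.
```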