boycehbz / HumanInteraction

The code for CVPR 2024 paper "Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption"

Image Feature Extraction Process in Hi4D Dataset #4

Open yisuanwang opened 1 month ago

yisuanwang commented 1 month ago

Thank you for your great work. I have been working with the Hi4D dataset and specifically analyzing the test.pkl file for each motion case. After loading the image features from test.pkl, I noticed that their shape is (140, 2, 1024).
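For reference, this is roughly how I loaded the file (a minimal sketch; the `features` key is a guess at the field name inside test.pkl and may not match the actual layout):

```python
import pickle

import numpy as np

# Load the per-case annotation file.
with open("test.pkl", "rb") as f:
    data = pickle.load(f)

# NOTE: "features" is a hypothetical key name; the actual field
# inside test.pkl may be named differently.
features = np.asarray(data["features"])
print(features.shape)  # -> (140, 2, 1024)
```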

The first dimension (140) seems to correspond to the number of frames in each case of the Hi4D dataset. However, I am unclear about the meaning of the subsequent dimensions (2, 1024). The paper mentions:

> Specifically, we adopt a ViT [7] to extract image features for a single person, and then combine the image features and bounding-box information to regress SMPL parameters with a transformer decoder.

Could you please clarify how the (2, 1024) image features were extracted from the Hi4D images? Additionally, would it be possible to share the implementation code for this specific step so that I can test it on custom videos?

Thank you for your time and help!

boycehbz commented 1 month ago

(2, 1024) denotes the 2 characters, with a 1024-dimensional feature for each. You can follow the HMR2.0 (ICCV 2023) approach to train a ViT backbone on a single-person human mesh recovery task, and then use that backbone for feature extraction. I think it is also fine to use the pretrained HMR2.0 backbone directly for extraction.
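For anyone who needs it, here is a minimal sketch of the per-person extraction in plain PyTorch. `vit_backbone` is a placeholder for whatever ViT you train or borrow (assumed here to return patch tokens of shape `(1, N, 1024)`), and the square crop plus mean pooling are illustrative choices, not necessarily the exact pipeline behind test.pkl:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_person_features(frames, boxes, vit_backbone, crop_size=256):
    """frames: (T, 3, H, W) float tensor; boxes: (T, P, 4) xyxy boxes,
    one per character (P=2 for Hi4D). Returns (T, P, 1024) features."""
    H, W = frames.shape[-2:]
    feats = []
    for t in range(frames.shape[0]):
        per_person = []
        for p in range(boxes.shape[1]):
            # Clamp the box to the image bounds before cropping.
            x1, y1, x2, y2 = boxes[t, p].round().long().tolist()
            x1, y1 = max(x1, 0), max(y1, 0)
            x2, y2 = min(x2, W), min(y2, H)
            crop = frames[t:t + 1, :, y1:y2, x1:x2]        # single-person crop
            crop = F.interpolate(crop, size=(crop_size, crop_size),
                                 mode="bilinear", align_corners=False)
            tokens = vit_backbone(crop)                    # assumed (1, N, 1024)
            per_person.append(tokens.mean(dim=1).squeeze(0))  # pool to (1024,)
        feats.append(torch.stack(per_person))              # (P, 1024)
    return torch.stack(feats)                              # (T, P, 1024)
```

If the backbone you pick has a different token width (ViT variants differ), add a linear projection so the pooled feature matches the 1024-dimensional layout in test.pkl.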