boycehbz / HumanInteraction

The code for CVPR 2024 paper "Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption"

Image Feature Extraction Process in Hi4D Dataset #4

Open yisuanwang opened 1 month ago

yisuanwang commented 1 month ago

Thank you for your great work. I have been working with the Hi4D dataset and specifically analyzing the test.pkl file for each motion case. After loading the image features from test.pkl, I noticed that their shape is (140, 2, 1024).
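For reference, this is roughly how I loaded the file (a minimal sketch; the `features` key is a guess at the field name inside test.pkl and may not match the actual layout):

```python
import pickle

import numpy as np

# Load the per-case annotation file.
with open("test.pkl", "rb") as f:
    data = pickle.load(f)

# NOTE: "features" is a hypothetical key name; the actual field
# inside test.pkl may be named differently.
features = np.asarray(data["features"])
print(features.shape)  # -> (140, 2, 1024)
```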

The first dimension (140) seems to correspond to the number of frames in each case of the Hi4D dataset. However, I am unclear about the meaning of the subsequent dimensions (2, 1024). The paper mentions:

> Specifically, we adopt a ViT [7] to extract image features for a single person, and then combine the image features and bounding-box information to regress SMPL parameters with a transformer decoder.

Could you please clarify how the (2, 1024) image features were extracted from the Hi4D images? Additionally, would it be possible to share the implementation code for this specific step so that I can test it on custom videos?

Thank you for your time and help!

boycehbz commented 1 month ago

(2, 1024) denotes the 2 characters, with a 1024-dimensional feature for each. You can follow the HMR2.0 (ICCV 2023) approach to train a ViT backbone on a single-person human mesh recovery task, and then use that backbone for feature extraction. I think it is also fine to use the pretrained HMR2.0 backbone directly for extraction.
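For anyone who needs it, here is a minimal sketch of the per-person extraction in plain PyTorch. `vit_backbone` is a placeholder for whatever ViT you train or borrow (assumed here to return patch tokens of shape `(1, N, 1024)`), and the square crop plus mean pooling are illustrative choices, not necessarily the exact pipeline behind test.pkl:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_person_features(frames, boxes, vit_backbone, crop_size=256):
    """frames: (T, 3, H, W) float tensor; boxes: (T, P, 4) xyxy boxes,
    one per character (P=2 for Hi4D). Returns (T, P, 1024) features."""
    H, W = frames.shape[-2:]
    feats = []
    for t in range(frames.shape[0]):
        per_person = []
        for p in range(boxes.shape[1]):
            # Clamp the box to the image bounds before cropping.
            x1, y1, x2, y2 = boxes[t, p].round().long().tolist()
            x1, y1 = max(x1, 0), max(y1, 0)
            x2, y2 = min(x2, W), min(y2, H)
            crop = frames[t:t + 1, :, y1:y2, x1:x2]        # single-person crop
            crop = F.interpolate(crop, size=(crop_size, crop_size),
                                 mode="bilinear", align_corners=False)
            tokens = vit_backbone(crop)                    # assumed (1, N, 1024)
            per_person.append(tokens.mean(dim=1).squeeze(0))  # pool to (1024,)
        feats.append(torch.stack(per_person))              # (P, 1024)
    return torch.stack(feats)                              # (T, P, 1024)
```

If the backbone you pick has a different token width (ViT variants differ), add a linear projection so the pooled feature matches the 1024-dimensional layout in test.pkl.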