ai4r / Gesture-Generation-from-Trimodal-Context

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020)

mean_dir_vec and mean_pose #36

Closed YoungSeng closed 2 years ago

YoungSeng commented 2 years ago

Hello!

How do I get the 3D joint positions of each frame ("pose_seq") and the bone direction vectors ("vec_seq") from a .bvh file? And how do I get the mean_dir_vec and mean_pose values in the config? The LMDB-making process seems to produce only data_mean and data_std.

youngwoo-yoon commented 2 years ago

Hello, you can use PyMo to convert BVH files to 3D joint position data, and then build datasets in LMDB format. This is the make_lmdb script for the TED dataset: https://gist.github.com/youngwoo-yoon/0d5ae4e375aba9df10e75805bdf60ddd. It is for your reference; you will need to make your own version.
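
A minimal sketch of that PyMo step (the BVH file name is a placeholder, and the exact joint columns depend on your skeleton):

from pymo.parsers import BVHParser
from pymo.preprocessing import MocapParameterizer

mocap = BVHParser().parse('sample.bvh')  # placeholder file name

# convert joint rotation channels to world-space 3D joint positions
positions = MocapParameterizer('position').fit_transform([mocap])[0]

# positions.values is a frames x channels DataFrame with columns such as
# 'Hips_Xposition', 'Hips_Yposition', 'Hips_Zposition'
print(positions.values.shape)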

mean_pose is the mean of all 3D poses in the training set. mean_dir_vec is the mean of all directional vectors, which are the output of convert_pose_seq_to_dir_vec.
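
For illustration, a rough NumPy sketch (the pose array is random stand-in data; convert_pose_seq_to_dir_vec is assumed to be importable from utils/data_utils.py in this repository, and the shapes assume the 10-joint TED skeleton):

import numpy as np
from utils.data_utils import convert_pose_seq_to_dir_vec  # path may differ in your checkout

# stand-in for all 3D poses in the training set, shape (N, n_joints, 3)
poses = np.random.randn(1000, 10, 3).astype(np.float32)

mean_pose = poses.reshape(len(poses), -1).mean(axis=0)  # flattened per-joint mean

# one unit direction vector per bone, as produced by convert_pose_seq_to_dir_vec
dir_vecs = convert_pose_seq_to_dir_vec(poses)
mean_dir_vec = dir_vecs.reshape(len(dir_vecs), -1).mean(axis=0)

These flattened means correspond to the mean_pose and mean_dir_vec entries in the config.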

anddyhzw commented 2 years ago

Hello! Could you please send me the code containing AudioWrapper? I used VideoPose3D to convert 2D poses into 3D poses successfully, but I still need to extract audio features to finish building the training dataset. Or could you publish the entire 3D preprocessing code? I want to produce a Chinese dataset following this data-processing pipeline. Thank you very much!

youngwoo-yoon commented 2 years ago

Hello,

Here is the AudioWrapper class that was missing from the code above. It extracts audio features (mel-spectrogram features, specifically), but I never used them in this study; the audio encoder network takes raw audio signals as input.

import math

import librosa
import numpy as np


class AudioWrapper:
    """Loads an audio file and extracts features for a given video clip."""

    def __init__(self, filepath):
        # load mono audio, resampled to 16 kHz
        self.y, self.sr = librosa.load(filepath, mono=True, sr=16000, res_type='kaiser_fast')
        self.n = len(self.y)  # total number of audio samples

    def extract_audio_feat(self, video_total_frames, video_start_frame, video_end_frame):
        # region of interest: map video frame indices to audio sample indices
        start_sample = math.floor(video_start_frame / video_total_frames * self.n)
        end_sample = math.ceil(video_end_frame / video_total_frames * self.n)
        y_roi = self.y[start_sample:end_sample]

        # feature extraction: log-scaled mel spectrogram
        melspec = librosa.feature.melspectrogram(
            y=y_roi, sr=self.sr, n_fft=1024, hop_length=512, power=2)
        log_melspec = librosa.power_to_db(melspec, ref=np.max)  # mels x time

        # store at half precision to save space
        log_melspec = log_melspec.astype('float16')
        y_roi = y_roi.astype('float16')

        return log_melspec, y_roi
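
A hypothetical usage sketch (the file name and frame numbers are placeholders):

audio = AudioWrapper('clip_audio.wav')  # placeholder path
# take frames 120..240 of a 3000-frame video; frame indices are mapped
# proportionally onto audio sample indices inside extract_audio_feat
log_melspec, raw_audio = audio.extract_audio_feat(3000, 120, 240)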

anddyhzw commented 2 years ago

Thank you very much!