ai4r / Gesture-Generation-from-Trimodal-Context

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020)

Code for obtaining the audio features #21

Closed UttaranB127 closed 3 years ago

UttaranB127 commented 3 years ago

Hello, thanks for your response to issue #18. I went through the processing code for the YouTube Gesture Dataset (https://github.com/youngwoo-yoon/youtube-gesture-dataset). I found the processing code for most of the modalities, but could not find the processing code for the audio features. Could you provide some pointers for obtaining them? Specifically, how are the values for the keys 'audio_raw' and 'audio_feat' in the video dictionary (stored in the .mdb files) obtained from sound files, e.g., .wav?

youngwoo-yoon commented 3 years ago

Hello Bhattacharya, audio_feat holds mel-spectrogram features, but those features were not used in the final model. Here is the code snippet I used. It is not runnable as-is, but it should be enough to understand how the audio features were obtained.

import math

import librosa
import numpy as np

def extract_audio_feat(self, video_total_frames, video_start_frame, video_end_frame):
    # region of interest: map the video frame range onto the audio sample
    # range (despite their names, start_frame/end_frame index audio samples;
    # self.n is the total number of samples in the waveform self.y,
    # loaded at sampling rate self.sr)
    start_frame = math.floor(video_start_frame / video_total_frames * self.n)
    end_frame = math.ceil(video_end_frame / video_total_frames * self.n)
    y_roi = self.y[start_frame:end_frame]

    # feature extraction: log-scaled mel spectrogram
    melspec = librosa.feature.melspectrogram(
        y=y_roi, sr=self.sr, n_fft=1024, hop_length=512, power=2)
    log_melspec = librosa.power_to_db(melspec, ref=np.max)  # mels x time

    # cast to float16 to halve the storage size
    log_melspec = log_melspec.astype('float16')
    y_roi = y_roi.astype('float16')

    return log_melspec, y_roi

audio_raw and audio_feat were extracted from the original TED videos.

UttaranB127 commented 3 years ago

Hello Youngwoo,

Thanks for the snippet; I think I follow most of it. How is the variable y obtained from the video file?

youngwoo-yoon commented 3 years ago

You can download the audio tracks from the YouTube videos, and the variable y comes from loading the audio file: y, sr = librosa.load('test.mp3', mono=True, sr=16000, res_type='kaiser_fast')

UttaranB127 commented 3 years ago

Ah fantastic, thanks a lot!