Hello Bhattacharya,
audio_feat holds mel-spectrogram features, but those features were not used in the final model. I am sharing the code snippet I used. It is not runnable as-is, but I believe it is enough to understand how the audio features were extracted.
import math
import librosa
import numpy as np

def extract_audio_feat(self, video_total_frames, video_start_frame, video_end_frame):
    # region of interest: map video frame indices to audio sample indices
    # (self.y and self.sr come from librosa.load; self.n is the number of audio samples)
    start_frame = math.floor(video_start_frame / video_total_frames * self.n)
    end_frame = math.ceil(video_end_frame / video_total_frames * self.n)
    y_roi = self.y[start_frame:end_frame]

    # feature extraction: power mel spectrogram, converted to log scale (dB)
    melspec = librosa.feature.melspectrogram(
        y=y_roi, sr=self.sr, n_fft=1024, hop_length=512, power=2)
    log_melspec = librosa.power_to_db(melspec, ref=np.max)  # mels x time
    log_melspec = log_melspec.astype('float16')
    y_roi = y_roi.astype('float16')

    # DEBUG
    # print('spectrogram shape: ', log_melspec.shape)

    return log_melspec, y_roi
audio_raw and audio_feat were extracted from the original TED videos.
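In case it helps, here is a minimal sketch of how the two keys could be filled from the function above; the object name clip and the surrounding wiring are just for illustration, not the actual dataset code:

# hypothetical wiring: `clip` holds self.y / self.sr / self.n and exposes
# the extract_audio_feat method from the snippet above
log_melspec, y_roi = clip.extract_audio_feat(
    video_total_frames, video_start_frame, video_end_frame)
sample = {
    'audio_feat': log_melspec,  # log-mel spectrogram (mels x time), float16
    'audio_raw': y_roi,         # raw waveform segment, float16
}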
Hello Youngwoo,
Thanks for the snippet; I think I follow most of it. How is the variable y obtained from the video file?
You can download the audio tracks from the YouTube videos, and the variable y comes from loading the audio file:
y, sr = librosa.load('test.mp3', mono=True, sr=16000, res_type='kaiser_fast')
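For illustration, a rough sketch of how y, sr, and n would be set up for the snippet above (the AudioClip wrapper here is just for illustration, not the actual dataset code):

import librosa

class AudioClip:
    def __init__(self, audio_path):
        # load the full audio track once: y is the 1-D waveform, sr the sample rate
        self.y, self.sr = librosa.load(
            audio_path, mono=True, sr=16000, res_type='kaiser_fast')
        self.n = len(self.y)  # total number of audio samples

    # extract_audio_feat(...) from the earlier snippet would be a method here

clip = AudioClip('test.mp3')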
Ah fantastic, thanks a lot!
Hello, thanks for your response to issue #18. I went through the processing code for the YouTube Gesture Dataset (https://github.com/youngwoo-yoon/youtube-gesture-dataset). I found the processing code for most of the modalities, but I did not find the processing code for the audio features. Could you also provide some pointers for obtaining the audio features? Specifically, how are the values for the keys 'audio_raw' and 'audio_feat' in the video dictionary (stored in the .mdb files) obtained from sound files, e.g., .wav?
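(For context, this is roughly how I imagine one such stored entry would be read back; the key format and serializer below are placeholders, not necessarily what the dataset scripts actually use:)

import lmdb
import pickle

env = lmdb.open('path/to/lmdb_dir', readonly=True, lock=False)
with env.begin() as txn:
    buf = txn.get('000000'.encode('ascii'))  # placeholder key format
    video = pickle.loads(buf)                # placeholder serializer
    print(video['audio_raw'].shape, video['audio_feat'].shape)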