Closed — Daksitha closed this issue 2 years ago
Yes, we chose 4T to allow for temporal alignment with the 30 fps frame rate of the videos, just to make it easier to process the audio and the video frames in a unified way. T here refers to the number of frames in the video clip, so for the purposes of this paper, T=64. The exact code used to calculate the melspecs is as follows:
```python
import librosa
import numpy as np
from PIL import Image

def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    win_len = int(0.025 * sample_rate)   # 25 ms window
    hop_len = int(0.010 * sample_rate)   # 10 ms hop
    # next power of two >= win_len
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate,
                                hop_length=hop_len, win_length=win_len,
                                n_fft=fft_len)
    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames * 4
    # Image.ANTIALIAS was renamed Image.LANCZOS in newer Pillow versions
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
```
Hope this helps!
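To make the alignment concrete, here is a small sanity check with my own arithmetic, assuming the T=64 / 30 fps / 16 kHz / 10 ms-hop setup described above: a clip is about 2.13 s of audio, giving roughly 213 raw spectrogram frames, which the resize step stretches to exactly 4T = 256 columns, i.e. 4 audio feature frames per video frame.

```python
# Sanity check of the 4T alignment, using the constants from the snippet above.
T = 64                      # video frames per clip
fps = 30                    # video frame rate
sr = 16000                  # audio sampling rate
hop_len = int(0.010 * sr)   # 10 ms hop -> 160 samples

clip_seconds = T / fps                      # seconds of audio per clip
raw_frames = clip_seconds * sr / hop_len    # raw spectrogram time frames
target_frames = 4 * T                       # columns after the resize

print(round(clip_seconds, 3))   # 2.133
print(round(raw_frames, 1))     # 213.3
print(target_frames)            # 256
```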
[Update]:
Thank y'all for pointing out that the above was for extracting MFCCs. To calculate the mel spectrogram, and to get the feature dimension to be 128, you can set n_mels=128 in the call to librosa.feature.melspectrogram:
```python
import librosa
import numpy as np
from PIL import Image

def load_melspec(audio_path, num_frames):
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    win_len = int(0.025 * sample_rate)   # 25 ms window
    hop_len = int(0.010 * sample_rate)   # 10 ms hop
    # next power of two >= win_len
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                          hop_length=hop_len, win_length=win_len,
                                          n_fft=fft_len, n_mels=128)
    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames * 4
    # Image.ANTIALIAS was renamed Image.LANCZOS in newer Pillow versions
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
```
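If you want to verify the resize step without loading an audio file, the sketch below fakes a mel spectrogram and stretches it along the time axis to 4T columns with plain NumPy. This is my own stand-in for the PIL resize (np.interp is linear interpolation, not identical to ANTIALIAS resampling), and the names stretch_time and fake_melspec are hypothetical:

```python
import numpy as np

def stretch_time(S, target_len):
    """Linearly interpolate each mel row to target_len time steps."""
    n_mels, n_frames = S.shape
    old_t = np.linspace(0.0, 1.0, n_frames)
    new_t = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(new_t, old_t, row) for row in S])

num_frames = 64                          # T, video frames in the clip
fake_melspec = np.random.rand(128, 213)  # (n_mels, raw spectrogram frames)
out = stretch_time(fake_melspec, num_frames * 4)
print(out.shape)  # (128, 256)
```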
@evonneng Thank you so much! This indeed helped me clarify my doubts. Thanks for your great work and lovely codebase.
Hi @evonneng, I am training your models on a custom database, and I wanted to ask this just for clarification: in the paper and the repository's shared data files you used mel spectrograms (feature_dim=128), but the script you shared here uses MFCC features. Do you have a specific reason to switch to MFCC here?
```python
S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, hop_length=hop_len)  # hop_length=160
```
Sorry, I copied and pasted the wrong code. Thanks for pointing that out! I updated the above comment with the correct version of the snippet.
I am creating my own dataset. It is mentioned to use librosa.feature.melspectrogram(...) to process the speaker's audio into the format (1 x 4T x 128), but I could not find the reason why exactly you use 4 times T. Is there a way to calculate this number?
Your dataset contains videos at 30 fps, but I could not find details about the audio sampling rate. Did you set hop_length = 22050 (sr) / 30 (fps) * 4? Or did you use the default hop_length in librosa.feature.melspectrogram?
Did you choose 4T so that the frames overlap, to reduce the spectral leakage of the trimmed windows?
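For what it's worth, in the snippet shared above the hop length is fixed at 10 ms (160 samples at 16 kHz) rather than derived from the video frame rate, and the 25 ms analysis windows already overlap regardless of the 4T resize. A quick check of those numbers (my own arithmetic, using only the constants from the snippet):

```python
sr = 16000
win_len = int(0.025 * sr)   # 400 samples (25 ms window)
hop_len = int(0.010 * sr)   # 160 samples (10 ms hop)

overlap = (win_len - hop_len) / win_len
print(win_len, hop_len)     # 400 160
print(overlap)              # 0.6 -> consecutive windows overlap by 60%
```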