evonneng / learning2listen

Official PyTorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Why exactly 4T in extracting Mels? #2

Closed: Daksitha closed this issue 2 years ago

Daksitha commented 2 years ago

I am creating my own dataset. It is mentioned that librosa.feature.melspectrogram(...) should be used to process the speaker's audio into this format: (1 x 4T x 128). I could not find the reason why you use exactly 4 times T. Is there a way to calculate this number?

Your dataset contains videos at 30 fps, but I could not find details about the audio sampling rate. Did you set hop_length = 22050 (sr) / (30 (fps) * 4)? Or did you use the default hop_length in:

```python
librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='constant', power=2.0, **kwargs)
```

Did you choose 4T so that the frames overlap, to reduce the spectral leakage of the trimmed windows?
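
For concreteness, here is a quick sketch of that hop-length arithmetic (illustrative values only; the "aligned" hop below is one reading of the formula above):

```python
sr, fps = 22050, 30
# 4 audio features per video frame would need hop_length ~ sr / (fps * 4)
hop_aligned = sr // (fps * 4)   # 183 samples -> ~120 features per second
hop_default = 512               # librosa default -> ~43 features per second
print(sr / hop_aligned, sr / hop_default)  # ~120.5 vs ~43.1
```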

evonneng commented 2 years ago

Yes, we chose 4T to allow for temporal alignment with the 30 fps framerate of the videos, just to make it easier to process both the audio and the video frames in a unified way. The T here refers to the number of frames in the video clip, so for the purposes of this paper, T = 64. The exact code used to calculate the melspecs is as follows:

```python
import librosa
import numpy as np
from PIL import Image

def load_mfcc(audio_path, num_frames):
    # load audio at a fixed 16 kHz sampling rate
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025*sample_rate)   # 25 ms window -> 400 samples
    hop_len = int(0.010*sample_rate)   # 10 ms hop    -> 160 samples
    # next power of two above the window length (not passed to the mfcc call below)
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)

    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames*4
    # stretch the time axis to exactly 4 audio features per video frame
    # (Image.ANTIALIAS was renamed Image.LANCZOS in newer Pillow)
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
```
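
For intuition, here is a quick sanity check of the arithmetic (a sketch; `speaker.wav` is a placeholder path, and T = 64 as in the paper):

```python
# 64 frames / 30 fps ~ 2.133 s of audio; at sr = 16000 with hop_len = 160,
# librosa produces ~214 time steps, which the resize above stretches to
# exactly 4 * T = 256 columns, i.e. 4 audio features per video frame.
S = load_mfcc('speaker.wav', num_frames=64)  # placeholder path
print(S.shape[1])  # 256 == 4 * 64
```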

Hope this helps!

[Update]: Thanks, y'all, for pointing out that the above was for extracting MFCCs. To calculate the mel spectrogram instead, and to get the feature dimension to be 128, you can set n_mels=128 in the call to librosa.feature.melspectrogram:

```python
import librosa
import numpy as np
from PIL import Image

def load_melspec(audio_path, num_frames):
    # load audio at a fixed 16 kHz sampling rate
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025*sample_rate)   # 25 ms window -> 400 samples
    hop_len = int(0.010*sample_rate)   # 10 ms hop    -> 160 samples
    # next power of two above the window length
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))
    S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, hop_length=hop_len,
                                          win_length=win_len, n_fft=fft_len, n_mels=128)

    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size   # feature_dim == n_mels == 128
    scale_four = num_frames*4
    # stretch the time axis to exactly 4 audio features per video frame
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB
```
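
As a quick shape check (again a sketch, with a placeholder path and T = 64), this yields the (1 x 4T x 128) layout mentioned above:

```python
S = load_melspec('speaker.wav', num_frames=64)  # placeholder path
print(S.shape)          # (128, 256): n_mels x 4T
audio_feat = S.T[None]  # (1, 256, 128), i.e. (1 x 4T x 128)
```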
Daksitha commented 2 years ago

@evonneng Thank you so much! This indeed helped me clarify my doubts. Thanks for your great work and lovely codebase!

sumeromer commented 1 year ago

Hi @evonneng, I am training your models on a custom dataset, and I wanted to share this just for clarification: in the paper and the repository's shared data files, you used mel spectrograms (feature_dim = 128), but the script you shared here uses MFCC features. Do you have a specific reason for changing it to MFCC here?

```python
S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, hop_length=hop_len)  # hop_length=160
```
evonneng commented 9 months ago

Sorry, I copied and pasted the wrong code. Thanks for pointing that out! I updated the above comment with the correct version of the snippet.