evonneng / learning2listen

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

RE. Unable to reproduce audio 128-D mel spectrogram feature from raw video #13

Open nguyenntt97 opened 1 year ago

nguyenntt97 commented 1 year ago

Problem statement

I am trying to reproduce the audio feature pre-processing for an experiment with a longer time window, but the only detailed instructions available are in #2. In that answer, however, the script extracts MFCC features from the extracted audio and returns an output with a different shape (1 x 4T x 20) than the audio feature in the dataset (1 x 4T x 128).

Issue reproduction

My snippet on Google Colab can be found HERE

Yes, we chose 4T to allow for temporal alignment with the 30fps framerate of the videos, just to make it easier to process both the audio and the video frames in a unified way. The T here refers to the number of frames in the video clip. So for the purposes of this paper, T=64. The exact code used to calculate the melspecs is as follows:

import librosa
import numpy as np
from PIL import Image

def load_mfcc(audio_path, num_frames):
    waveform, sample_rate = librosa.load('{}'.format(audio_path), sr=16000)
    win_len = int(0.025*sample_rate)   # 25 ms window (not passed to librosa below)
    hop_len = int(0.010*sample_rate)   # 10 ms hop
    fft_len = 2 ** int(np.ceil(np.log(win_len) / np.log(2.0)))   # computed but unused in this snippet
    S_dB = librosa.feature.mfcc(y=waveform, sr=sample_rate, hop_length=hop_len)   # by default this extracts only 20 MFCCs

    ## do some resizing to match frame rate
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    scale_four = num_frames*4          # 4 audio features per video frame at 30 fps
    im = im.resize((scale_four, feature_dim), Image.ANTIALIAS)
    S_dB = np.array(im)
    return S_dB

Hope this helps!
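For reference, a minimal check of the output shape from the snippet above (a sketch, assuming a 64-frame clip and a placeholder extracted_audio.wav path):

feats = load_mfcc('extracted_audio.wav', num_frames=64)   # placeholder path, T=64
print(feats.shape)   # (20, 256): 20 MFCCs x 4T time steps, i.e. (1, 4T, 20) once batched, not 128-D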

I also tried to extract the Mel spectrogram directly, and even combined it with librosa's power_to_db, but the scale of my output still did not match the original dataset.

S_dB = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)   # defaults: n_mels=128, hop_length=512
# optional conversion from power to decibels (default reference ref=1.0)
# S_dB = librosa.power_to_db(S_dB)

Below are my outputs from the Mel spectrogram function before and after power_to_db, together with the expected values. I extracted them from the same video file done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS in the raw dataset, based on the metadata provided by *_files_clean*. I assumed that correct preprocessing would produce the same output as your original dataset.

My output

array([3.49407317e-04, 1.72899290e-05, 9.88764441e-06, 9.31489740e-06, 2.19979029e-05, 4.02382248e-05, 5.83300316e-05, 1.78770599e-04,

After power_to_db

array([-34.508053, -46.10779 , -48.621204, -49.872578, -46.910652, -42.93151 , -41.84772 , -37.57675 , -38.1189 , -38.486935,

Here's the dataset

array([[-50.593018, -47.35103 , -45.426086, -41.643738, -42.111137, -41.75349 , -41.146526, -38.722565, -39.55792 , -39.344612,

My question

May I ask your advice on how to extract the audio features in detail so that the dataset can be reproduced? I believe other readers share the same question; see THIS

Eric-Gty commented 12 months ago

Hi, I would like to know if you have figured this out yet. I've tried to produce the audio features as a (1, 4T, 128) Mel spectrogram by setting specific hop_length and n_mels parameters in librosa.feature.melspectrogram.

Running the provided script produces MFCC features of shape (1, 4T, 20) rather than 128. If you have any insight into why MFCCs were used and why the feature is 20-dimensional, please let me know :)

evonneng commented 9 months ago

Hello! Thank you for your interest in our work! I've updated the comment in #2 with the exact audio extraction method. Please let me know if that helps. In general, I applied the same type of scaling as in the MFCC code, but swapped that line with the melspec code.
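For anyone following along, here is a minimal sketch of what that swap might look like, reusing the resizing logic from load_mfcc above (n_mels=128 is an assumption chosen to match the dataset's 128-D features; the exact parameters used for the released data are not confirmed in this thread):

import librosa
import numpy as np
from PIL import Image

def load_melspec(audio_path, num_frames):
    # Same loading and hop settings as load_mfcc above.
    waveform, sample_rate = librosa.load(audio_path, sr=16000)
    hop_len = int(0.010*sample_rate)

    # Swap the MFCC line for a 128-bin Mel spectrogram converted to dB
    # (n_mels=128 is an assumption matching the 128-D dataset features).
    S = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                       hop_length=hop_len, n_mels=128)
    S_dB = librosa.power_to_db(S)

    # Same resizing as load_mfcc: 4 audio features per video frame (30 fps).
    im = Image.fromarray(S_dB)
    _, feature_dim = im.size
    im = im.resize((num_frames*4, feature_dim), Image.ANTIALIAS)
    return np.array(im)   # shape: (128, 4*num_frames)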