Open nguyenntt97 opened 1 year ago
Hi, I would like to know whether you have figured this out by now. I tried to produce audio features of shape (1, 4T, 128) as a Mel spectrogram by setting specific hop_length and n_mels parameters in the librosa.feature.melspectrogram function.
Running the provided script to produce MFCC features gives an output of shape (1, 4T, 20) instead of (1, 4T, 128). If you understand why MFCC was used and why the features are 20-dimensional, please let me know :)
Hello! Thank you for your interest in our work! I've updated the comment in #2 with the exact audio extraction method. Please let me know if that helps. In general, I applied the same type of scaling and so on as in the MFCC code, but swapped that line for the melspec code.
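As a rough illustration of where the two shapes come from: the time axis is set by hop_length, while the last axis is the number of mel bands (librosa.feature.melspectrogram defaults to n_mels=128) or the number of cepstral coefficients (librosa.feature.mfcc defaults to n_mfcc=20). The sketch below only does the shape arithmetic in plain Python. The sample rate, video frame rate, and T are hypothetical values chosen for the example, not values confirmed for this dataset, and the helper names are made up for illustration.

```python
def num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT (librosa's default, center=True)."""
    return 1 + n_samples // hop_length


def hop_for_rate(sample_rate: int, video_fps: int, audio_per_video: int = 4) -> int:
    """Hop length that yields `audio_per_video` audio frames per video frame."""
    frames_per_second = video_fps * audio_per_video
    if sample_rate % frames_per_second:
        raise ValueError("sample_rate is not divisible by the target frame rate")
    return sample_rate // frames_per_second


# Hypothetical values for illustration only (not taken from the dataset).
SAMPLE_RATE = 48000
VIDEO_FPS = 30
T = 64  # number of video frames in the window

hop = hop_for_rate(SAMPLE_RATE, VIDEO_FPS)
n_samples = T * SAMPLE_RATE // VIDEO_FPS
frames = num_frames(n_samples, hop)  # one more than 4T because of centering

# Dropping the extra trailing frame gives exactly 4T time steps.
mel_shape = (1, frames - 1, 128)   # n_mels=128 (melspectrogram default)
mfcc_shape = (1, frames - 1, 20)   # n_mfcc=20 (mfcc default)

print(hop, frames, mel_shape, mfcc_shape)
```

Under these assumptions both feature types share the same 4T time axis, and only the last dimension differs, which matches the (1, 4T, 20) versus (1, 4T, 128) discrepancy described above.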
Problem statement
I am trying to reproduce the audio feature pre-processing for an experiment with a longer time-window sequence, but the only detailed instructions available are in #2. However, the script in that answer seems to extract MFCC features from what appears to be extracted audio, which returns an output with a different shape (1 x 4T x 20) from the audio features in the dataset (1 x 4T x 128).
Issue reproduction
My snippet on Google Colab can be found HERE
I also tried extracting the Mel spectrogram directly, and even combined it with librosa's power_to_db, but the scale of my output still did not match the original dataset.
Below are the expected output and the outputs of the Mel spectrogram function before and after power_to_db. I extracted them from the same video file, done_conan_videos0/021LukeWilsonIsStartingToLookLikeChrisGainesCONANonTBS, from the raw dataset, based on the metadata provided by *_files_clean*. I assumed that correct preprocessing would produce the same output as your original dataset.
My question
May I ask for your advice on how to extract the audio features in detail so that the dataset can be reproduced? I believe other readers have the same question; see THIS
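For anyone comparing scales, here is a minimal pure-Python sketch of the decibel conversion that, to my understanding, librosa's power_to_db performs: 10 * log10 of the power, shifted by a reference value and clipped to top_db below the peak. It is a simplified re-implementation for illustration, not librosa's code, and it says nothing about which settings the dataset authors actually used. It mainly shows that the choice of ref shifts every value by a constant, which is one possible source of a scale mismatch.

```python
import math


def power_to_db(power, ref=1.0, amin=1e-10, top_db=80.0):
    """Sketch of a power-to-dB conversion for a 2-D list of power values."""
    offset = 10.0 * math.log10(max(amin, ref))
    db = [[10.0 * math.log10(max(amin, p)) - offset for p in row]
          for row in power]
    if top_db is not None:
        # Clip everything more than top_db below the loudest bin.
        floor = max(max(row) for row in db) - top_db
        db = [[max(v, floor) for v in row] for row in db]
    return db


# Toy power spectrogram (made-up numbers).
spec = [[1.0, 0.1],
        [1e-12, 100.0]]

# Reference of 1.0: values are absolute dB.
db_abs = power_to_db(spec, ref=1.0)

# Reference equal to the peak power: the loudest bin becomes 0 dB.
peak = max(max(row) for row in spec)
db_rel = power_to_db(spec, ref=peak)

print(db_abs)
print(db_rel)
```

The two results differ by a constant offset of 10 * log10(peak), so two pipelines that agree on everything except ref will produce spectrograms that look alike but sit on different scales.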