cvondrick / soundnet

SoundNet: Learning Sound Representations from Unlabeled Video. NIPS 2016
http://projects.csail.mit.edu/soundnet/
MIT License

size problems for audio classification #21

Open Aidenfaustine opened 2 years ago

Aidenfaustine commented 2 years ago

I am sorry to disturb you. I have some questions about using the pre-trained SoundNet for speech emotion recognition. Could you please give me a hand? Thanks.

Question 1: wav, sr = torchaudio.load(path) reads the audio samples, which are then preprocessed by wav.unsqueeze(1).unsqueeze(-1).repeat(1,1,8,1). What are the requirements for the audio sample rate? Must the sample rate be 22050? Are there any other restrictions?

Question 2: The last layer, nn.Conv2d(1024, 401, kernel_size=(8, 1), stride=(2, 1)), is used to extract speech features. The feature size varies with the length of the audio; what exactly does it depend on? I want to use the features for audio classification. How do I get a constant-dimension feature vector for all of my audio files? As you mentioned, an audio file with 1476864 samples produces a feature of dimension [1x1024x46x1], and a file with 2199168 samples produces a feature of dimension [1x1024x68x1]. In [1x1024x46x1], the first 1 is the batch size and 1024 is channel_out; what does 46 represent? What does the last dimension, 1, represent?

Question 3: How do I get a constant-dimension feature vector for both files? Finally, for classification, what do I need to do with the output features of shape (1, 401, feature, 1) so that I can use them in the final classification task? Would flattening to (batch, channel_out * 1, feature) be better, or averaging over the channels, or some other method?

PS: I am new to audio and DL, so sorry for asking basic questions. Thanks, best

mayqinxu commented 2 years ago

Hi, I think you can check out the main_train.lua file. It has some annotations that may help you. main_train also mentions that the audio clips last 20 s and the sample rate is 22050.

The lua files in the data folder also contain all the data preprocessing steps; you can follow that code to preprocess your data.

For Question 2, 46 is the 'length' (the time axis). The difference between 46 and 68 comes from the different input lengths (1476864 and 2199168 samples). As for the last dimension, I think it represents mono sound, or it may represent nothing and simply be there so that the audio data meets the nn module's requirement of a 4-dimensional input.
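
To make the dependence on input length concrete, here is a minimal sketch that applies the standard convolution/pooling output-length formula layer by layer. The kernel, stride, and padding values in LAYERS are an assumption (they are the values commonly used for the 8-layer SoundNet up to conv7 and are not stated in this thread), and conv7_length and mean_over_time are hypothetical helper names. With these assumed values the sketch reproduces the 46 and 68 quoted above. mean_over_time shows one common way, not necessarily the authors' way, to get a fixed-size vector: average each channel over the time axis.

```python
# Assumed hyperparameters along the time axis: (name, kernel, stride, padding).
# These values are NOT given in this thread; check them against your own model.
LAYERS = [
    ("conv1", 64, 2, 32), ("pool1", 8, 8, 0),
    ("conv2", 32, 2, 16), ("pool2", 8, 8, 0),
    ("conv3", 16, 2, 8),
    ("conv4", 8, 2, 4),
    ("conv5", 4, 2, 2), ("pool5", 4, 4, 0),
    ("conv6", 4, 2, 2),
    ("conv7", 4, 2, 2),
]


def out_len(n, kernel, stride, padding):
    """Standard output length of a 1-D convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def conv7_length(num_samples):
    """Length of the time axis after conv7 for a clip of num_samples samples."""
    n = num_samples
    for _name, kernel, stride, padding in LAYERS:
        n = out_len(n, kernel, stride, padding)
    return n


def mean_over_time(feature):
    """Average each channel over time.

    feature is a list of channels, each a list of time steps, so the result
    has one value per channel regardless of how long the clip was.
    """
    return [sum(channel) / len(channel) for channel in feature]


if __name__ == "__main__":
    print(conv7_length(1476864))  # 46
    print(conv7_length(2199168))  # 68
```

Averaging over time discards temporal order, so whether it works well enough for emotion recognition is something to check on your own data.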