MTG / mtg-jamendo-dataset

Metadata, scripts and baselines for the MTG-Jamendo dataset
Apache License 2.0

Mel-spectrum of 29.1s segment #14

Closed annahung31 closed 5 years ago

annahung31 commented 5 years ago

Hi, thanks for the effort. I tried to use the mel-spectrograms downloaded from Google Drive to run the baseline, but found that the downloaded files cover the full songs. So I tried to run scripts/melspectrograms.py to get the mel-spectrogram of a 29.1s segment. However, I keep getting the error below:

RuntimeError: Error while configuring MelBands: Parameter normalize = "unit_tri" is not within specified range: {unit_sum,unit_max}

May I ask what I missed? Thanks for the help.

annahung31 commented 5 years ago

I found that the error means I have to change the 'normalize' argument from unit_tri to unit_sum or unit_max. I solved this, but then encountered another error:

RuntimeError: Error while configuring MelBands: TriangularBands: the number of spectrum bins is insufficient for the specified number of triangular bands. Use zero padding to increase the number of FFT bins.

I wonder if I am using scripts/melspectrograms.py in the wrong way?

annahung31 commented 5 years ago

I found that I should change the parameter zeroPadding=0 to 512, which equals frameSize. That solved the problem, and I get a mel-spectrogram with shape (96, 1366), which is the shape indicated in the paper. Is that how you did it for the baseline experiment?
Thanks!
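
In case it helps, here is a minimal sketch of what I ended up with, using essentia.standard. The 12 kHz sample rate, the 256-sample hop and the log compression are my own assumptions to get something close to the (96, 1366) shape, not necessarily what scripts/melspectrograms.py uses:

import numpy as np
import essentia.standard as es

audio = es.MonoLoader(filename='track.mp3', sampleRate=12000)()   # hypothetical input file

window = es.Windowing(type='hann', zeroPadding=512)               # pad 512-sample frames to a 1024-point FFT
spectrum = es.Spectrum()
melbands = es.MelBands(numberBands=96, sampleRate=12000,
                       inputSize=513,                             # 1024 / 2 + 1 spectrum bins
                       normalize='unit_sum')                      # 'unit_tri' only works on newer Essentia

frames = [melbands(spectrum(window(frame)))
          for frame in es.FrameGenerator(audio, frameSize=512, hopSize=256, startFromZero=True)]

mel = np.log10(1.0 + 10000.0 * np.array(frames).T)                # illustrative log compression, shape (96, n_frames)
print(mel.shape)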

annahung31 commented 5 years ago

To anyone who might be interested in this problem: in the end I used the original code from keunwoochoi. https://github.com/keunwoochoi/music-auto_tagging-keras/blob/master/audio_processor.py
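
Roughly, if I read it correctly, that script does the following (rewritten here against the current librosa API, since the original uses the older logamplitude call; the 12 kHz / 256-hop constants are my reading of it):

import librosa
import numpy as np

SR, N_FFT, HOP, N_MELS, DURA = 12000, 512, 256, 96, 29.12        # ~1366 frames per segment

def compute_melgram(audio_path):
    y, _ = librosa.load(audio_path, sr=SR)
    n_fixed = int(DURA * SR)
    if len(y) < n_fixed:                                          # pad short clips with zeros
        y = np.hstack([y, np.zeros(n_fixed - len(y))])
    else:                                                         # center-crop longer clips
        start = (len(y) - n_fixed) // 2
        y = y[start:start + n_fixed]
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)
    return librosa.power_to_db(mel)                               # shape (96, 1366)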

rfalcon100 commented 5 years ago

To avoid recomputing all spectrograms, I made a small change to the dataset class, so that every spectrogram is now cropped to the desired shape of (96, 1366). This is probably not the best way to do it, but it works. The __getitem__ method looks like:

def __getitem__(self, index):
    # assumes the usual imports: os, numpy as np, torchvision.transforms as transforms
    fn = os.path.join(self.root, 'data/raw_30s_specs/', self.dictionary[index]['path'][:-3]+'npy')
    audio = np.array(np.load(fn)).astype('float32')
    tags = self.dictionary[index]['tags']

    # Transforms: center-crop every full-length spectrogram to (96, 1366)
    self.transform = transforms.Compose([
        transforms.ToPILImage(),
        transforms.CenterCrop((96, 1366)),
        transforms.ToTensor(),
    ])

    if self.transform:
        audio = self.transform(audio)

    return audio, tags.astype('float32')
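
As a quick check on a dummy array (a rough sketch, assuming the .npy files load as 2-D float32 spectrograms of shape (96, n_frames)), the transform already adds the channel dimension:

import numpy as np
from torchvision import transforms

dummy = np.random.rand(96, 1400).astype('float32')   # stand-in for one loaded .npy file
tf = transforms.Compose([
    transforms.ToPILImage(),                          # 2-D float array becomes a mode-'F' PIL image
    transforms.CenterCrop((96, 1366)),
    transforms.ToTensor(),                            # adds the channel dimension
])
print(tf(dummy).shape)                                # torch.Size([1, 96, 1366])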

There is another change needed in the model, because the batches now already have a channel dimension, i.e. shape (batch, channels, height, width), so there is no need to unsqueeze.

    def forward(self, x):
        # x = x.unsqueeze(1)  # not needed anymore: ToTensor already adds the channel dimension

        # init bn
        x = self.bn_init(x)

        # layer 1
        x = self.mp_1(nn.ELU()(self.bn_1(self.conv_1(x))))
        # layer 2
        x = self.mp_2(nn.ELU()(self.bn_2(self.conv_2(x))))
        # layer 3
        x = self.mp_3(nn.ELU()(self.bn_3(self.conv_3(x))))
        # layer 4
        x = self.mp_4(nn.ELU()(self.bn_4(self.conv_4(x))))
        # layer 5
        x = self.mp_5(nn.ELU()(self.bn_5(self.conv_5(x))))

        # classifier
        x = x.view(x.size(0), -1)
        x = self.dropout(x)
        logit = nn.Sigmoid()(self.dense(x))

        return logit
dbogdanov commented 5 years ago

@annahung31 We have updated our PyPI wheels with the newest version of Essentia. Install or upgrade to the latest Essentia from pip and you should be able to run the spectrogram extraction code without a problem.
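
For example, after pip install --upgrade essentia you can quickly check that the new wheel accepts the unit_tri normalization used by the script (a rough check; the exact version string may differ):

import essentia
import essentia.standard as es

print(essentia.__version__)                               # should report the latest release
es.MelBands(numberBands=96, normalize='unit_tri')         # raises the RuntimeError above on old wheels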