Open shawnbzhang opened 3 years ago
Audio2Mel does the following to extract the mel spectrogram:
```python
data, sampling_rate = load(full_path, sr=self.sampling_rate)
data = 0.95 * normalize(data)
if self.augment:
    amplitude = np.random.uniform(low=0.3, high=1.0)
    data = data * amplitude
return torch.from_numpy(data).float(), sampling_rate
```
which is forwarded as:
```python
def forward(self, audio):
    p = (self.n_fft - self.hop_length) // 2
    audio = F.pad(audio, (p, p), "reflect").squeeze(1)
    fft = torch.stft(
        audio,
        n_fft=self.n_fft,
        hop_length=self.hop_length,
        win_length=self.win_length,
        window=self.window,
        center=False,
    )
    real_part, imag_part = fft.unbind(-1)
    magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
    mel_output = torch.matmul(self.mel_basis, magnitude)
    log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))
    return log_mel_spec
```
Is there a benefit of doing this over Torchaudio's mel spectrogram function, e.g.:
```python
data, sampling_rate = torchaudio.load(full_path)
melspec_ops = torchaudio.transforms.MelSpectrogram(
    sample_rate=sampling_rate,
    n_fft=self.n_fft,
    win_length=self.win_length,
    hop_length=self.hop_length,
    f_min=0,
    f_max=None,
    n_mels=self.n_mel_channels,
)
mel_spec = melspec_ops(data)
log_mel_spec = torch.log10(mel_spec + 1e-9)
return log_mel_spec
```
I'm just curious about this design choice; it wasn't really touched on in the paper.
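One concrete difference I noticed while comparing the two (a minimal pure-torch sketch; the `n_fft=1024, hop_length=256` values are assumed for illustration, and I use `return_complex=True`, which the repo's older `unbind(-1)` code predates): Audio2Mel's manual reflect padding of `(n_fft - hop_length) // 2` with `center=False` yields exactly `len(audio) // hop_length` frames, whereas the default `center=True` behavior (which `torchaudio.transforms.MelSpectrogram` uses) pads by `n_fft // 2` and produces one extra frame.

```python
import torch
import torch.nn.functional as F

# Assumed example parameters, not the repo's configuration.
n_fft, hop, win = 1024, 256, 1024
audio = torch.randn(1, 8192)
window = torch.hann_window(win)

# Audio2Mel-style: manual reflect padding, then center=False.
# Padded length = 8192 + 2 * 384 = 8960, so the frame count is
# (8960 - 1024) // 256 + 1 = 32 == 8192 // 256.
p = (n_fft - hop) // 2
padded = F.pad(audio, (p, p), "reflect")
fft = torch.stft(padded, n_fft=n_fft, hop_length=hop, win_length=win,
                 window=window, center=False, return_complex=True)

# Default centered STFT: pads by n_fft // 2 on each side, giving
# 8192 // 256 + 1 = 33 frames, i.e. one extra frame.
fft_centered = torch.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win,
                          window=window, center=True, return_complex=True)

print(fft.shape[-1], fft_centered.shape[-1])  # → 32 33
```

Getting exactly `len(audio) // hop_length` frames keeps the mel spectrogram aligned with the generator's total upsampling factor. There are also amplitude differences to be aware of: `MelSpectrogram` defaults to `power=2.0` (a power spectrogram) and an HTK-style mel scale, whereas Audio2Mel takes the plain magnitude and uses librosa-style (Slaney) mel filters.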
Side question: Why do you multiply the normalized waveform by 0.95 in the original method?
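My own guess is that the 0.95 is headroom: after peak normalization the waveform touches ±1.0 exactly, and scaling it down leaves a small margin so downstream processing (e.g. the random amplitude augmentation) is less likely to clip. A minimal numpy sketch of the idea (the function name is mine, not the repo's):

```python
import numpy as np

def normalize_with_headroom(data, headroom=0.95):
    # Peak-normalize to +/-1.0, then scale down to leave headroom
    # (0.95 corresponds to roughly 0.45 dB below full scale).
    peak = np.abs(data).max()
    return headroom * (data / peak)

x = np.array([0.2, -0.5, 0.4])
y = normalize_with_headroom(x)
print(np.abs(y).max())  # → 0.95
```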
Can you share the full Audio2Mel code?