Open AndreyBocharnikov opened 1 year ago
Clarification on the first question: since the spectrogram is normalized by that argument here, I think the answer to my question is yes.
Clarification on the second question: in formula (5) of the SoundStream paper, the log of the mel-spectrogram is taken when computing the L2 part of the multi-scale spectral reconstruction loss. So the question remains: did you remove it on purpose?
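To make the difference concrete, here is a minimal sketch of a single-scale term with an optional log in the L2 part. It is not taken from either paper or from the repository: `spectral_term`, `alpha`, `use_log` and `eps` are hypothetical names, the inputs are precomputed mel magnitudes given as plain nested lists, and both parts are averaged over bins, which is my own choice of reduction.

```python
import math

def spectral_term(S_ref, S_est, alpha=1.0, use_log=True, eps=1e-5):
    """Mean L1 on mel magnitudes plus alpha * RMS difference on
    (optionally log) mel magnitudes. Hypothetical helper for illustration."""
    ref = [v for frame in S_ref for v in frame]
    est = [v for frame in S_est for v in frame]
    n = len(ref)

    # L1 part: always computed on the raw mel magnitudes.
    l1 = sum(abs(a - b) for a, b in zip(ref, est)) / n

    # L2 part: optionally computed on log magnitudes (eps avoids log(0)).
    if use_log:
        ref = [math.log(v + eps) for v in ref]
        est = [math.log(v + eps) for v in est]
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, est)) / n)

    return l1 + alpha * l2

# One frame with two mel bins: the estimate doubles the energy of the quiet bin.
S_ref = [[1e-3, 1.0]]
S_est = [[2e-3, 1.0]]
with_log = spectral_term(S_ref, S_est, use_log=True)
without_log = spectral_term(S_ref, S_est, use_log=False)
print(with_log, without_log)
```

With the log, an error in a low-energy bin contributes far more to the loss than without it, which is why I am asking whether dropping it was intentional.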
One more question about the multi-scale spectral reconstruction loss: when constructing MelSpectrogram with 64 n_mels and window_size < 512, I get the following warning (4 of them):
/opt/conda/lib/python3.10/site-packages/torchaudio/functional/functional.py:539: UserWarning: At least one mel filterbank has all zero values. The value for n_mels (64) may be set too high. Or, the value for n_freqs (65) may be set too low.
Is this expected behavior, and should I leave this loss as it is?
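For context, n_freqs (65) in the warning corresponds to n_fft = 128, since n_freqs = n_fft // 2 + 1. The sketch below is a rough pure-Python check of how many triangular mel filters end up with no STFT bin inside them. It is a simplified stand-in and does not call torchaudio: I am assuming the HTK mel formula, f_min = 0, f_max = sample_rate / 2, and a sample rate of 24000 Hz (my assumption, adjust as needed). `count_empty_mel_filters` is a hypothetical helper.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale (assumed).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def count_empty_mel_filters(n_fft, n_mels=64, sample_rate=24000):
    """Count triangular mel filters that contain no STFT frequency bin."""
    n_freqs = n_fft // 2 + 1
    f_max = sample_rate / 2.0

    # Linearly spaced STFT bin centre frequencies from 0 to f_max.
    bin_freqs = [f_max * k / (n_freqs - 1) for k in range(n_freqs)]

    # n_mels + 2 filter edges, equally spaced on the mel scale.
    m_max = hz_to_mel(f_max)
    edges = [mel_to_hz(m_max * i / (n_mels + 1)) for i in range(n_mels + 2)]

    empty = 0
    for i in range(n_mels):
        lo, hi = edges[i], edges[i + 2]
        # A triangular filter is nonzero only strictly between its outer edges.
        if not any(lo < f < hi for f in bin_freqs):
            empty += 1
    return empty

for n_fft in (128, 256, 512, 1024, 2048):
    print(n_fft, n_fft // 2 + 1, count_empty_mel_filters(n_fft))
```

Under these assumptions the lowest mel filters are narrower than the STFT bin spacing for small n_fft, so no bin falls inside them and they come out all zero, which seems to be what the warning reports.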
❓ Questions
Hello, my question is about the reconstruction loss in the frequency domain. Section 3.4 states that you use a "mel-spectrogram using a normalized STFT". What type of normalization is meant here? Is it sufficient to use the normalized flag of torchaudio.transforms.MelSpectrogram, which normalizes "by magnitude after stft"?

Also, in practice the STFT loss is sometimes computed on the log mel-spectrogram for better convergence, so I want to clarify: in your implementation, is S_i from formula (1) a mel-spectrogram or a log mel-spectrogram?