facebookresearch / encodec

State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
MIT License

Reconstruction Loss #32

Open AndreyBocharnikov opened 1 year ago

AndreyBocharnikov commented 1 year ago

❓ Questions

Hello, my question is about the reconstruction loss in the frequency domain. Section 3.4 of the paper states that you use a "mel-spectrogram using a normalized STFT". What type of normalization is meant here? Is it sufficient to set the normalized flag of torchaudio.transforms.MelSpectrogram, which normalizes "by magnitude after stft"?
Also, in practice the STFT loss is sometimes computed on the log mel-spectrogram for better convergence, so I want to clarify: in your implementation, is S_i from formula 1 a mel-spectrogram or a log mel-spectrogram?

AndreyBocharnikov commented 1 year ago

Clarification for the first question: because the spectrogram here is normalized via that argument, I think the answer to my question is yes.
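For reference, a minimal sketch of what that flag appears to do. It assumes that torchaudio, with normalized=True, divides the STFT by the L2 norm of the analysis window, and that the window is the default periodic Hann window with win_length equal to n_fft. The helper names and the list of window sizes are illustrative only, and whether this matches the normalization meant in the paper is exactly the open question above.

```python
import math

def periodic_hann(n):
    # Periodic Hann window of length n (the convention assumed here).
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]

def window_norm(n):
    # L2 norm of the window: the assumed divisor applied to the STFT.
    return math.sqrt(sum(w * w for w in periodic_hann(n)))

# Illustrative window sizes; the divisor grows with the window length,
# so the normalization rescales each scale of a multi-scale loss differently.
norms = {n: window_norm(n) for n in (32, 64, 128, 256, 512, 1024, 2048)}

for n, value in norms.items():
    print(f"win_length={n:5d}  divisor={value:.3f}")
```

For a periodic Hann window the sum of squared samples is 3n/8, so the divisor scales with the square root of the window length.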

AndreyBocharnikov commented 1 year ago

Clarification for the second question: formula (5) in the SoundStream paper takes the log of the mel-spectrogram when computing the L2 part of the multi-scale spectral reconstruction loss, so the question remains: did you remove it on purpose?
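To make the practical difference concrete, here is a toy sketch (not the EnCodec or SoundStream implementation) comparing a squared error on linear and on log mel magnitudes. The two-band "spectrograms", the eps value, and the function names are made up for illustration.

```python
import math

EPS = 1e-5  # assumed floor to keep log() finite; not taken from either paper

def sq_err_linear(target, estimate):
    # Sum of squared differences on raw mel magnitudes.
    return sum((t - e) ** 2 for t, e in zip(target, estimate))

def sq_err_log(target, estimate):
    # Sum of squared differences on log mel magnitudes.
    return sum((math.log(t + EPS) - math.log(e + EPS)) ** 2
               for t, e in zip(target, estimate))

target = [100.0, 0.01]        # one loud band, one quiet band
loud_off = [101.0, 0.01]      # loud band off by 1%
quiet_off = [100.0, 0.1]      # quiet band off by a factor of 10

lin_loud, lin_quiet = sq_err_linear(target, loud_off), sq_err_linear(target, quiet_off)
log_loud, log_quiet = sq_err_log(target, loud_off), sq_err_log(target, quiet_off)

print("linear:", lin_loud, lin_quiet)
print("log:   ", log_loud, log_quiet)
```

On linear magnitudes the small relative error in the loud band dominates, while on log magnitudes the tenfold error in the quiet band dominates, so dropping the log changes which errors the L2 term emphasizes.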

AndreyBocharnikov commented 1 year ago

One more question about the multi-scale spectral reconstruction loss: when constructing MelSpectrogram with n_mels = 64 and a window size below 512, I get the following warning (four times):

/opt/conda/lib/python3.10/site-packages/torchaudio/functional/functional.py:539: UserWarning: At least one mel filterbank has all zero values. The value for n_mels (64) may be set too high. Or, the value for n_freqs (65) may be set too low.

Is this expected behavior, and should I leave the loss as it is?
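The warning can be reproduced without torchaudio. The sketch below counts triangular mel filters that contain no FFT bin. It assumes a 24 kHz sample rate, the HTK mel scale, f_min = 0, f_max = sample_rate / 2, and window sizes 2^5 to 2^11 with n_fft equal to the window size; these are assumptions rather than confirmed settings of the authors' implementation, and the function names are hypothetical.

```python
import math

def hz_to_mel(f):
    # HTK mel scale (assumed).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def count_empty_mel_filters(n_fft, n_mels=64, sample_rate=24000):
    """Return (n_freqs, number of mel filters with no FFT bin inside)."""
    n_freqs = n_fft // 2 + 1
    f_max = sample_rate / 2
    # Centre frequencies of the one-sided FFT bins.
    bins = [f_max * k / (n_freqs - 1) for k in range(n_freqs)]
    # n_mels + 2 band edges, equally spaced on the mel scale.
    m_max = hz_to_mel(f_max)
    edges = [mel_to_hz(m_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    empty = 0
    for m in range(n_mels):
        lo, hi = edges[m], edges[m + 2]
        # A triangular filter is non-zero only strictly between its edges.
        if not any(lo < f < hi for f in bins):
            empty += 1
    return n_freqs, empty

WINDOW_SIZES = [2 ** i for i in range(5, 12)]  # 32 ... 2048 (assumed)
counts = {w: count_empty_mel_filters(w) for w in WINDOW_SIZES}

for w, (n_freqs, empty) in counts.items():
    print(f"n_fft={w:5d}  n_freqs={n_freqs:5d}  empty_filters={empty}")
```

Under these assumptions the four window sizes below 512 each leave at least one filter empty, which is consistent with the four warnings, and n_fft = 128 gives the n_freqs = 65 quoted in the message. The narrow low-frequency filters simply fall between FFT bins when the window is short.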