Closed faroit closed 4 years ago
(1) nnAudio contains a trainable Fourier basis. That means the STFT process can be trained together with the model via backpropagation. (I am pushing a new version soon; in it you will get a new argument, trainable=True, so that you can train your STFT.) torch.stft does not support a trainable Fourier basis.
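The mechanism behind a trainable Fourier basis can be sketched in plain PyTorch: the STFT becomes a Conv1d whose kernels are windowed sine/cosine vectors, and a trainable flag simply toggles requires_grad on them. This is an illustrative sketch of the idea, not nnAudio's actual code:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableSTFT(nn.Module):
    """STFT as a 1-D convolution whose kernels are a windowed Fourier
    basis; trainable=True lets backprop update the basis itself."""

    def __init__(self, n_fft=512, hop_length=256, trainable=False):
        super().__init__()
        n = np.arange(n_fft)
        k = np.arange(n_fft // 2 + 1)[:, None]      # positive frequencies only
        window = np.hanning(n_fft)
        cos_k = np.cos(2 * np.pi * k * n / n_fft) * window   # real part
        sin_k = -np.sin(2 * np.pi * k * n / n_fft) * window  # imaginary part
        kernels = np.concatenate([cos_k, sin_k])[:, None, :]
        self.weight = nn.Parameter(torch.tensor(kernels, dtype=torch.float32),
                                   requires_grad=trainable)
        self.hop_length = hop_length
        self.n_bins = n_fft // 2 + 1

    def forward(self, x):                            # x: (batch, samples)
        y = F.conv1d(x[:, None, :], self.weight, stride=self.hop_length)
        real, imag = y[:, :self.n_bins], y[:, self.n_bins:]
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)  # magnitude spectrogram
```

With trainable=False this matches a plain magnitude STFT; with trainable=True the same kernels become ordinary learnable parameters.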
(2) nnAudio has more options for setting the frequency scale (freq_scale='log' gives a logarithmic frequency scale). nnAudio also allows you to set the minimum and maximum frequency of the STFT output (for example fmin=50, fmax=6000). These options are not available in torch.stft.
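To illustrate what a non-linear frequency scale means here, one can compute log-spaced analysis frequencies between fmin and fmax and evaluate Fourier kernels at those frequencies instead of the usual linear grid. This is a sketch of the idea, not nnAudio's exact implementation, and the function names are made up:

```python
import numpy as np

def log_frequency_bins(fmin=50.0, fmax=6000.0, n_bins=257):
    # geometrically spaced centre frequencies from fmin to fmax inclusive
    return fmin * (fmax / fmin) ** (np.arange(n_bins) / (n_bins - 1))

def fourier_kernels_at(freqs_hz, sr=22050, n_fft=2048):
    # one windowed cosine kernel per requested centre frequency (real part only;
    # the sine/imaginary part would be built the same way)
    n = np.arange(n_fft)
    window = np.hanning(n_fft)
    return np.cos(2 * np.pi * freqs_hz[:, None] * n / sr) * window

freqs = log_frequency_bins()
kernels = fourier_kernels_at(freqs)
print(freqs[0], freqs[-1], kernels.shape)
```

Because the kernels are explicit arrays rather than an FFT, the bin frequencies can be placed on any scale and clipped to any [fmin, fmax] range.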
(3) Finally, nnAudio seems to be faster than torch.stft?? (I also have doubts about this; discussions are welcome.) I tried running a script to compare how much time it takes for nnAudio and torch.stft to finish converting 1000 random waveforms to spectrograms, and it seems nnAudio is faster:
https://colab.research.google.com/drive/1MMwpa9nvFlUly3aCnn1to9JeBKfXp94U
I am not sure if my timing method is correct; again, discussions are welcome on this point. Personally, I believe torch.stft should be faster, though.
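One common pitfall when timing GPU code is that CUDA kernels launch asynchronously, so without torch.cuda.synchronize() you may only measure launch overhead rather than the actual compute. A minimal timing sketch (the helper name is mine, not from the notebook above):

```python
import time
import torch

def time_stft(fn, batch, n_runs=10):
    """Wall-clock timing; on the GPU you must synchronize before and
    after, or you only time the asynchronous kernel launches."""
    if batch.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

waves = torch.randn(32, 16000)          # 32 one-second waveforms at 16 kHz
win = torch.hann_window(512)
elapsed = time_stft(lambda x: torch.stft(x, 512, 256, window=win,
                                         return_complex=True), waves)
print(f"torch.stft: {elapsed * 1e3:.2f} ms per batch")
```

A warm-up run before timing is also advisable, since the first call often pays one-off allocation and compilation costs.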
Thanks for the reply. Sorry, my question was maybe not precise enough. Regarding (1), I am interested in the setting trainable=False, where the STFT is just used as a functional operator. In this setting, does it make a difference to the backpropagation algorithm which underlying operations are used?
I see. In that case, you won't get any significant advantage in terms of speed when using nnAudio. (Although when I checked the computation time for torch.stft and nnAudio, nnAudio was somehow faster; I am not sure if there is anything wrong with my timing method.) But if you want more control over the STFT output, then point (2) above is what nnAudio gives you: the output frequency range for torch.stft is always 0 Hz to the Nyquist frequency.
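Since torch.stft always spans 0 Hz to Nyquist, the closest workaround is cropping bins after the fact. This sketch shows that you can only select from the fixed linear grid, not choose an arbitrary fmin/fmax resolution:

```python
import torch

sr, n_fft, hop = 16000, 512, 256
x = torch.randn(1, sr)
win = torch.hann_window(n_fft)
spec = torch.stft(x, n_fft, hop, window=win, return_complex=True).abs()

# torch.stft always covers 0 Hz .. Nyquist; the best you can do afterwards
# is crop to the nearest linearly spaced bins:
fmin, fmax = 50.0, 6000.0
freqs = torch.arange(n_fft // 2 + 1) * sr / n_fft   # bin centre frequencies
keep = (freqs >= fmin) & (freqs <= fmax)
cropped = spec[:, keep, :]
print(cropped.shape)
```

The kept bins remain linearly spaced at sr/n_fft Hz apart, whereas nnAudio re-derives the basis itself for the requested range and scale.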
For your second question, I think I am not knowledgeable enough to answer it, since I do not have experience with time-domain losses yet. Are you working on waveform generation? May I know what time-domain loss you use? I think I need to read more before I can answer this question.
A use case would be a spectrogram model with a time-domain loss that uses a non-trainable STFT/ISTFT inside the model (e.g. phase-aware speech enhancement). For now, I always got very poor results using torch.stft/torchaudio.istft and I wonder if this is because of the unmatched combination of STFT/ISTFT operators -> libtorch fft / 1d-conv...
Mathematically this should not make a difference for backprop as long as the functions allow autograd, right?
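On the autograd question: in recent PyTorch versions (where torch.istft exists alongside torch.stft), gradients do flow through the round trip, and with a COLA-satisfying window/hop the reconstruction is near-exact. A minimal check, assuming a current PyTorch rather than the torchaudio.istft available when this thread was written:

```python
import torch

n_fft, hop = 512, 128                    # hann + 75% overlap satisfies COLA
win = torch.hann_window(n_fft)
x = torch.randn(1, 8192, requires_grad=True)

spec = torch.stft(x, n_fft, hop, window=win, return_complex=True)
recon = torch.istft(spec, n_fft, hop, window=win, length=x.shape[-1])

loss = torch.nn.functional.l1_loss(recon, x.detach())   # time-domain loss
loss.backward()

print(loss.item())                  # near zero: the round trip reconstructs x
print(x.grad.abs().sum().item())    # gradient exists: backprop works through stft/istft
```

If a given STFT/ISTFT pair does not reconstruct near-perfectly (mismatched windows, hop, or centering), a time-domain loss trains against that reconstruction error, which could explain poor results.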
maybe also pinging @keunwoochoi, @bmcfee, @carlthome on this.
I always got very poor results using torch.stft/torchaudio.istft
Could you elaborate? Poor results in what context?
Mathematically this should not make a difference for backprop as long as the functions allow autograd, right?
Think that's right, but I've seen so many crazy, subtle things these last years, e.g. tf.signal.fft having bugs in the backward pass because some complex-number handling was off (like computing derivatives only with respect to the real part instead of the magnitude, or similar). Sometimes these issues are provoked by the forward pass, so it's possible that two different STFTs in the same autograd engine would be subtly different because reasons.
Don't know enough about torchaudio's internal autograd, but it looks like they have problems with graph compilation of the built-in STFT: https://pytorch.org/audio/_modules/torchaudio/functional.html#istft https://github.com/pytorch/pytorch/issues/21478 - whether this means that frame centering is off or something, I dunno.
Just to point out: I tried torchaudio and there seem to be some issues, as my models that train well using torch.stft do not train well using torchaudio operators. In my case I use only the direct transform (not the inverse) for computing spectral losses for raw-waveform generation, so I dropped it. I am curious though about trying nnAudio and will!
I am curious though about trying nnAudio and will!
Thanks for the interest in nnAudio. nnAudio is still in an early stage of development; if you find any bugs or problems, feel free to ask here again, and I will try my best to solve them and improve nnAudio.
torch.stft is a really good option, since no extra dependency is required for it, unlike torchaudio. What makes nnAudio different from torch.stft is the trainable STFT.
@KinWaiCheuk I opened two issues about some details from my first runs with nnAudio. Your tools are good; of course it's an early stage, but a great base to start with. I hope it's not too much throwing issues at you at this point; I just wanted to point out a few things.
Best
Maybe something is wrong, or maybe it's an implementation problem. torch.stft, which uses cuFFT, should be much faster than the cuDNN convolution used here, especially with such a large kernel. It is useful anyway, though; thanks.
For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?
I will bear this in mind and add this feature in my next release.
For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?
@jjhuang-ca, interesting! What's the benefit of zero padding each frame? I've long thought that since the DFT should be a power of two (for doing FFT), then each frame should just be filled up with the signal as necessary as the window function takes care of aliasing anyway (assuming sane defaults like a Hann function).
@carlthome The speech processing community long ago decided that win_length=400 is a good choice for time resolution, so it's basically standardized for speech recognition. You're right that the FFT is most efficient with a power-of-two size, and 512 is the closest to 400. The Hann window is applied to the 400 input samples, which are then zero-padded to 512 before the FFT. If you look at any speech processing pipeline this is how it's done (in Kaldi, ESPNet, etc.).
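For comparison, torch.stft already supports this speech-standard setup directly, since it accepts win_length < n_fft and zero-pads the window internally (a minimal sketch for 16 kHz speech; the hop of 160 samples = 10 ms is an assumed, typical value):

```python
import torch

x = torch.randn(1, 16000)                 # one second of 16 kHz "speech"
win = torch.hann_window(400)              # 25 ms analysis window
spec = torch.stft(x, n_fft=512, hop_length=160, win_length=400,
                  window=win, return_complex=True)
print(spec.shape)                          # (1, 257, 101)
```

The output still has n_fft // 2 + 1 = 257 frequency bins; only the effective analysis window is shorter than the FFT size.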
For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?
I have included the win_length argument and published a new version, 0.1.2.dev3. You can get it with pip install nnAudio --pre -U.
I tried a few test cases, and the output should be the same as librosa's. If you find any problem, please feel free to report it here.
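A quick way to sanity-check the win_length semantics against a reference is to zero-pad the short window to n_fft by hand and compare; per the torch.stft documentation the window is padded on both sides to length n_fft, and the symmetric 56/56 split below is my assumption of how that padding is divided:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4096)
win400 = torch.hann_window(400)

# let torch.stft pad the 400-sample window up to n_fft=512 ...
a = torch.stft(x, n_fft=512, hop_length=160, win_length=400,
               window=win400, return_complex=True)

# ... which should equal zero-padding the window symmetrically by hand
win512 = F.pad(win400, (56, 56))          # (512 - 400) / 2 zeros on each side
b = torch.stft(x, n_fft=512, hop_length=160, window=win512,
               return_complex=True)
assert torch.allclose(a, b, atol=1e-5)
```

The same trick works for checking any other STFT implementation's win_length handling against librosa or torch.stft.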
You mentioned in the readme that
Can you explain this in more detail, please?
When would I benefit from the STFT in nnAudio compared to, let's say, torch.stft?
Does it make a difference which STFT I use when I am interested in a time-domain loss, i.e. does it change backprop?
Thanks!