KinWaiCheuk / nnAudio

Audio processing using PyTorch 1D convolution networks
MIT License

Explain difference to torch.stft #4

Closed faroit closed 4 years ago

faroit commented 4 years ago

You mentioned in the readme that

Other GPU audio processing tools are torchaudio and tf.signal. But they are not using the neural network approach, and hence the Fourier basis can not be trained.

Can you explain this in more detail, please?

Thanks!

KinWaiCheuk commented 4 years ago

(1) nnAudio contains a trainable Fourier basis, which means the STFT process can be trained together with the model via backpropagation. (I am pushing a new version soon; in the new version you will get a new argument trainable=True so that you can train your STFT.) torch.stft does not support a trainable Fourier basis.

(2) nnAudio has more options for setting the frequency scale (e.g. freq_scale='log' for a logarithmic frequency scale). nnAudio also lets you set the minimum and maximum frequency of the STFT output (for example fmin=50, fmax=6000). These options are not available in torch.stft. (A short usage sketch follows at the end of this comment.)

(3) Finally, nnAudio seems to be faster than torch.stft? (I also have doubts about this; discussions are welcome.) I ran a script to compare how much time nnAudio and torch.stft take to convert 1000 random waveforms to spectrograms, and it seems nnAudio is faster?

https://colab.research.google.com/drive/1MMwpa9nvFlUly3aCnn1to9JeBKfXp94U

I am not sure whether my timing method is correct; again, discussions are welcome on this point. Personally, I believe torch.stft should be faster though.
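A minimal usage sketch of points (1) and (2), assuming nnAudio's Spectrogram.STFT API with its trainable, freq_scale, fmin and fmax arguments (argument names may differ between releases):

```python
# Minimal sketch (assumption: nnAudio's Spectrogram.STFT API; argument names
# may differ between releases) of a trainable STFT layer with a logarithmic
# frequency scale and a restricted frequency range.
import torch
from nnAudio import Spectrogram

stft_layer = Spectrogram.STFT(
    n_fft=2048,
    sr=22050,
    freq_scale='log',   # logarithmic frequency spacing instead of linear
    fmin=50,            # lowest frequency covered by the output bins
    fmax=6000,          # highest frequency covered by the output bins
    trainable=True,     # the Fourier basis becomes a trainable parameter
)

waveform = torch.randn(4, 22050)     # (batch, samples)
spectrogram = stft_layer(waveform)   # differentiable w.r.t. the Fourier basis
print(spectrogram.shape)
```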

faroit commented 4 years ago

Thanks for the reply. Sorry, my question was maybe not precise enough. Regarding (1), I am interested in the setting trainable=False, where the STFT is just used as a functional operator. In this setting, does it make a difference to the backpropagation algorithm which underlying operations are used?

KinWaiCheuk commented 4 years ago

I see. In that case, you won't get any significant advantage in terms of speed from nnAudio. (Although when I compare the computation time of torch.stft and nnAudio, nnAudio is somehow faster; I am not sure whether there is anything wrong with my timing method.) But if you want more control over the STFT output, then point (2) above is what nnAudio gives you (the output frequency range for torch.stft is always from 0 Hz to the Nyquist frequency).
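One common pitfall in such comparisons is asynchronous CUDA execution: without explicit synchronization, GPU timings can be misleading. A rough timing sketch (not a rigorous benchmark), assuming a CUDA device and the nnAudio.Spectrogram API:

```python
# Rough timing sketch, not a rigorous benchmark. The torch.cuda.synchronize()
# calls matter: CUDA kernels launch asynchronously, so timing without them
# mostly measures the launch overhead. Assumes a CUDA device.
import time
import torch
from nnAudio import Spectrogram

device = 'cuda'
n_fft, hop = 2048, 512
waveforms = torch.randn(1000, 16000, device=device)

stft_layer = Spectrogram.STFT(n_fft=n_fft, hop_length=hop, sr=16000).to(device)
window = torch.hann_window(n_fft, device=device)

torch.cuda.synchronize()
t0 = time.time()
out_nnaudio = stft_layer(waveforms)        # 1D-convolution based STFT
torch.cuda.synchronize()
print('nnAudio:   ', time.time() - t0)

t0 = time.time()
# On recent PyTorch, return_complex=True is required by torch.stft.
out_torch = torch.stft(waveforms, n_fft, hop_length=hop, window=window,
                       return_complex=True)
torch.cuda.synchronize()
print('torch.stft:', time.time() - t0)
```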

For your second question, I think I am not knowledgeable enough to answer, since I have no experience with time-domain losses yet. Are you working on waveform generation? May I ask which time-domain loss you use? I think I need to read more before I can answer this question.

faroit commented 4 years ago

For your second question, I think I am not knowledgeable enough to answer, since I have no experience with time-domain losses yet. Are you working on waveform generation? May I ask which time-domain loss you use? I think I need to read more before I can answer this question.

A use case would be a spectrogram model with a time-domain loss that uses a non-trainable STFT/ISTFT inside the model (e.g. phase-aware speech enhancement). So far I have always gotten very poor results using torch.stft/torchaudio.istft, and I wonder whether this is because of the mismatched combination of STFT/ISTFT operators (libtorch FFT vs. 1D convolution)...

Mathematically this should not make a difference for backprop as long as the functions allow autograd, right?

maybe also pinging @keunwoochoi, @bmcfee, @carlthome on this.
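One way to sanity-check this is to verify that gradients actually flow through an STFT/ISTFT round trip with a time-domain loss. A minimal sketch, assuming a recent PyTorch where both torch.stft and torch.istft support autograd (at the time of this thread the inverse still lived in torchaudio):

```python
# Minimal sketch (assumption: a PyTorch version with torch.istft and complex
# support) checking that gradients flow through an STFT -> ISTFT round trip
# driven by a toy time-domain loss.
import torch

n_fft, hop = 64, 16
window = torch.hann_window(n_fft, dtype=torch.float64)

x = torch.randn(1, 4 * n_fft, dtype=torch.float64, requires_grad=True)
spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
y = torch.istft(spec, n_fft, hop_length=hop, window=window, length=x.shape[-1])

loss = (y - x.detach()).abs().mean()  # toy time-domain loss
loss.backward()
print(x.grad.abs().mean())            # non-zero gradient => the autograd path works
```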

carlthome commented 4 years ago

So far I have always gotten very poor results using torch.stft/torchaudio.istft

Could you elaborate? Poor results in what context?

Mathematically this should not make a difference for backprop as long as the functions allow autograd, right?

I think that's right, but I've seen so many crazy subtle things in the last few years, e.g. tf.signal.fft having bugs in the backward pass because some complex-number handling was off (like only computing derivatives with the real part instead of the magnitude, or similar). Sometimes these issues are provoked by the forward pass, so it's possible that two different STFTs in the same autograd engine would be subtly different for obscure reasons.

I don't know enough about torchaudio's internal autograd, but it looks like they have problems with graph compilation of the built-in STFT: https://pytorch.org/audio/_modules/torchaudio/functional.html#istft https://github.com/pytorch/pytorch/issues/21478. Whether this means that frame centering is off or something, I don't know.

adrienchaton commented 4 years ago

Just to point out: I tried torchaudio and there seem to be some issues, as models that train well using torch.stft do not train well with the torchaudio operators. In my case I only use the forward transform (not the inverse) to compute spectral losses for raw waveform generation, so I dropped it. I am curious about trying nnAudio though, and I will!

KinWaiCheuk commented 4 years ago

I am curious about trying nnAudio though, and I will!

Thanks for your interest in nnAudio. nnAudio is still at an early stage of development; if you find any bugs or problems, feel free to ask here again, and I will try my best to fix them and improve nnAudio.

torch.stft is a really good option, since unlike torchaudio it requires no extra dependency. What makes nnAudio different from torch.stft is the trainable STFT.

adrienchaton commented 4 years ago

@KinWaiCheuk I opened two issues about some details from my first runs with nnAudio. Your tools are good; of course it's an early stage, but a great base to start from. I hope it's not too much to be throwing issues at you at this point, I just wanted to point out a few things.

Best

HudsonHuang commented 4 years ago

Maybe something is wrong with the timing, or maybe it is an implementation problem. torch.stft, which uses cuFFT, should be much faster than the cuDNN convolution used here, especially with such a large kernel. It is useful anyway though, thanks.

jjhuang-ca commented 4 years ago

For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?

KinWaiCheuk commented 4 years ago

I will bear this in mind and add this feature in my next release.

carlthome commented 4 years ago

For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?

@jjhuang-ca, interesting! What's the benefit of zero-padding each frame? I've long thought that since the DFT length should be a power of two (for an efficient FFT), each frame should just be filled with as much signal as necessary, since the window function takes care of aliasing anyway (assuming sane defaults like a Hann window).

jjhuang-ca commented 4 years ago

@carlthome The speech processing community long ago decided that win_length=400 (25 ms at 16 kHz) is a good choice for time resolution, so it's basically standardized for speech recognition. You're right that the FFT needs a power of two, and 512 is the closest to 400. The Hann window is applied to the 400 input samples, the frame is zero-padded to 512, and then the FFT is taken. If you look at any speech processing pipeline, this is how it's done (in Kaldi, ESPnet, etc.).
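A small NumPy sketch of how a single frame is processed under this convention (window 400 samples, zero-pad to 512, then FFT). For simplicity the zeros are appended at the end, as Kaldi-style pipelines do; librosa instead centers the shorter window inside the n_fft frame:

```python
# Sketch of one analysis frame with win_length=400 and n_fft=512:
# window the 400 samples, zero-pad to 512, then take the FFT.
import numpy as np

win_length, n_fft = 400, 512
frame = np.random.randn(win_length)          # one 25 ms frame at 16 kHz

windowed = frame * np.hanning(win_length)    # Hann window on the 400 samples
padded = np.pad(windowed, (0, n_fft - win_length))  # zero-pad up to n_fft
spectrum = np.fft.rfft(padded)               # 512-point FFT -> 257 bins
print(spectrum.shape)                        # (257,)
```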

KinWaiCheuk commented 4 years ago

For spectrogram, it would be good to be completely compatible with librosa. Currently, the win_length is assumed to be the same as n_fft. In speech processing, it is common to use n_fft=512, but win_length=400 (with zero-padding up to 512). Is there a plan for this modification?

I have included the win_length argument and published a new version, 0.1.2.dev3. You can get it with pip install nnAudio --pre -U.

I tried a few test cases, and the output should be the same as librosa's. If you find any problems, please feel free to report them here.
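A quick sketch of how one might compare the new win_length option against librosa, assuming a nnAudio version whose Spectrogram.STFT accepts win_length and output_format='Magnitude' (argument names may differ between releases):

```python
# Comparison sketch against librosa for n_fft=512, win_length=400.
# Assumption: Spectrogram.STFT accepts win_length and output_format='Magnitude';
# older releases may differ.
import librosa
import numpy as np
import torch
from nnAudio import Spectrogram

sr = 16000
y = np.random.randn(sr).astype(np.float32)

stft_layer = Spectrogram.STFT(n_fft=512, win_length=400, hop_length=160,
                              sr=sr, output_format='Magnitude')
with torch.no_grad():
    nn_spec = stft_layer(torch.from_numpy(y).unsqueeze(0)).squeeze(0).numpy()

lr_spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400))

# Edge frames can differ slightly depending on the padding mode (reflect vs. constant).
print(nn_spec.shape, lr_spec.shape)
print(np.abs(nn_spec - lr_spec).max())
```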