KinWaiCheuk / nnAudio

Audio processing by using pytorch 1D convolution network
MIT License

STFT Reconstruction from Mel Spectrograms #57

Open tasercake opened 4 years ago

tasercake commented 4 years ago

I've been playing around with trying to reconstruct an STFT spectrogram from a Mel spectrogram (derived using the MelSpectrogram class) and wondered if you might be interested in incorporating something of this sort into nnAudio.

I've created a Colab Notebook to demonstrate my results. The reconstruction quality as of now is slightly inferior to that of librosa, but the method is orders of magnitude faster. I tried my hand at some hyperparameter tuning, but judging by the values used by Torchaudio and Librosa, it seems like a lot more iterations (and a much lower LR?) are needed to achieve optimal reconstruction quality, and I don't have the compute resources to run a hyperparameter search for that. I've included some quick quality/speed comparisons in the Colab notebook.

My implementation is based on Librosa's mel_to_stft and TorchAudio's InverseMelScale.

If this is something you might be interested in adding to nnAudio, I'd be happy to open a pull request for further review.
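
For readers without access to the notebook, here is a minimal sketch of the general idea: recover a non-negative STFT magnitude from a mel spectrogram by gradient descent on the squared mel-domain error. It is not the notebook's code. It is written in NumPy rather than PyTorch for brevity, and `toy_filterbank`, `mel_to_stft_gd`, the triangle-on-a-linear-axis filterbank, and the step-size rule are all assumptions made for illustration.

```python
import numpy as np

def toy_filterbank(n_mels=8, n_freq=33):
    # Hypothetical stand-in for a mel filterbank: overlapping triangles on a
    # linear frequency axis, shape (n_mels, n_freq).
    centers = np.linspace(0, n_freq - 1, n_mels + 2)
    bins = np.arange(n_freq)
    fb = np.zeros((n_mels, n_freq))
    for m in range(n_mels):
        left, mid, right = centers[m], centers[m + 1], centers[m + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_to_stft_gd(mel, fb, n_iter=2000):
    # Minimise ||fb @ stft - mel||^2 subject to stft >= 0.
    # Step size 1/L, where L = 2 * sigma_max(fb)^2 is the Lipschitz constant
    # of the gradient, so the loss cannot increase.
    lr = 1.0 / (2.0 * np.linalg.norm(fb, 2) ** 2)
    stft = np.zeros((fb.shape[1], mel.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * fb.T @ (fb @ stft - mel)
        stft = np.maximum(stft - lr * grad, 0.0)  # project onto stft >= 0
    return stft

rng = np.random.default_rng(0)
fb = toy_filterbank()
true_stft = rng.random((33, 5))  # made-up magnitudes, 5 frames
mel = fb @ true_stft
est = mel_to_stft_gd(mel, fb)
rel_err = np.linalg.norm(fb @ est - mel) / np.linalg.norm(mel)
print(f"relative mel-domain error: {rel_err:.2e}")
```

Because there are fewer mel bands than frequency bins, the problem is underdetermined: the estimate can reproduce the mel spectrogram closely without matching the original STFT.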

KinWaiCheuk commented 4 years ago

Hi tasercake, I have also been looking at this for the past few days. Gradient descent does not work well in this case because we are dealing with a sparse matrix (the Mel filter banks). It will take forever for gradient descent to find the right solution.

To make it work better and faster, you need non-negative least squares (NNLS) instead. There is no existing NNLS function in PyTorch, so you would need to build your own NNLS on top of the L-BFGS-B algorithm. Someone has already implemented L-BFGS-B in PyTorch, which you might want to use as the basis for a PyTorch version of NNLS: https://github.com/hjmshi/PyTorch-LBFGS

I will push a better version of Griffin-Lim in a few days. (My existing Griffin-Lim is also based on gradient descent and is likewise not as good as the librosa result; the new version will be as good as librosa since it will be a direct clone of it.) It might come in handy when you implement the InverseMelScale, since you would just need to add mel_to_stft to this new version of Griffin-Lim to finish the InverseMelScale.

And yes, this feature would be very useful, and I have been wanting to implement it. Thanks for your help.
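
To make the NNLS formulation concrete, here is a small NumPy sketch of the problem being solved: minimise ||fb @ stft - mel||^2 subject to stft >= 0. Note that it deliberately swaps in multiplicative updates (the fixed-basis form of the Lee-Seung NMF update) rather than L-BFGS-B, only because that solver fits in a few lines. `toy_filterbank` and `nnls_multiplicative` are hypothetical names, and the filterbank is a toy stand-in, not a real mel filterbank.

```python
import numpy as np

def toy_filterbank(n_mels=8, n_freq=33):
    # Hypothetical stand-in for a mel filterbank: overlapping triangles on a
    # linear frequency axis, shape (n_mels, n_freq).
    centers = np.linspace(0, n_freq - 1, n_mels + 2)
    bins = np.arange(n_freq)
    fb = np.zeros((n_mels, n_freq))
    for m in range(n_mels):
        left, mid, right = centers[m], centers[m + 1], centers[m + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def nnls_multiplicative(fb, mel, n_iter=500, eps=1e-12):
    # Assumes fb >= 0 and mel >= 0. Each update rescales every entry by a
    # non-negative factor, so the estimate never leaves the feasible set.
    stft = np.ones((fb.shape[1], mel.shape[1]))
    numer = fb.T @ mel
    gram = fb.T @ fb
    for _ in range(n_iter):
        stft *= numer / (gram @ stft + eps)
    return stft

rng = np.random.default_rng(0)
fb = toy_filterbank()
mel = fb @ rng.random((33, 5))  # made-up magnitudes, 5 frames
est = nnls_multiplicative(fb, mel)
rel_err = np.linalg.norm(fb @ est - mel) / np.linalg.norm(mel)
print(f"relative mel-domain error: {rel_err:.2e}")
```

Whatever solver is used, the optimisation has to be rerun for every new mel spectrogram; it does not produce a matrix that can be reused.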

KinWaiCheuk commented 4 years ago

Hi tasercake, on second thought, I don't think NNLS is the right way to go, since it does not give us an inverse matrix. We would therefore need to call the function over and over again to estimate the STFT. So it seems your approach is better.

Now, my idea is to estimate the inverse matrix for the mel filter banks (probably using your approach), use it for the mel_to_stft conversion, and then use Griffin-Lim to get the audio back. This STFT reconstruction should therefore ultimately be able to integrate with our existing Griffin_Lim.

f0k commented 3 years ago

Note that if you get the normalization correct, you also get quite decent results by just transposing the mel filterbank (in-the-wild example: https://github.com/bkvogel/griffin_lim/blob/master/run_demo.py#L110). It's also possible to use the pseudoinverse of the mel filterbank, but I found this often introduces audible artifacts.
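
A rough NumPy sketch of both options, for comparison. The normalisation used here (rescale each row of the transposed filterbank so that a flat spectrum maps back to a flat spectrum) is only one possible choice and is an assumption, not necessarily what the linked demo does. `toy_filterbank`, `invert_with_transpose`, and `invert_with_pinv` are hypothetical names, and the filterbank is a toy stand-in.

```python
import numpy as np

def toy_filterbank(n_mels=8, n_freq=33):
    # Hypothetical stand-in for a mel filterbank: overlapping triangles on a
    # linear frequency axis, shape (n_mels, n_freq).
    centers = np.linspace(0, n_freq - 1, n_mels + 2)
    bins = np.arange(n_freq)
    fb = np.zeros((n_mels, n_freq))
    for m in range(n_mels):
        left, mid, right = centers[m], centers[m + 1], centers[m + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def invert_with_transpose(mel, fb, eps=1e-12):
    # gain[f] is what fb.T @ fb returns at bin f for an all-ones spectrum;
    # dividing by it makes a flat input come back flat (assumed normalisation).
    gain = fb.T @ fb.sum(axis=1)
    weights = fb.T / np.maximum(gain, eps)[:, None]
    return weights @ mel

def invert_with_pinv(mel, fb):
    # The pseudoinverse can return negative values, which a magnitude
    # spectrogram cannot have, so they are clipped to zero here.
    return np.maximum(np.linalg.pinv(fb) @ mel, 0.0)

fb = toy_filterbank()
flat = np.ones((33, 1))            # flat spectrum, one frame
covered = fb.sum(axis=0) > 0       # bins reached by at least one filter
est_t = invert_with_transpose(fb @ flat, fb)
est_p = invert_with_pinv(fb @ flat, fb)
```

Both variants reduce to a single fixed matrix that is computed once and applied to any number of frames.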

KinWaiCheuk commented 3 years ago

> you also get quite decent results by just transposing the mel filterbank

That is interesting to know. I had thought of that before, but it seemed too good to be true, so in the end I didn't try it. I will give it a try when I have time, but pull requests are welcome.