tasercake opened this issue 4 years ago
Hi tasercake, I have also been looking into this for the past few days. Gradient descent does not work well in this case since we are dealing with a sparse matrix (the Mel filter banks); it would take forever for gradient descent to find the right solution.
To make it work better and faster, you need non-negative least squares (NNLS) instead. There is no existing NNLS function in PyTorch, so you would need to use the L-BFGS-B algorithm to build your own NNLS in PyTorch. Someone has already implemented L-BFGS-B in PyTorch; you might want to use it to build the PyTorch version of NNLS: https://github.com/hjmshi/PyTorch-LBFGS.
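For reference, a minimal NumPy/SciPy sketch of what a frame-by-frame NNLS inversion of the mel filterbank looks like (this is not the PyTorch L-BFGS-B version discussed above; the function name and default parameters are just for illustration):

```python
# Sketch: solve mel_basis @ s ≈ m for each frame, with s >= 0 (NNLS).
import numpy as np
import librosa
from scipy.optimize import nnls

def nnls_mel_to_stft(mel_spec, sr=22050, n_fft=2048):
    """mel_spec: (n_mels, n_frames) power mel spectrogram -> (n_fft//2 + 1, n_frames)."""
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_spec.shape[0])
    stft_est = np.zeros((mel_basis.shape[1], mel_spec.shape[1]))
    for t in range(mel_spec.shape[1]):
        # Non-negative least squares per frame; simple but slow for long signals.
        stft_est[:, t], _ = nnls(mel_basis, mel_spec[:, t])
    return stft_est
```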
I will push a better version of Griffin-Lim in a few days (my existing Griffin-Lim is also based on gradient descent, which is not as good as the librosa result; the new version will be as good as librosa since it will be a direct clone of it). It might come in handy when you implement InverseMelScale, since you would just need to add the mel_to_stft step on top of this new version of Griffin-Lim to finish InverseMelScale.
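For reference, the librosa pipeline that this would mirror looks roughly like the following: `mel_to_stft` recovers an approximate magnitude STFT, then Griffin-Lim recovers the phase (parameter values here are purely illustrative):

```python
# Reference behaviour in librosa: mel -> approximate magnitude STFT -> Griffin-Lim.
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, power=2.0)

# Invert the mel filterbank (NNLS under the hood), then reconstruct phase.
stft_mag = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=2048, power=2.0)
y_rec = librosa.griffinlim(stft_mag, n_iter=32)
```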
And yes, this feature would be very useful and I have been wanting to implement it. Thanks for your help.
Hi tasercake, on second thought, I don't think NNLS is the right way to go, since it does not give us an inverse matrix; we would need to keep calling that solver over and over again to estimate the STFT. So it seems your approach is better.
Now, my idea is to estimate the inverse matrix for the mel filter banks (probably using your approach), use it for the mel_to_stft conversion, and then use Griffin-Lim to get the audio back. That way this STFT reconstruction should ultimately integrate with our existing Griffin_Lim.
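A rough PyTorch sketch of the "estimate an inverse matrix once and reuse it" idea (the function names here are placeholders, not the eventual nnAudio API):

```python
# Sketch: precompute a pseudoinverse of the mel filterbank and reuse it every call.
import torch
import librosa

def make_mel_inverse(sr=22050, n_fft=2048, n_mels=128):
    mel_basis = torch.from_numpy(
        librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    )  # (n_mels, n_fft//2 + 1)
    return torch.linalg.pinv(mel_basis)  # (n_fft//2 + 1, n_mels)

def mel_to_stft(mel_spec, inv_basis):
    # mel_spec: (n_mels, n_frames); clamp negatives introduced by the pseudoinverse.
    return torch.clamp(inv_basis @ mel_spec, min=0.0)
```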
Note that if you get the normalization correct, you also get quite decent results by just transposing the mel filterbank (in-the-wild example: https://github.com/bkvogel/griffin_lim/blob/master/run_demo.py#L110). It's also possible to use the pseudoinverse of the mel filterbank, but I found this often introduces audible artifacts.
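For illustration, the plain-transpose idea might look something like the sketch below; the normalization shown (rescaling each STFT bin's filter weights to sum to one) is just one possibility and not necessarily what the linked demo does:

```python
# Sketch: approximate mel -> STFT by transposing the mel filterbank.
import numpy as np
import librosa

def transpose_mel_to_stft(mel_spec, sr=22050, n_fft=2048):
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_spec.shape[0])
    inv = mel_basis.T  # (n_fft//2 + 1, n_mels)
    # One possible normalization: make each STFT bin's weights sum to ~1.
    inv = inv / np.maximum(inv.sum(axis=1, keepdims=True), 1e-8)
    return inv @ mel_spec
```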
> you also get quite decent results by just transposing the mel filterbank
That is interesting to know. I thought of that before, but it seemed too good to be true, so in the end I didn't try it. I will give it a try when I have time, but pull requests are welcome.
I've been playing around with trying to reconstruct an STFT spectrogram from a Mel spectrogram (derived using the `MelSpectrogram` class) and wondered if you might be interested in incorporating something of this sort into nnAudio.

I've created a Colab Notebook to demonstrate my results. The reconstruction quality as of now is slightly inferior to that of `librosa`, but it is orders of magnitude faster. I tried my hand at some hyperparameter tuning, but judging by the values used by Torchaudio and Librosa, it seems like a lot more iterations (and a much lower LR?) are needed to achieve optimal reconstruction quality (which I don't have the compute resources to run a hyperparameter search for). I've included some quick quality/speed comparisons in the Colab notebook.

My implementation is based on Librosa's `mel_to_stft` and TorchAudio's `InverseMelScale`. If this is something you might be interested in adding to nnAudio, I'd be happy to open a pull request for further review.
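For context, the gradient-descent approach described here (in the spirit of TorchAudio's `InverseMelScale`) can be sketched roughly as follows; the iteration count and learning rate are placeholders, not tuned values:

```python
# Sketch: optimize an STFT estimate so that mel_basis @ stft matches the mel spectrogram.
import torch
import librosa

def gd_mel_to_stft(mel_spec, sr=22050, n_fft=2048, n_iter=1000, lr=0.1):
    # mel_spec: (n_mels, n_frames) power mel spectrogram as a torch.Tensor.
    mel_basis = torch.from_numpy(
        librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_spec.shape[0])
    ).to(mel_spec.dtype)
    stft = torch.rand(mel_basis.shape[1], mel_spec.shape[1], requires_grad=True)
    opt = torch.optim.Adam([stft], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        # ReLU keeps the estimated magnitudes non-negative during optimization.
        loss = torch.nn.functional.mse_loss(mel_basis @ torch.relu(stft), mel_spec)
        loss.backward()
        opt.step()
    return torch.relu(stft).detach()
```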