Hi Adrien, thanks for pointing it out. It was a silly mistake on my part: I was debugging earlier and forgot to remove the print statements afterwards. I have removed them all, so you won't see those annoying prints anymore.
Thanks !
I have another error with CQT2010v2. On the first forward pass it gives:
File "/home/mil/adrien/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call result = self.forward(*input, **kwargs) File "/home/mil/adrien/.local/lib/python3.6/site-packages/nnAudio/Spectrogram.py", line 950, in forward CQT1 = self.get_cqt(x_down, hop, self.padding) File "/home/mil/adrien/.local/lib/python3.6/site-packages/nnAudio/Spectrogram.py", line 884, in get_cqt CQT_real = conv1d(x, self.cqt_kernels_real, stride=hop_length) RuntimeError: invalid argument 11: stride should be greater than zero, but got dH: 1 dW: 0 at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/generic/SpatialConvolutionMM.cu:16
I work at a 22050 Hz sampling rate. I want to compute the CQT at 6 hop sizes, hop_size = [32, 64, 128, 256, 512, 1024], corresponding to different resolutions with n_fft = hop*4 (i.e. 128 up to 4096). The CQT does not take n_fft as a parameter, so instead I increase bins_per_octave as np.linspace(12, 48, num=6, dtype='int'). The other settings are fmin=60. and fmax=sr/2.
If, for instance, I set every bins_per_octave to 24, the forward pass works. But that loses the benefit of multiple resolutions (fewer bins with smaller hops but more frames, and more bins with larger hops but fewer frames).
According to librosa, hop_size must be a multiple of 2**(n_bins / bins_per_octave). Should I rather set n_bins as a multiple of int(np.log(hop_size)/np.log(2))*bpo? Or what would be good practice to set up this multi-resolution CQT?
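For reference, here is a minimal sketch of the setup I have in mind (assuming the CQT2010v2 constructor takes the sr, hop_length, fmin, fmax and bins_per_octave arguments as used above, with fmax taking the place of n_bins); hop_size=32 is the case that triggers the error:

```python
import numpy as np
from nnAudio import Spectrogram

sr = 22050
hop_sizes = [32, 64, 128, 256, 512, 1024]       # 6 resolutions
bpos = np.linspace(12, 48, num=6, dtype='int')  # more bins for larger hops

# one CQT layer per resolution; fmin/fmax as described above
cqt_layers = [
    Spectrogram.CQT2010v2(sr=sr, hop_length=hop, fmin=60., fmax=sr / 2.,
                          bins_per_octave=int(bpo))
    for hop, bpo in zip(hop_sizes, bpos)
]
```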
I can also reproduce the same error when the hop size is very small (32). I am not sure what the cause is yet; I will investigate first and come back to you when I have figured something out.
Do you get this with other spectrogram methods? Or is it only in CQT2010v2?
Only CQT2010v2 so far. Also, it seems that the same series of settings (hop size, fmin, fmax, sr) does not cause an error in CQT1992v2.
In fact, CQT1992v2 is recommended over CQT2010v2, since CQT1992v2 yields a smoother result. I will update the README in a few days to show the difference between the two. But if you really want the same CQT algorithm as librosa, then CQT2010v2 is the closest one.
Anyhow, I will try to figure out what is happening with CQT2010v2 when the hop size is too small.
Thanks for the feedback. I had indeed observed that CQT1992v2 looked better; only because it is the older algorithm did I still try to run a comparison with training plots. I don't need to match librosa, so I will stick with CQT1992v2 when using CQT!
> According to librosa, hop_size must be a multiple of 2**(n_bins / bins_per_octave). Should I rather set n_bins as a multiple of int(np.log(hop_size)/np.log(2))*bpo? Or what would be good practice to set up this multi-resolution CQT?
The problem is that the hop_length is too small. librosa and nnAudio's CQT2010v2 use the same downsampling CQT algorithm. I have attached an image below to show how it works.

Basically, this algorithm downsamples the audio n-1 times, where n is the number of octaves. Each downsampling halves the sampling rate of the audio; as a side effect, the hop_length is halved too.

For example, if you set n_bins=88 and bins_per_octave=12, then there are 8 octaves, which means 7 downsamplings are required. If you have hop_length=32, then 32/2^7 = 0.25, which is not a valid hop_length.

However, if you set bins_per_octave=24, then the number of octaves becomes 4, i.e. 3 downsamplings are required, and 32/2^3 = 4 is a valid hop_length.

This problem comes from the CQT algorithm itself, and there is nothing we can do about it. To avoid the error, please start with a hop_length big enough to remain valid after downsampling.
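As a quick sanity check, here is the arithmetic above as a few lines of plain Python (just a sketch, not part of nnAudio):

```python
import math

def min_valid_hop(n_bins, bins_per_octave):
    # n octaves need n-1 downsamplings, each halving the hop,
    # so hop_length must be a multiple of 2**(n_octaves - 1)
    n_octaves = math.ceil(n_bins / bins_per_octave)
    return 2 ** (n_octaves - 1)

print(min_valid_hop(88, 12))  # 128 -> hop_length=32 is invalid (32/128 = 0.25)
print(min_valid_hop(88, 24))  # 8   -> hop_length=32 is valid   (32/8   = 4)
```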
I hope this answers your question.
Thank you for your insights and for taking the time to explain these details!
So far I have settled on using CQT1992v2 when computing CQT, since it has no issues with small hop sizes and computes smoother spectrograms.
However, it seems to run slower than CQT2010v2. Is that correct? Your poster comparison seems to indicate that GPU computation time ranks as CQT2010 > CQT2010v2 > CQT1992v2, but the couple of trainings I ran seemed to indicate CQT1992v2 > CQT2010v2.
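For what it's worth, here is roughly how one could time the two layers in isolation (a sketch only: I am assuming a nnAudio version whose layers are torch modules that can be moved to GPU with .to(), and the settings mirror my runs):

```python
import time
import torch
from nnAudio import Spectrogram

x = torch.randn(16, 4 * 22050, device='cuda')  # batch of 4-second clips at 22050 Hz

for cls in (Spectrogram.CQT1992v2, Spectrogram.CQT2010v2):
    layer = cls(sr=22050, hop_length=256, fmin=60., fmax=22050 / 2.,
                bins_per_octave=24).to('cuda')
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        layer(x)
    torch.cuda.synchronize()  # wait for the GPU before reading the clock
    print(cls.__name__, time.time() - t0, 'seconds for 100 forwards')
```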
I tried it, but it is no big deal for my experiment; so far the best spectral loss operator seems to be the STFT with a log frequency scale. A great tool from your library!
You must be right. I have not used CQT that much, but in one experiment with either CQT1992v2 or CQT2010v2, the second one seemed to run faster. I was running it at 22k with bpo = 24 and hop in [64, 128, 256, 512, 1024]. No n_bins, as I set fmax to sr/2.
Then I tried using only CQT1992v2, with bpo increasing along with the hop size. But my results so far point towards using log-frequency STFT or Mels; the fit is better and faster than with CQT. So I'll probably leave CQT aside for now.
When I was using CQT for my music transcription experiments, I also found that CQT does not perform as well as STFT or Mel spectrograms. The two papers below also point out that CQT is not as good as other spectrograms such as Mel or STFT:
- On the Potential of Simple Framewise Approaches to Piano Transcription
- Singing Style Investigation by Residual Siamese Convolutional Neural Networks
They don't give an explicit explanation, and I am also not sure what makes CQT worse than other spectrograms.
Btw, what is the task you are doing and what is the model you are using now? It seems I have asked you before...
You definitely have more experience with CQT than me, so I cannot add more to your knowledge on that. Thanks for the references, and for your feedback, which also confirms my first impression of training with either Spectrogram/Mels or CQT.
It is of course task dependent, and CQT may well be more relevant in other training cases.
For my experiment, I am working on a two-latent-space VAE for raw waveform generation. Maybe the term for that is hierarchical VAE, I am not sure. Basically, one latent space autoencodes more local waveform features; on top of that, a recurrent VAE aggregates more macro waveform features. I am not using a particular pre-established model but trying to develop my own.
But the main references for this work are:
- DDSP: https://openreview.net/forum?id=B1x1ma4tDr
- NSF: https://arxiv.org/abs/1904.12088
- SING: https://arxiv.org/abs/1810.09785
And I use spectral losses as reconstruction costs for the raw waveform modeling.
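As a minimal sketch of what I mean by that (the freq_scale='log' and output_format='Magnitude' arguments are how I understand the nnAudio STFT layer; the set of resolutions and the linear+log weighting are my own illustrative choices):

```python
import torch
from nnAudio import Spectrogram

class MultiResSpectralLoss(torch.nn.Module):
    # compares magnitudes of log-frequency STFTs at several resolutions
    def __init__(self, sr=22050, n_ffts=(128, 512, 2048), eps=1e-7):
        super().__init__()
        self.eps = eps
        self.stfts = torch.nn.ModuleList([
            Spectrogram.STFT(n_fft=n, hop_length=n // 4, sr=sr,
                             freq_scale='log', fmin=60., fmax=sr / 2.,
                             output_format='Magnitude')
            for n in n_ffts
        ])

    def forward(self, x_hat, x):
        loss = 0.
        for stft in self.stfts:
            s_hat, s = stft(x_hat), stft(x)
            lin = (s_hat - s).abs().mean()  # linear-magnitude term
            log = (torch.log(s_hat + self.eps)
                   - torch.log(s + self.eps)).abs().mean()  # log-magnitude term
            loss = loss + lin + log
        return loss
```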
I met the team who wrote the DDSP paper at ISMIR 2019. Their results are really good, and their paper is difficult to understand, as usual. Many people are doing VAE architectures now, and I can see the potential in VAEs. I think I will also start doing VAE-related research. Thanks for sharing papers related to your research; I can't wait to read your paper in the future.
I will post the code on GitHub after I submit and get reviewed. ;-) The DDSP paper has interesting and subtle details related to DSP in particular. I liked that, and I actually use a few DSP techniques in my current model. I dropped the totally naive approaches in order to put more sound-related constraints into the model.
On the learning/ML side it is rather simple. And I tend to make it even simpler in my latest experiments, to the benefit of experimenting more with DSP concepts put into the ML.
In case you have questions about VAEs, feel free to ask; it's indeed a very interesting model!
Hi, so far I have had good results switching to your library for computing spectral reconstruction losses for raw waveform generation!
However, CQT2010v2 prints "downsample_factor = 4" 10 times on every forward pass, which is not desirable when using it for minibatch training. Could it be disabled, please (maybe via an argument)?
In my case I do not manually set earlydownsample=False and leave it at the default. Maybe that affects the printing. Also, could you give a quick recommendation on this setting, please?
Thanks