SonyCSLParis / pesto

Self-supervised learning for fast pitch estimation
GNU Lesser General Public License v3.0

Problems with inference on short audio clips #29

Closed TOGA101 closed 4 months ago

TOGA101 commented 6 months ago

This model predicts frequencies very accurately; however, when I apply it to shorter audio clips, the CQT computation fails. My audio sampling rate is 16000 and I set step_size to 20 milliseconds. When I run prediction on audio shorter than 16385 samples, the forward method of the CQT module raises an error because reflect pad_mode is used: "Padding size should be less than the corresponding input dimension." I don't know much about CQT calculations, so I don't know how to solve this problem. For shorter audio clips, do I have to pad them every time I run inference? Or can I change pad_mode to constant? Would that affect the results? Or do you have any better suggestions? Thank you very much.
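If zero-padding before inference turns out to be the simplest workaround, a minimal sketch could look like the following. This is an assumption, not confirmed behavior of pesto: `pad_to_min_length` is a hypothetical helper, and the 16385-sample threshold is just the value observed empirically above for this sampling rate.

```python
MIN_SAMPLES = 16385  # empirically observed minimum for sr=16000 (from the report above)

def pad_to_min_length(audio, min_samples=MIN_SAMPLES):
    """Zero-pad a 1-D audio buffer on the right so it has at least min_samples samples.

    Returns a new list; audio that is already long enough is returned unchanged
    (as a copy). Note: zero-padding may still bias pitch estimates near the clip
    boundary, so the trailing frames should be treated with caution.
    """
    deficit = min_samples - len(audio)
    if deficit <= 0:
        return list(audio)
    return list(audio) + [0.0] * deficit
```

In practice one would apply this to the waveform (e.g. a NumPy array or torch tensor) before calling the model, then discard the pitch frames that fall inside the padded region.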

aRI0U commented 4 months ago

Hi, Thanks for your message and sorry for the late reply. I am not an expert on the CQT either, but from what I understand it uses a different window size for each frequency bin, with longer windows at low frequencies. One way to circumvent the issue would be to increase the minimal frequency; however, since the trained model expects a fixed-size input, making that work may be quite hacky. The current implementation of the CQT comes from nnAudio, and maybe one could find a better implementation. Apparently, there exist some real-time implementations of the CQT, such as this repo (https://github.com/jmerkt/rt-cqt), but I haven't tried it myself, so it would require further exploration.
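For intuition on why the minimal frequency matters: in the standard constant-Q transform, the analysis window for bin $k$ has length $N_k = \lceil Q \cdot sr / f_k \rceil$ with $Q = 1 / (2^{1/B} - 1)$, so the lowest bin dictates the longest window and hence the minimum usable input length. A rough sketch of that relation (illustrative only, not pesto's or nnAudio's actual code; parameter names are assumptions):

```python
import math

def longest_cqt_window(sr, fmin, bins_per_octave=12):
    """Length (in samples) of the analysis window for the lowest CQT bin.

    Uses the standard constant-Q relations:
        Q   = 1 / (2**(1 / bins_per_octave) - 1)
        N_k = ceil(Q * sr / f_k)
    The lowest frequency fmin yields the largest N_k, which in turn bounds
    how short the input can be before (reflect) padding becomes impossible.
    """
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)
    return math.ceil(Q * sr / fmin)
```

This makes the trade-off concrete: raising fmin (or lowering bins_per_octave) shrinks the longest window and thus the minimum clip length, at the cost of losing the lowest pitches or frequency resolution.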

In other words, there are probably some solutions to this problem, but I don't see any simple one. Sorry for that and good luck!