SonyCSLParis / pesto-full

Full models and training code for PESTO
GNU Lesser General Public License v3.0

Question: polyphonic mono-timbral pitch detection + inference speed info? #3

Closed by KoenT-SS 3 months ago

KoenT-SS commented 3 months ago

Hi,

this is not an issue or bug (no worries ;-) ), but rather 2 questions:

I just read the paper (very nice work!) and I was wondering whether you have worked on applying this SSL method to polyphonic music coming from a single instrument ("mono-timbral"), like piano.

If I understood things correctly, this should be possible, as the main idea of transposition equivariance should still hold, right? Of course, there can now be multiple simultaneous pitches (sometimes 1, sometimes 5, ...), and there would need to be another method to decide which probabilities are high enough to pick as notes (a simple argmax won't suffice here). But the probability vectors should still be shifted versions of each other when transposing, so... Did you already try this? (If not, do you know of other systems that can do that using SSL, by any chance?)
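To make the idea concrete, here is a minimal sketch of what "picking notes" from a shifted activation vector could look like. It is a toy illustration only: the function names, the 0.5 threshold, and the hand-written activation values are all assumptions, not anything taken from PESTO.

```python
def pick_pitches(probs, threshold=0.5):
    """Return indices of local maxima whose activation is at least `threshold`.

    Hypothetical replacement for argmax when several pitches can be active.
    """
    picked = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i < len(probs) - 1 else float("-inf")
        if p >= threshold and p > left and p >= right:
            picked.append(i)
    return picked


def transpose(probs, shift):
    """Shift activations by `shift` bins, zero-filling the vacated bins."""
    out = [0.0] * len(probs)
    for i, p in enumerate(probs):
        j = i + shift
        if 0 <= j < len(probs):
            out[j] = p
    return out


# Toy activation vector with two simultaneous notes (bins 3 and 7).
probs = [0.0, 0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.8, 0.1, 0.0, 0.0, 0.0]

notes = pick_pitches(probs)
shifted_notes = pick_pitches(transpose(probs, 2))
```

With an ideally equivariant model, `shifted_notes` would be `notes` with every index moved up by the transposition amount, whatever the number of active pitches.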

Also, would it be possible to give some indication of how fast the inference processing actually is (number of times faster than real-time, including CQT, but not including audio file loading, on a standard CPU)? I don't think I found inference speed performance numbers in the paper (or maybe I skipped it somehow?).

Kind regards, Koen

aRI0U commented 3 months ago

Hi! Thanks a lot for your interest in our work.

Regarding polyphony, there is no theoretical reason why it wouldn't work, as you pointed out. However, when switching to polyphonic input you have to replace the final softmax layer with a sigmoid or something similar, and in practice predicting an arbitrary number of pitches in a self-supervised way is much harder than predicting one. I ran a few experiments on this but didn't manage to get stable training. Applying PESTO to multiple pitches could be an interesting research direction, but the direct naive approach doesn't work.
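A small sketch of why the output layer has to change, using plain Python rather than the actual model code (the logits below are made up for illustration): softmax forces the activations to sum to 1, so two equally strong pitches can each get at most about 0.5, whereas an element-wise sigmoid scores every bin independently.

```python
import math


def softmax(logits):
    """Normalise logits into a distribution that sums to 1 (one pitch assumed)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each bin independently, so several bins can be close to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits with two equally strong pitches (bins 1 and 3).
logits = [-4.0, 5.0, -4.0, 5.0, -4.0]

soft = softmax(logits)  # the two active bins share the probability mass
sig = sigmoid(logits)   # both active bins are close to 1
```

This only illustrates the output nonlinearity; it says nothing about how to make the self-supervised training itself stable.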

As for SSL for multipitch estimation, I think there is only a single paper on that, called "Towards self-supervised multipitch estimation", but I don't know whether an implementation is available.

Regarding the speed of the model, on my laptop CPU it runs 12-15x faster than real time. However, for a true real-time application you'd need to compute the CQT frames individually, which is challenging because the CQT has huge kernels in the low frequencies. There are probably ways to circumvent this issue, but here too the naive solution won't work directly.
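For anyone who wants to reproduce a "times faster than real time" figure on their own machine, a rough sketch of the measurement follows. The helper `real_time_factor` and the stand-in workload `dummy_pipeline` are hypothetical; in practice `process` would wrap the CQT plus model forward pass on audio that is already loaded in memory.

```python
import time


def real_time_factor(process, audio_seconds, repeats=3):
    """Return audio duration divided by the best wall-clock processing time.

    A value of 12 means the audio is processed 12x faster than real time.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on coarse clocks.
    return audio_seconds / max(best, 1e-9)


def dummy_pipeline():
    """Stand-in workload (assumption): replace with CQT + model inference."""
    return sum(i * i for i in range(100_000))


rtf = real_time_factor(dummy_pipeline, audio_seconds=10.0)
```

Taking the best of several runs reduces the influence of warm-up and background load, but the result still depends heavily on the CPU and the audio length.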

Let me know if you have other questions!

KoenT-SS commented 3 months ago

Hi Alain, thank you for your response.

OK, I understand the polyphonic case has not been explored in-depth (yet?). Hopefully you will be able to work on that more at some point. It would be a very elegant system if that could be made to work as well. I think a single polyphonic instrument like piano would be a good start (multi-instrument mixes seem a bit too complicated, and that's also not always necessary in practice).

Thanks also for the reference to that paper. I'll check it out.

For a real-time implementation, I would expect that a C++ port may perform better than a Python version on CPU. And yes, the low frequencies are going to lead to more latency due to the time/frequency "uncertainty principle" causing the big kernels at the low end, but for some applications this might be acceptable, even if it's like up to 1 second.
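As a back-of-the-envelope check on that latency, the textbook constant-Q relation gives a kernel duration of Q / f seconds, with Q = 1 / (2^(1/B) - 1) for B bins per octave. The sketch below assumes that relation with no window scaling; the values of B and the lowest frequency are illustrative and are not PESTO's actual CQT settings.

```python
def q_factor(bins_per_octave):
    """Constant Q for a given frequency resolution (textbook definition)."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


def kernel_duration(freq_hz, bins_per_octave):
    """Approximate length in seconds of the CQT kernel centred on `freq_hz`."""
    return q_factor(bins_per_octave) / freq_hz


# Illustrative values: A0 (27.5 Hz) versus A4 (440 Hz), 12 bins per octave.
d_low = kernel_duration(27.5, 12)
d_high = kernel_duration(440.0, 12)
```

Under these assumptions the A0 kernel lasts roughly 0.61 s, 16 times longer than the one at A4, and a finer resolution (more bins per octave) lengthens every kernel proportionally to Q.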

You answered my questions, so feel free to close the issue now.

Thanks again, Koen

aRI0U commented 3 months ago

Cool! I'll let you know if there are updates :eyes: