DSP to parse audio signal into MIDI sequence

FeedBackDevs / feedback

FeedBack - Music & Rhythm Game Engine

GNU General Public License v2.0


DSP to parse audio signal into MIDI sequence #3

Open TurkeyMan opened 10 years ago

TurkeyMan commented 10 years ago

Necessary to support vocals and 'pro' guitar.

Must be very low latency!

p0nce commented 10 years ago

Here is a short-term FFT analyzer: https://github.com/p0nce/dplug/blob/master/dsp/dplug/dsp/fft.d

If I understand correctly, you want blind separation of several sources mixed together. All I know is that for monophonic signals, time-domain methods are faster, more accurate, and lower latency than FFT-based ones. For polyphonic signals it all breaks down and you have to work in the frequency domain, which brings quite a lot of latency.
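To put a rough number on that latency, here is a back-of-the-envelope sketch. It assumes a 44.1 kHz sample rate and a naive analysis where the FFT bin spacing alone must separate two notes a semitone apart; the function name is made up for illustration. Interpolating between bins or overlapping windows does better in practice, so treat the result as a pessimistic bound rather than a measurement.

```python
import math

def fft_window_for_semitone(lowest_hz, sample_rate=44100):
    """Smallest power-of-two FFT size whose bin spacing is no wider than
    the gap between lowest_hz and the note one semitone above it.
    Returns (size_in_samples, window_duration_in_ms)."""
    semitone_gap_hz = lowest_hz * (2 ** (1 / 12) - 1)
    min_samples = sample_rate / semitone_gap_hz  # bin spacing = sample_rate / size
    size = 2 ** math.ceil(math.log2(min_samples))
    return size, 1000.0 * size / sample_rate

# Low E on a guitar (about 82.41 Hz) is the hardest case for a standard tuning.
size, window_ms = fft_window_for_semitone(82.41)
print(size, round(window_ms, 1))
```

Under these assumptions the window alone spans far more than one 16 ms video frame, which is why preprocessing the songs is worth considering.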

Do you really need low latency? You might preprocess the songs.

TurkeyMan commented 10 years ago

That helps! :)

I suspect the output will need a lot of filtering/smoothing, which will make it fairly tricky to get accurate readings at very low latency. Different voices (male/female) and picking up to 6 signals out of a mixed guitar signal... these need to be made robust.

p0nce commented 10 years ago

OK (stop me if I'm wrong) the inputs are:

monophonic voice signal (a)
polyphonic guitar chords mixed together (b)

Desired output:

note onset / off
pitch

For (a), Autotune claims to use auto-correlation methods (very roughly: FFT, power spectrum, inverse FFT, then peak detection) to detect pitch. There are rumors that it's actually time-domain, and in my experience you can get something like 10ms latency for typical material. As for (b), Melodyne separates guitar chords, and it's an impressive tool for pitch, but I really don't know how they do it. You should ask in the DSP section of the KVR Audio forum.
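For reference, a minimal sketch of autocorrelation pitch detection for case (a), computed directly in the time domain. This is a textbook version, not what Autotune or dplug actually do; the function name, the `fmin`/`fmax` search range and the 8 kHz test tone are assumptions chosen to keep the example small.

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the pitch of a monophonic frame by picking the lag with the
    largest autocorrelation, normalized by the frame energy.
    Returns (frequency_hz or None, score)."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or len(frame) <= max_lag:
        return None, 0.0  # silence, or frame too short for the lowest pitch
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        acc = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        score = acc / energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None, 0.0
    return sample_rate / best_lag, best_score

# 50 ms of a 220 Hz sine sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 220 * n / rate) for n in range(400)]
print(autocorr_pitch(frame, rate))
```

The lag is an integer number of samples, so the estimate is quantized; real detectors interpolate around the peak and need a frame covering at least one or two periods of the lowest expected note, which sets a floor on latency.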

Note onset/offset is not that easy either, since thresholds will inevitably be volume dependent.
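One common way to soften the volume dependence is to make the thresholds relative to a slowly decaying running peak and to use hysteresis (a higher threshold to start a note than to end it). The sketch below assumes an amplitude envelope has already been extracted; the function name and all ratios and constants are illustrative guesses, not tuned values.

```python
def note_events(envelope, on_ratio=0.5, off_ratio=0.2, floor=1e-3, decay=0.999):
    """Return a list of ('on', index) / ('off', index) events.
    Thresholds are fractions of a slowly decaying running peak, so the
    detector adapts to the overall input level. `floor` keeps it from
    triggering on near-silence."""
    events = []
    peak = floor
    active = False
    for i, level in enumerate(envelope):
        # Track the recent peak level, letting it decay a little each sample.
        peak = max(level, peak * decay, floor)
        if not active and level > floor and level >= on_ratio * peak:
            events.append(("on", i))
            active = True
        elif active and level < off_ratio * peak:
            events.append(("off", i))
            active = False
    return events

loud = [0.0] * 5 + [0.8] * 10 + [0.05] * 5 + [0.5] * 10 + [0.0] * 5
quiet = [0.1 * x for x in loud]
print(note_events(loud))
print(note_events(quiet))
```

The same performance at one tenth of the volume produces the same events here, but a quiet note directly after a very loud one would still be missed until the running peak decays, so this only reduces the problem.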

TurkeyMan commented 10 years ago

Sounds more or less right to me. I have no idea how the polyphonic signal separation is done, but the vox one sounds about right.

10ms is probably okay. Frames are 16ms, and the UI layer draws later in the frame, so it can be afforded the better part of the frame (most time spent rendering the background scene). I don't know how bad it would feel if visual response was a frame late... just one frame might be okay, but 2 is a lot. I can easily feel 2, and I'm personally pretty sensitive to even one frame latency.

It's a pretty involved piece of work. Hopefully someone more qualified than me steps forward to have a go at it! :)

p0nce commented 10 years ago

I will probably add to dplug a pitch detector that I wrote for voice; I just need to port it from C++. It was meant to be secret, but what the heck. It also works for monophonic harmonic signals like a single guitar chord, but strangely not for pure sines.

Unfortunately the latency of the audio API (and buffer size) has a way higher impact than mere detection. To have a simultaneous feel I had to make the audio host use ASIO and lower the buffer size to a few ms.
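The buffer contribution is easy to estimate: one buffer of N frames lasts N / sample_rate seconds. The sketch below assumes 44.1 kHz and only counts a single input buffer; drivers and hosts typically add more on top, so real figures are higher. The function name is made up for illustration.

```python
def buffer_latency_ms(buffer_frames, sample_rate=44100):
    """Duration of one audio buffer in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate

# Compare against the 16 ms video frame mentioned earlier in the thread.
for frames in (64, 128, 256, 512, 1024, 2048):
    print(frames, round(buffer_latency_ms(frames), 1))
```

At this sample rate a 1024-frame buffer already lasts longer than a 16 ms video frame on its own, while 128 frames stays at a few ms.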

TurkeyMan commented 10 years ago

Yeah, I suspect some headaches with the capture APIs. We'll see how it goes when we get there. I think the simpler instruments like drums will come first ;)

p0nce commented 10 years ago

https://github.com/p0nce/dplug/blob/master/dsp/dplug/dsp/goldrabiner.d

I've made a test program which outputs a WAV with pitch, voiced/unvoiced, and a crude resynthesized output with volume = 1: https://github.com/p0nce/dplug/blob/master/examples/pitch_detect/pitch_detect.d

The thing to get is that when there is no pitch (voicedness towards 0), the pitch output is wrong and shouldn't be used.

It can be used for monophonic voice and probably other instruments.
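A sketch of how a consumer might apply the voicedness caveat above and turn the result into MIDI note numbers. It is written in Python for brevity and does not use the dplug API; it assumes one pitch value in Hz and one voicedness value between 0 and 1 per analysis frame, and the 0.5 threshold and function names are arbitrary placeholders.

```python
import math

def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return int(round(69 + 12 * math.log2(freq_hz / 440.0)))

def gated_notes(pitches_hz, voicedness, threshold=0.5):
    """One MIDI note per frame, or None where the pitch should be ignored
    because the frame is not voiced enough."""
    notes = []
    for freq, voiced in zip(pitches_hz, voicedness):
        if voiced >= threshold and freq > 0:
            notes.append(hz_to_midi(freq))
        else:
            notes.append(None)
    return notes

# The middle frame has low voicedness, so its pitch value is discarded.
print(gated_notes([440.0, 123.0, 220.0], [0.9, 0.1, 0.8]))
```

Runs of identical notes separated by None could then be merged into note on/off pairs for the MIDI sequence.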