RobinSchmidt / RS-MET

Codebase for RS-MET products (Robin Schmidt's Music Engineering Tools)

Need machine learning process #270

Open · elanhickler opened this issue 5 years ago

elanhickler commented 5 years ago

We need mono audio as the input (time-domain and/or FFT/frequency-domain data). I need a time instant as the output.

Someone suggested TensorFlow: https://www.tensorflow.org/guide/extend/cc

You said you had some interest in doing machine learning. But the FFT stuff is extremely important right now, and I don't want to distract you from that. So I'm thinking of finding an additional programmer for the machine learning stuff, unless you want to convince me to wait for you to finish the FFT stuff and then have you work on the machine learning yourself.

I'll be looking for a programmer.

RobinSchmidt commented 5 years ago

what do you want to do with neural networks in the context of digital audio? i think, they may be interesting in the context of modeling nonlinear feedback systems. i would probably start here:

https://papers.nips.cc/paper/1276-neural-network-modeling-of-speech-and-music-signals.pdf

the author was the supervisor of my master's thesis and wrote a phd thesis about it - this is just a brief summary (the full thesis is in german... but if you are interested, i can dig it out). i read it and understood it ... 12 years ago. and i actually have a multilayer perceptron implementation in my codebase that i programmed back then just for fun and am still looking for applications for :-) (it has to be brushed up, though - but it was generally working)
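just to make the basic idea concrete: the modeling approach in that line of work is nonlinear time-series prediction - train a network to predict the next sample from the previous N samples. a generic sketch of the forward pass (this is not my actual code - the single hidden layer, layer sizes and the tanh nonlinearity are arbitrary choices here):

```cpp
// generic sketch of one-step-ahead prediction with one hidden layer:
// given the N previous samples in x, predict the next sample.
#include <vector>
#include <cmath>

struct Mlp
{
  std::vector<std::vector<double>> W1; // hidden-by-input weight matrix
  std::vector<double> b1;              // hidden biases
  std::vector<double> w2;              // hidden-to-output weights
  double b2 = 0.0;                     // output bias

  double predict(const std::vector<double>& x) const
  {
    std::vector<double> h(b1.size());
    for(size_t i = 0; i < h.size(); i++)
    {
      double s = b1[i];
      for(size_t j = 0; j < x.size(); j++)
        s += W1[i][j] * x[j];
      h[i] = std::tanh(s);             // hidden-layer nonlinearity
    }
    double y = b2;
    for(size_t i = 0; i < h.size(); i++)
      y += w2[i] * h[i];
    return y;                          // predicted next sample
  }
};
```

training adjusts the weights by backpropagation on recorded material; for synthesis, the trained predictor is then run freely on its own past outputs - that's the modeling idea from the paper.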

RobinSchmidt commented 5 years ago

oh...well...you never actually said neural networks...you just said machine learning....which is a much broader subject. ...but i think, mostly people mean neural networks when they say machine learning

RobinSchmidt commented 5 years ago

> We need mono audio as the input (time-domain and/or FFT/frequency-domain data). I need a time instant as the output.

wait - what - i may have misunderstood you ...i was just thinking "neural networks" and had all sorts of associations (mainly nonlinear time series prediction). ...you want a time instant as output? what should that time instant represent?

RobinSchmidt commented 5 years ago

...what i want to say: if you are interested in modeling musical instruments with neural networks, i'm definitely interested and have ...some (*)...background in that area

(*) i have a minor in computer science with a focus on artificial intelligence ...but that was long ago...my code probably can't compete with tensorflow...idk

RobinSchmidt commented 5 years ago

well..."competing" with tensorflow is probably not the issue - maybe you want to use the api within a plugin? ...hmm.so far, i thought, tensorflow is a python thing but it seems to also available for c++

elanhickler commented 5 years ago

Essentially I want a transient detector based on machine learning, specifically for plucked sounds. If we imagine an FFT spectrogram to be like an image (a 2D representation of sound), we should be able to feed the network that image and get back an x-axis value for the transient location.

We could look into FFT functions to do the same thing, but you'd have to make more progress on the FFT functions first. Using machine learning instead means we can train the AI for specific kinds of instruments / audio / situations.

RobinSchmidt commented 5 years ago

> We could look into FFT functions to do the same thing

did you actually look into my OnsetDetector class (in rosic/analysis)? it actually does just that. i'm looking for sudden increases of frequency content from one FFT frame to the next. it takes a time-domain signal as input though - but internally, it computes a spectrogram and processes it on the fly (it just remembers the previous FFT frame).

i wonder how neural networks could be applied to this task and what their benefits are supposed to be compared to directly looking for increasing freq-content energy. you would need a lot of annotated data as well to train the network
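for reference, the core of that frame-to-frame comparison is what's usually called spectral flux. a simplified sketch of the idea (over a precomputed magnitude spectrogram for clarity - the actual class does it on the fly, keeping only the previous frame, and has more going on):

```cpp
// simplified sketch of spectral flux: for each fft frame, sum the magnitude
// increases over the previous frame. input: mags[frame][bin].
#include <vector>

std::vector<double> spectralFlux(const std::vector<std::vector<double>>& mags)
{
  std::vector<double> flux(mags.size(), 0.0);
  for(size_t n = 1; n < mags.size(); n++)
  {
    double sum = 0.0;
    for(size_t k = 0; k < mags[n].size(); k++)
    {
      double d = mags[n][k] - mags[n-1][k];  // magnitude change in bin k
      if(d > 0.0)                            // only count increases
        sum += d;
    }
    flux[n] = sum;
  }
  return flux;
}
```

an onset would then be reported wherever this detection function has a local maximum above some threshold.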

elanhickler commented 5 years ago

We have a lot of data to train it, and the advantage is that it could be trained on specific types of audio and give us specific locations in the audio.

OnsetDetector didn't work; it needed tweaking for my use case. But we can return to this once we have the harmonic extraction and resynthesis stuff more developed.

RobinSchmidt commented 5 years ago

yes, i think before looking into totally different approaches, it is worthwhile to investigate why the OnsetDetector didn't work on your material. maybe it really just needs a little tweaking. for example, i have a frequency weighting function that weights low-freq energy increases higher than high-freq energy increases - the reason being that it was written for beat detection in a full mixdown, where a bass-drum hit should count more than a hihat. what is the nature of your material? if it is phrases of plucked string playing, i think maybe higher freqs should be weighted more strongly
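concretely, the weighting is just a per-bin factor applied to the magnitude increases before they get summed. a sketch (the actual weighting function in the class is different - the linear ramp here is just an example shape, to show where the knob sits):

```cpp
// sketch: per-bin weight applied to the magnitude increases before summing.
// emphasizeLows = true suits beat detection in a full mix; false would
// emphasize the highs, e.g. for plucked-string material.
double binWeight(size_t k, size_t numBins, bool emphasizeLows)
{
  double w = double(k + 1) / double(numBins); // 0..1, rising with frequency
  return emphasizeLows ? 1.0 - w : w;
}
```

in a flux loop like the one sketched above, the accumulation then becomes sum += binWeight(k, numBins, emphasizeLows) * d;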

RobinSchmidt commented 5 years ago

i think the idea of looking for magnitude increases from one frame to the next in any bin... and somehow aggregating that over all bins... is very straightforward and meaningful. if you have an idea what to look for, you should look for that rather than applying a totally generic process. also, you typically need a pre-processing stage before the neural network anyway. for example, in speech recognition, you would feed a feature vector that may contain - for example - the frequencies of the first 3 formants and a noisiness measure (and maybe a few more features)... but not a full fft frame consisting of thousands of bins. these "features" would typically be extracted by classical dsp methods
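in code, such a pre-processing stage boils down to mapping each frame to a small feature vector, roughly like this (the features and names here are hypothetical, just to illustrate the shape of the interface):

```cpp
// hypothetical feature vector for one analysis frame - the functions that
// would fill it (formant tracking, noisiness estimation) are classical dsp,
// not part of the network itself
struct FrameFeatures
{
  double formant1, formant2, formant3; // frequencies of the first 3 formants in hz
  double noisiness;                    // 0 = purely harmonic ... 1 = pure noise
};

// the network then sees a handful of numbers per frame instead of thousands
// of raw fft bins:
//   FrameFeatures f = extractFeatures(fftFrame); // classical dsp (hypothetical)
//   double result   = network.process(f);        // small input layer
```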