The examples in the paper are all harmonic. Have you tried it with percussive/non-harmonic sounds? Does the model make any assumptions about that?

Hi! There are no prior assumptions about the signal, and I've tried training it on percussion (see https://caillonantoine.github.io/2021/11/18/rave.html for an audio example!) and it's working pretty well!
PS: I think Zero Point is a fantastic work of art! I've been listening to it over and over while writing the paper, haha!
Hey! Thanks for the quick reply. Ah, that's awesome. Gonna have to try this on a jungle drum loop dataset now. I'd tried it with DDSP, but that assumes everything is harmonic with a strong f0, which doesn't make sense for percussion.
Glad you liked Zero Point! Parts of that were done with my (awful) beatboxing and audio mosaicing to reconstruct my voice with proper drum loops, but that technique isn't real-time and the quality isn't great.
The goal is to beatbox live and reconstruct my voice with different trained models hooked up to the keys of a keyboard. Then when I hold down multiple keys, it would blend the models somehow like a kind of chord.
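Something like this is what I'm imagining for the blending part (totally untested sketch — it assumes the exported models expose `encode()`/`decode()` like the TorchScript export seems to, and the key-weighting scheme is entirely made up):

```python
import torch

def blend_models(audio, models, key_weights):
    """Crossfade reconstructions of the same input through several RAVE models.

    audio:       (batch, 1, samples) tensor of my beatboxing
    models:      exported/TorchScript RAVE models (assumed to have encode/decode)
    key_weights: one weight per model, e.g. how hard each key is held
    """
    mixed = None
    for model, w in zip(models, key_weights):
        z = model.encode(audio)   # latent trajectory under this model
        y = model.decode(z)       # reconstruction in that model's "timbre"
        mixed = w * y if mixed is None else mixed + w * y
    return mixed / (sum(key_weights) or 1.0)
```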
My main concern is that my voice will be so different from the training data that feeding it in won't translate the semantic parts (kicks, snares, etc.) to the corresponding semantic parts in the dataset.
I thought about training an autoencoder just on my voice, then, for each trained RAVE model, training a small "mapping" network to map between the two latent spaces (rough sketch below).
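Roughly what I mean by the mapping network — the latent sizes, the paired-data setup, and the plain MSE objective are all guesses on my part; in practice I'd need some way to pair voice latents with drum latents, e.g. time-aligned clips:

```python
import torch
import torch.nn as nn

VOICE_LATENT_DIM = 16   # guessed size of my voice autoencoder's latent
RAVE_LATENT_DIM = 16    # guessed size of the target RAVE model's latent

class LatentMapper(nn.Module):
    """Small MLP that maps voice latents into a RAVE model's latent space."""
    def __init__(self, in_dim=VOICE_LATENT_DIM, out_dim=RAVE_LATENT_DIM, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, z_voice):
        # z_voice: (batch, frames, in_dim) -> (batch, frames, out_dim)
        return self.net(z_voice)

def train_mapper(mapper, paired_latents, epochs=10, lr=1e-3):
    """Fit the mapper on (z_voice, z_drum) pairs, e.g. from time-aligned clips."""
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        for z_voice, z_drum in paired_latents:
            loss = nn.functional.mse_loss(mapper(z_voice), z_drum)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return mapper
```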
Anyway, this has gone way beyond a GitHub issue now. Thanks for making these awesome projects available for n00bs like me!
(p.s. the logo is great. Looks like something scrawled on the wall in a toilet cubicle)