deepsound-project / samplernn-pytorch

PyTorch implementation of SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
MIT License

SampleRNN as audio feature extractor #18

Open iariav opened 6 years ago

iariav commented 6 years ago

Hi, this is more a question than an issue. I'm looking for a way to extract features from raw audio WAV files and then use these features for different tasks such as voice recognition, voice activity detection, and such, not for generative tasks. I thought of somehow modifying a generative model like SampleRNN/WaveNet so it could be used only to encode the data into some feature space. Can you please give some pointers on what modifications I need to make to the model to achieve that? Has anyone already done this before? Any help would be greatly appreciated.

Cortexelus commented 6 years ago

You have a sequence (an audio clip) and want to classify it using an RNN (SampleRNN). Perhaps the output is a vector classifying speaker_id.

Often I see this done by running the RNN through the entire clip, then, using the final state of the RNN, adding more layers (fully connected perhaps), and finally a softmax layer. If you have 10 speakers, your softmax layer outputs a vector of size 10. (You use cross-entropy loss because it's multiclass classification.)

Because there may be 100,000+ timesteps, a possible compute hurdle is the backpropagation through time. But in this case, instead of doing TBPTT (truncated BPTT) at each time step (as for generation), you only need to do one full BPTT at the end. So my guess is it should be faster than generative SampleRNN.

You need to take out the TBPTT and the next-sample prediction at every time step. Instead, you wait until it's done reading the entire audio sequence. Get the final RNN state, connect it to the new layers, predict the speaker_id, then do a full BPTT through all the timesteps.
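To make the steps above concrete, here is a minimal sketch in PyTorch. It is not SampleRNN and not code from this repo: it assumes a plain single-tier `nn.GRU` that reads the clip in fixed-size frames, and the names `FinalStateClassifier`, `frame_size`, `hidden_size`, and `num_speakers` are made up for illustration. The audio and labels are random placeholders.

```python
import torch
import torch.nn as nn


class FinalStateClassifier(nn.Module):
    """Hypothetical classifier: read the whole clip, classify from the final RNN state."""

    def __init__(self, frame_size=16, hidden_size=64, num_speakers=10):
        super().__init__()
        self.frame_size = frame_size
        # Assumption: a plain GRU over frames stands in for the SampleRNN tiers.
        self.rnn = nn.GRU(input_size=frame_size, hidden_size=hidden_size, batch_first=True)
        # The "new layers" on top of the final state.
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_speakers),
        )

    def forward(self, audio):
        # audio: (batch, num_samples) of float samples
        batch, num_samples = audio.shape
        usable = num_samples - num_samples % self.frame_size  # drop the ragged tail
        frames = audio[:, :usable].reshape(batch, -1, self.frame_size)
        _, h_n = self.rnn(frames)   # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])   # logits, one per speaker


torch.manual_seed(0)
model = FinalStateClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

audio = torch.randn(4, 1600)              # placeholder batch of 4 short clips
speaker_id = torch.tensor([0, 3, 7, 9])   # placeholder labels

logits = model(audio)
# cross_entropy applies log-softmax itself, so the model returns raw logits.
loss = nn.functional.cross_entropy(logits, speaker_id)

optimizer.zero_grad()
loss.backward()   # one full BPTT through every frame of the clip
optimizer.step()
```

There is no per-timestep loss here, only a single loss at the end of the clip. For feature extraction rather than classification, the final state `h_n[-1]` could be used directly as the clip embedding.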

That's one place to start.

Often I see a bidirectional RNN (run a forwards-time RNN AND a backwards-time RNN, then concatenate the final states of both before your fully connected layers and the top layer) getting better results for this kind of task.
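A rough sketch of the bidirectional variant, again assuming a plain `nn.GRU` over frames rather than SampleRNN itself. `BiFinalStateClassifier` and its `encode` method are hypothetical names, and the input is random placeholder audio.

```python
import torch
import torch.nn as nn


class BiFinalStateClassifier(nn.Module):
    """Hypothetical bidirectional version: concatenate both final states."""

    def __init__(self, frame_size=16, hidden_size=64, num_speakers=10):
        super().__init__()
        self.frame_size = frame_size
        self.rnn = nn.GRU(frame_size, hidden_size, batch_first=True, bidirectional=True)
        # Input to the head is twice as wide: forward state + backward state.
        self.head = nn.Linear(2 * hidden_size, num_speakers)

    def encode(self, audio):
        # audio: (batch, num_samples) -> clip embedding (batch, 2 * hidden_size)
        batch, num_samples = audio.shape
        usable = num_samples - num_samples % self.frame_size
        frames = audio[:, :usable].reshape(batch, -1, self.frame_size)
        _, h_n = self.rnn(frames)   # h_n: (2 * num_layers, batch, hidden_size)
        # For the last layer, h_n[-2] is the forward direction's final state
        # and h_n[-1] is the backward direction's final state.
        return torch.cat([h_n[-2], h_n[-1]], dim=1)

    def forward(self, audio):
        return self.head(self.encode(audio))


torch.manual_seed(0)
model = BiFinalStateClassifier()
audio = torch.randn(4, 1600)        # placeholder batch of 4 short clips
features = model.encode(audio)      # fixed-size feature vector per clip
logits = model(audio)               # speaker logits
```

The `encode` output is the kind of fixed-size feature vector the original question asks about. Note that a bidirectional RNN needs the whole clip up front, so it does not suit streaming tasks.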

I haven't seen SampleRNN specifically used for this. Normally I see people run conv nets on spectrograms for audio classification.
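For comparison, a very small sketch of the spectrogram route. The STFT settings, layer sizes, and the names `log_spectrogram` and `SpectrogramCNN` are all assumptions picked for illustration, not a recommended architecture, and the input is random placeholder audio.

```python
import torch
import torch.nn as nn


def log_spectrogram(audio, n_fft=256, hop_length=128):
    # audio: (batch, num_samples) -> (batch, 1, n_fft // 2 + 1, num_frames)
    window = torch.hann_window(n_fft)
    stft = torch.stft(audio, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return torch.log1p(stft.abs()).unsqueeze(1)   # add a channel dimension


class SpectrogramCNN(nn.Module):
    """Hypothetical tiny conv net that treats the spectrogram as an image."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over frequency and time
            nn.Flatten(),              # -> (batch, 16)
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, spec):
        return self.head(self.features(spec))


torch.manual_seed(0)
audio = torch.randn(4, 1600)    # placeholder batch of 4 short clips
spec = log_spectrogram(audio)
model = SpectrogramCNN()
logits = model(spec)
```

Because of the global average pooling, clips of different lengths map to the same feature size, and there is no backpropagation through 100,000+ timesteps.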