SampleRNN as audio feature extractor

You have a sequence (an audio clip) and want to classify it using an RNN (SampleRNN). Perhaps the target is a speaker_id label.

Often I see this done by running the RNN over the entire clip, then taking the final state of the RNN and adding more layers on top (fully connected, perhaps), and finally a softmax layer. If you have 10 speakers, your softmax output is a vector of size 10. (You use cross-entropy loss because it's multiclass classification.)
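Here's a rough sketch of that layout in PyTorch. To be clear, this is not SampleRNN's tiered architecture: `FinalStateClassifier` is a hypothetical name, and the plain GRU, the layer sizes, and the random tensor standing in for audio are all assumptions made just to show the shape of the idea.

```python
import torch
import torch.nn as nn

class FinalStateClassifier(nn.Module):
    """Hypothetical stand-in: a plain GRU plus FC head, not SampleRNN's tiers."""

    def __init__(self, n_speakers=10, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # Extra fully connected layers on top of the final RNN state.
        self.head = nn.Sequential(
            nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, n_speakers)
        )

    def forward(self, audio):
        # audio: (batch, timesteps) -> (batch, timesteps, 1)
        _, h_n = self.rnn(audio.unsqueeze(-1))
        # h_n[-1] is the final hidden state of the last layer: (batch, hidden)
        return self.head(h_n[-1])  # unnormalized logits: (batch, n_speakers)

torch.manual_seed(0)
model = FinalStateClassifier()
audio = torch.randn(4, 200)               # random stand-in for 4 short clips
speaker_id = torch.tensor([0, 3, 7, 9])   # integer class labels

logits = model(audio)
probs = torch.softmax(logits, dim=1)      # size-10 vector per clip
# CrossEntropyLoss applies log-softmax itself, so it takes the raw logits.
loss = nn.CrossEntropyLoss()(logits, speaker_id)
```

Note that in PyTorch the softmax is folded into the loss, so the model returns logits and you only call `softmax` explicitly when you want probabilities.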

Because there may be 100,000+ timesteps, a possible compute hurdle is backpropagation through time. But in this case, instead of doing TBPTT with a next-sample loss at each time step (as for generation), you only need to do one full BPTT at the end. So my guess is that it should be faster than generative SampleRNN.

You need to take out the TBPTT and the next-sample prediction at every time step. Instead, you wait until it's done reading the entire audio sequence, get the final RNN state, connect it to the new layers, predict the speaker_id, then do one full BPTT through all the timesteps.
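A minimal sketch of one such training step, again assuming a plain GRU as a stand-in for SampleRNN, with made-up sizes and random data. The input only has `requires_grad=True` so you can check that the single backward pass reaches the very first timestep.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=1, hidden_size=32, batch_first=True)  # stand-in RNN
head = nn.Linear(32, 10)                                      # 10 speakers
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# (batch, timesteps, 1); grad is tracked on the input only to inspect the flow.
clip = torch.randn(2, 30, 1, requires_grad=True)
speaker_id = torch.tensor([1, 4])

optimizer.zero_grad()
_, h_n = rnn(clip)                 # read the whole clip, no per-step loss
logits = head(h_n[-1])             # classify from the final state only
loss = nn.functional.cross_entropy(logits, speaker_id)
loss.backward()                    # one full BPTT through every timestep
optimizer.step()

# Gradient magnitude that arrived at timestep 0 of the input.
first_step_grad = clip.grad[:, 0].abs().sum().item()
```

One thing to watch: autograd keeps the activations for every timestep until `backward()` runs, so memory grows with clip length. On very long clips that may be the real limit rather than speed.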

That's one place to start.

Often I see a bidirectional RNN (run a forward-time RNN and a backward-time RNN, then concatenate the final states of both before the fully connected layers at the top) giving better results for this kind of task.
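The bidirectional variant could look roughly like this. Same caveats as before: a generic GRU with assumed sizes and random input, not anything taken from the SampleRNN code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 32
birnn = nn.GRU(input_size=1, hidden_size=hidden,
               batch_first=True, bidirectional=True)
head = nn.Linear(2 * hidden, 10)   # input is both final states concatenated

clip = torch.randn(3, 100, 1)      # (batch, timesteps, 1), random stand-in
_, h_n = birnn(clip)               # h_n: (2, batch, hidden) for a single layer

forward_final = h_n[0]             # state after reading t = 0 .. T-1
backward_final = h_n[1]            # state after reading t = T-1 .. 0
features = torch.cat([forward_final, backward_final], dim=1)  # (batch, 2*hidden)
logits = head(features)
```

Take the final states from `h_n` rather than the last timestep of the output sequence: at the last timestep the backward direction has only seen one sample.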

I haven't seen SampleRNN specifically used for this. Normally I see people run conv nets on spectrograms for audio classification.
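For comparison, the spectrogram route might be sketched like this. The STFT settings, the tiny conv net, and the random input are all assumptions for illustration, not a recommended configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_fft, hop = 256, 128
clip = torch.randn(2, 4000)        # (batch, samples), random stand-in for audio

# Short-time Fourier transform -> complex (batch, n_fft // 2 + 1, frames)
spec = torch.stft(clip, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)
log_mag = torch.log1p(spec.abs())  # log-magnitude spectrogram

convnet = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),       # pool over frequency and time
    nn.Flatten(),
    nn.Linear(16, 10),             # 10 speakers
)
# Add a channel dimension so the spectrogram is treated as a 1-channel image.
logits = convnet(log_mag.unsqueeze(1))
```

The sequence the network sees here is a few dozen frames instead of thousands of raw samples, which is a big part of why this approach is the usual one.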

deepsound-project / samplernn-pytorch
