SeanNaren / deepspeech.pytorch

Speech Recognition using DeepSpeech2.

Does the LSTM adapt the size of the output char sequence during training for a given utterance? #101

Closed dlmacedo closed 7 years ago

dlmacedo commented 7 years ago

Dear Friends,

Is the size of the model's output (the sequence of chars) a deterministic function of the number of frames in the utterance?

Does the LSTM adapt the size of the output char sequence during training for a given utterance (same number of frames)?

Or, for a given number of frames in the utterance, will the model always predict an output sequence with the same number of characters?

Thanks for the answer,

David

SeanNaren commented 7 years ago

Yes, this is determined by the convolutional layers and the parameters you choose for the spectrogram. Currently, one second of audio produces 100 timesteps, which after the DeepSpeech convolutions are reduced to 50 time steps on which the character predictions are made.
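For intuition, here is a rough sketch (not the repo's exact code) of how those numbers come about. The 10 ms window stride and overall time stride of 2 in the convolutional front end are assumptions chosen to match the 100 → 50 figures above:

```python
# Rough sketch: how spectrogram and conv parameters determine the number
# of output timesteps per utterance (illustrative defaults, not the repo's code).

def output_timesteps(audio_seconds, window_stride=0.01, conv_time_stride=2):
    """Estimate how many timesteps the network emits for an utterance.

    Assumptions (hypothetical, for illustration):
    - STFT frames are taken every `window_stride` seconds,
      so 1 s of audio -> ~100 spectrogram frames at a 10 ms stride.
    - The convolutional front end reduces the time axis by an overall
      factor of `conv_time_stride`, giving ~50 timesteps per second.
    """
    spectrogram_frames = int(audio_seconds / window_stride)  # ~100 per second
    return spectrogram_frames // conv_time_stride            # ~50 per second

print(output_timesteps(1.0))  # -> 50 timesteps for one second of audio
```

So the output length depends only on the audio length and these fixed strides, not on how many characters are actually spoken.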

dlmacedo commented 7 years ago

So multiple time steps may be needed to predict a single character, right?

How do we know how many time steps are needed to predict each character, since we have more time steps than characters?

When predicting, how does the decoder know how to aggregate the per-timestep probabilities into the characters that appear in the transcript?

ryanleary commented 7 years ago

I recommend you read the following papers for background on this kind of architecture and the CTC loss function. Most if not all are available on arXiv.

  1. Graves, Jaitly; Towards End-to-End Speech Recognition with Recurrent Neural Networks
  2. Maas, et al.; Lexicon-Free Conversational Speech Recognition with Neural Networks
  3. Hannun, et al.; First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs
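
As a concrete illustration of the CTC idea those papers describe, here is a minimal greedy-decoding sketch (not deepspeech.pytorch's actual decoder): the network emits one label (or a special blank) per timestep, and the decoder collapses repeated labels and removes blanks, which is how many timesteps reduce to far fewer characters. The label set and helper below are hypothetical:

```python
# Minimal greedy CTC decode sketch (illustrative only): collapse repeated
# labels, then drop the blank symbol.

import numpy as np

BLANK = 0
LABELS = "_abcdefghijklmnopqrstuvwxyz '"  # index 0 is the CTC blank

def greedy_ctc_decode(log_probs):
    """log_probs: (timesteps, num_labels) array of per-timestep scores."""
    best_path = np.argmax(log_probs, axis=1)  # most likely label at each step
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != BLANK:      # skip repeats and blanks
            decoded.append(LABELS[idx])
        prev = idx
    return "".join(decoded)

# Example: 6 timesteps decoding to "cat" (c=3, a=1, t=20, blank=0)
fake = np.full((6, len(LABELS)), -10.0)
for t, label in enumerate([3, 3, 0, 1, 20, 20]):
    fake[t, label] = 0.0
print(greedy_ctc_decode(fake))  # -> "cat"
```

The CTC loss trains the network so that, summed over all alignments that collapse to the reference transcript, the probability of the transcript is maximized; the decoder never needs to be told in advance how many timesteps correspond to each character.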
dlmacedo commented 7 years ago

Thanks!