dlmacedo closed this issue 7 years ago
Yes, this is based on the convolutional net and the parameters you choose for the spectrogram. Currently, one second of audio produces 100 time steps which, after going through DeepSpeech, become 50 time steps on which character predictions are made.
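A minimal sketch of the arithmetic (not the repo's exact code; the 20 ms window, 10 ms stride, and conv stride of 2 are assumptions based on common DeepSpeech-style configurations):

```python
# Sketch: how 1 s of audio becomes ~100 spectrogram frames, then ~50 output steps.
sample_rate = 16000          # samples per second (assumed)
window_size = 0.02           # 20 ms analysis window (assumed)
window_stride = 0.01         # 10 ms hop between spectrogram frames (assumed)

audio_seconds = 1.0
num_samples = int(audio_seconds * sample_rate)

# Number of spectrogram frames (time steps fed to the network).
win = int(window_size * sample_rate)
hop = int(window_stride * sample_rate)
spec_frames = 1 + (num_samples - win) // hop
print(spec_frames)           # ~99-100 frames for 1 s of audio

# The convolutional front end strides over time (stride 2 assumed here),
# roughly halving the number of time steps the recurrent/CTC layers see.
conv_time_stride = 2
output_steps = spec_frames // conv_time_stride
print(output_steps)          # ~50 time steps for character predictions
```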
So multiple time steps are needed to predict a single character, right?
How do we know how many time steps are needed to predict each character, since we have more time steps than characters?
When predicting, how does the decoder know how to aggregate many time steps of probabilities to generate each character that appears in the transcript?
I recommend you read the following papers for background on this kind of architecture and the CTC loss function. Most, if not all, are available on arXiv.
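To give a rough intuition for how many time steps collapse into fewer characters, here is a minimal greedy CTC decoding sketch. This is an illustration, not the repo's actual decoder (which may use beam search): take the argmax at each time step, merge repeated labels, then drop blanks.

```python
import numpy as np

labels = ['_', 'a', 'b', 'c']   # index 0 is the CTC blank (assumed alphabet)

def greedy_ctc_decode(probs, labels, blank_index=0):
    """probs: (time_steps, num_labels) array of per-step probabilities."""
    best_path = np.argmax(probs, axis=1)          # one label index per time step
    decoded = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank_index:    # merge repeats, skip blanks
            decoded.append(labels[idx])
        prev = idx
    return ''.join(decoded)

# Toy example: 6 time steps collapse to the 2-character transcript "ab".
probs = np.array([
    [0.1, 0.8, 0.05, 0.05],   # 'a'
    [0.1, 0.8, 0.05, 0.05],   # 'a' (repeat, merged)
    [0.9, 0.05, 0.03, 0.02],  # blank (dropped)
    [0.1, 0.05, 0.8, 0.05],   # 'b'
    [0.1, 0.05, 0.8, 0.05],   # 'b' (repeat, merged)
    [0.9, 0.05, 0.03, 0.02],  # blank (dropped)
])
print(greedy_ctc_decode(probs, labels))  # -> "ab"
```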
Thanks!
Dear Friends,
Is the size of the output (the sequence of chars) of the model a deterministic function of the number of frames of the utterance?
Does the LSTM adapt the size of the output char sequence during training for a given utterance (same number of frames)?
Or, for a given number of frames in the utterance, will the model always predict an output sequence with the same number of characters?
Thanks for the answer,
David