chenting0324 opened this issue 7 years ago
Hi, the code doesn't implement the beam search algorithm, right? So what does the code use to decode the CTC network? And it isn't end-to-end speech recognition, right? Thank you!
Hi, beam search is used in the acoustic model. It is a method already implemented by TensorFlow; you can find the call here:
```python
# Compute the prediction which is the best "path" of probabilities
# for each item of the batch
decoded, _log_prob = tf.nn.ctc_beam_search_decoder(logits, self.input_seq_lengths)
```
`decoded` is the list of solutions, ordered by descending probability, so we then use `decoded[0]`, which is the output with the highest probability.
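If it helps, here is a minimal sketch of turning that best path into text, continuing from the decoder call above. `decoded[0]` is a `SparseTensor` of label indices, so you densify it and map indices back to characters. `CHAR_MAP` is an illustrative placeholder (the repo's actual alphabet and ordering may differ), and a real run would of course need the model's `feed_dict`:

```python
import tensorflow as tf

# Illustrative character map; the repo's actual alphabet may differ.
CHAR_MAP = "abcdefghijklmnopqrstuvwxyz '"

# decoded[0] (from the ctc_beam_search_decoder call above) is a
# SparseTensor of label indices for the most probable path.
dense_best = tf.sparse_tensor_to_dense(decoded[0], default_value=-1)

with tf.Session() as sess:
    rows = sess.run(dense_best)  # shape: [batch_size, max_decoded_length]
    for row in rows:
        # -1 marks the padding introduced by the densification above
        print("".join(CHAR_MAP[i] for i in row if i != -1))
```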
This is end-to-end because you give the network sound and it gives back the sentence. The only thing lacking is a language model, which would vastly improve the WER. But it's quite some work to implement because you have to do the beam search on the result of the language model, or maybe on the sum of the two models' beams, I'm not sure. I haven't looked at it yet, too much work for now.
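For what it's worth, one simple way to start (a sketch, not this repo's method) is to rescore the n-best list the beam search already produces, combining each candidate's acoustic log-probability with a language-model score. Everything here is illustrative: `lm_log_prob` stands in for whatever LM you plug in, and `alpha`/`beta` would need tuning on a dev set:

```python
def rescore(candidates, acoustic_log_probs, lm_log_prob, alpha=0.8, beta=1.0):
    """Re-rank beam candidates by acoustic + alpha*LM + beta*word_count."""
    scored = []
    for sentence, am_score in zip(candidates, acoustic_log_probs):
        total = (am_score
                 + alpha * lm_log_prob(sentence)        # LM score, hypothetical
                 + beta * len(sentence.split()))        # length bonus
        scored.append((total, sentence))
    scored.sort(reverse=True)  # highest combined score first
    return [sentence for _, sentence in scored]
```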
Hi, I also set "max input length" to 65535 and used the command `python stt.py --file "filename"`. The file is an MP3 about 2 minutes long, so it takes a really long time. Is there a better way to solve the problem?
Hi, creating a model with such a high value would indeed probably take a lot of time!
The best way to proceed would be to have the `process_input` method in `AcousticModel` take care of slicing the input array: whatever value you set for max input length, the data should be sliced into chunks of that size and a loop would process them chronologically (see the sketch below).
Of course you would have to keep the `rnn_state` during the loop. If you are interested in looking into that issue, you should probably try it on the dev branch: the code for the model is cleaner there, and the internal state of the RNN is kept until you explicitly run the `rnn_state_zero_op`, so that should be easier than on the master branch.
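A minimal sketch of that loop, with everything hypothetical: `run_chunk` stands for one forward pass through the model that preserves the internal RNN state between calls, and `reset_state` stands for running `rnn_state_zero_op`:

```python
import numpy as np

def process_long_input(features, max_input_length, run_chunk, reset_state):
    """Slice a [time, feature_dim] array into max_input_length-sized
    chunks and feed them in order, letting the RNN state carry over."""
    reset_state()                         # start from a clean state
    outputs = []
    for start in range(0, len(features), max_input_length):
        chunk = features[start:start + max_input_length]
        outputs.append(run_chunk(chunk))  # state is preserved across calls
    return np.concatenate(outputs)
```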