domerin0 / rnn-speech

Character level speech recognizer using ctc loss with deep rnns in TensorFlow.
MIT License

"max_input_seq_length" #26

Closed squiba closed 7 years ago

squiba commented 7 years ago

Why is this input provided to AcousticModel? The model's parameters don't depend on it, since it is just the number of timesteps the network is unrolled for (please tell me if I'm wrong). We can always unroll the rnn according to the input instance.

Doesn't it create some inefficiency, since the smaller inputs are padded with zeros? The model could be quite flexible if this input were removed.

Instead we could give an input of any length to the rnn:

rnn_output, self.hidden_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=self.input_seq_lengths,
                                                  initial_state=init_state, time_major=True)

So, depending on the length of the audio instance, the network will unroll itself accordingly.

AMairesse commented 7 years ago

I'm not sure I understand your question correctly. "max_input_seq_length" is in fact the maximum size we allow for an audio input. This is because the model is unrolled: it needs to be unrolled during training, since that's the only way the CTC algorithm can compute the loss and the resulting gradients.

In the code you're quoting, the "sequence_length" parameter is a tensor containing the real length of each sample in the batch. This way the rnn does not need to unroll to the maximum number of timesteps, only far enough for the longest sample in the batch. The padding is needed because we store the data in a tensor, so we need a fixed-size matrix, and the only way to get one is to set a maximum. Technically it's the input tensor that constrains the maximum unrolling.
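To make this concrete, here is a minimal sketch of the padding plus sequence_length idea, not the repository's actual code; the sizes (max_input_seq_length = 1000, 120 features, 256 units) and the pad_batch helper are illustrative assumptions:

import numpy as np
import tensorflow as tf

max_input_seq_length = 1000   # assumed value, for illustration only
num_features = 120            # assumed feature vector size

def pad_batch(samples):
    # samples: list of [time, num_features] arrays, each with time <= max_input_seq_length
    lengths = np.array([s.shape[0] for s in samples], dtype=np.int32)
    batch = np.zeros((max_input_seq_length, len(samples), num_features), dtype=np.float32)
    for i, s in enumerate(samples):
        batch[:s.shape[0], i, :] = s      # zero padding beyond the real length
    return batch, lengths

inputs = tf.placeholder(tf.float32, [max_input_seq_length, None, num_features])
input_seq_lengths = tf.placeholder(tf.int32, [None])
cell = tf.nn.rnn_cell.BasicLSTMCell(256)

# sequence_length gives the real length of each sample, so dynamic_rnn only
# iterates up to the longest sample in the batch and ignores the padded frames.
rnn_output, hidden_state = tf.nn.dynamic_rnn(cell, inputs,
                                             sequence_length=input_seq_lengths,
                                             time_major=True, dtype=tf.float32)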

However, this will not be a limitation when translating audio to text outside of the training situation. It's currently not implemented, but the main loop should run a small portion of audio through the rnn in a loop, keeping the hidden_state between calls to the rnn.
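A minimal sketch of that streaming idea (not implemented in the repository at the time of this thread; chunk size, feature size and cell size are assumptions):

import numpy as np
import tensorflow as tf

chunk_length = 50             # assumed number of frames fed per call
num_features = 120            # assumed feature vector size
cell = tf.nn.rnn_cell.BasicLSTMCell(256)

inputs = tf.placeholder(tf.float32, [chunk_length, 1, num_features])
prev_c = tf.placeholder(tf.float32, [1, cell.state_size.c])
prev_h = tf.placeholder(tf.float32, [1, cell.state_size.h])
init_state = tf.nn.rnn_cell.LSTMStateTuple(prev_c, prev_h)

rnn_output, next_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=init_state,
                                           time_major=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = (np.zeros((1, 256), np.float32), np.zeros((1, 256), np.float32))
    audio_features = np.zeros((200, 1, num_features), np.float32)   # dummy audio stand-in
    for chunk in np.split(audio_features, 4):
        # carry the hidden state from one chunk to the next
        out, state = sess.run([rnn_output, next_state],
                              {inputs: chunk, prev_c: state[0], prev_h: state[1]})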

squiba commented 7 years ago

Thanks for explaining the importance of "sequence_length" (I wasn't aware of this).

Can "max_input_seq_length" be initialized to 'None'(dynamic shape) and given as max(input_seq_lengths) at runtime.

AMairesse commented 7 years ago

Well, that's a good question, I never thought of it... In fact only the input and the logits depend on it, but it might work with a placeholder instead of a fixed value. I'm not sure, because I don't know if tf.split would work with a placeholder, but it should be worth a try... I'm currently working on some improvements from https://arxiv.org/pdf/1609.05935v2.pdf and I won't be available next week, but I'll keep this open to try. Feel free to investigate if you want, of course!

AMairesse commented 7 years ago

Having tried it, I can confirm that max_input_seq_length is needed. Another optimization is available in the dev branch: using size ordering, the model is destroyed and re-created for each checkpoint. This way the unroll is limited to the needed value and will only reach max_input_seq_length for the largest files of the training set (or for the test set). Training time is up to 3 times faster at the beginning with this method, but the learning curve seems to be less effective. Training is currently in progress, and if the result is at least as good as the current pre-trained model, I will release a new pre-trained model when merging the dev branch.
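Roughly, the size-ordering idea looks like this sketch; build_model, train, save_checkpoint and destroy are hypothetical names standing in for the dev-branch code:

def train_in_length_order(files, lengths, build_model, slice_size=1000):
    # Sort the training files by audio length, then process them in slices.
    order = sorted(range(len(files)), key=lambda i: lengths[i])
    for start in range(0, len(order), slice_size):
        idx = order[start:start + slice_size]
        local_max = max(lengths[i] for i in idx)          # unroll only this far
        model = build_model(max_input_seq_length=local_max)
        model.train([files[i] for i in idx])              # assumed training call
        model.save_checkpoint()                           # assumed checkpoint call
        model.destroy()                                   # graph re-created for the next slice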

Note: pre-trained models from the master branch cannot be used in the dev branch for now because of a bug in master that is corrected in the dev branch ("/output_b:0" was not saved in checkpoint files).