domerin0 / rnn-speech

Character level speech recognizer using ctc loss with deep rnns in TensorFlow.
MIT License

Question this version change #25

Closed qqueing closed 7 years ago

qqueing commented 7 years ago

In the process_input function, this code was added:

predictions = session.run(tf.sparse_tensor_to_dense(outputs[0], default_value=len(_CHAR_MAP), validate_indices=True))
transcribed_text = self.transcribe_from_prediction(predictions)

I think that model.prediction in the previous version of the code did not merge identical CTC labels, and I don't understand the reason. How does this line work? Could you explain this line and the previous code?

AMairesse commented 7 years ago

In the previous code we had :

outputs = session.run(output_feed, input_feed)
transcribed_text = self.transcribe_from_prediction(outputs[0])

It was working, but only if the session processed one file at a time. The result we get in outputs is not a dense tensor, it's a sparse one. In that previous version you can see that the transcribe_from_prediction method was also different: it did not use its input directly as a dense tensor, it used prediction.values. In a sparse tensor, every explicitly set value (anything different from the default) is stored in values. Since we had only one result, and every value in the sparse tensor was ordered, we could simply read the transcription from the values field in the given order.

The new version allows processing a batch of multiple files, so we can no longer proceed this way. Now the values field of the sparse tensor contains values from every row. We can't easily tell where the transcription of a new file starts, and we are definitely not sure that the data is ordered one file at a time: we could have two letters from the first file, then the first letter of the second file, and so on. It would require parsing the indices field at the same time to assign each value to the correct file and the correct position in the sentence.
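As a rough illustration of the problem (the indices and values below are made up; a SparseTensor is essentially an (indices, values, dense_shape) triple), reading the values field straight through for a batch mixes the files together:

```python
# Hypothetical sparse output for a batch of two decoded files.
# Nothing guarantees that values are grouped one file at a time.
indices = [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1)]  # (file, position) pairs
values = ['h', 'i', 'y', '!', 'o']                  # one entry per index pair

# Reading `values` straight through, as the old single-file code did,
# interleaves characters from both files:
print(''.join(values))  # 'hiy!o' -- not 'hi!' followed by 'yo'
```

Untangling this would mean grouping each value by the row component of its index, which is exactly the extra bookkeeping described above.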

Instead I chose to convert the sparse tensor to a dense one with :

tf.sparse_tensor_to_dense(outputs[0], default_value=len(_CHAR_MAP), validate_indices=True)

I set the default value in the dense tensor to len(_CHAR_MAP), which allows us to "find the end" of each sentence in the transcribe_from_prediction method. Each row in the dense tensor can be as long as "max_target_seq_length", but it usually isn't that long, and in a dense tensor we need a way to distinguish real values from default ones.
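As a small sketch of this trick (the _CHAR_MAP contents and row data below are made up for illustration; only the out-of-range default value mirrors the approach described above):

```python
# Hypothetical character map; the real one lives in the repository.
_CHAR_MAP = ['a', 'b', 'c', 'd']
# Default value used when converting the sparse tensor to a dense one:
# it is outside the valid label range, so it marks padding.
DEFAULT = len(_CHAR_MAP)  # 4

# One row per audio file, padded to max_target_seq_length with DEFAULT.
dense = [
    [0, 1, 2, DEFAULT, DEFAULT],        # decodes to "abc"
    [3, 0, DEFAULT, DEFAULT, DEFAULT],  # decodes to "da"
]

def transcribe_row(row):
    # Keep only in-range labels; DEFAULT entries are padding.
    return ''.join(_CHAR_MAP[index] for index in row
                   if 0 <= index < len(_CHAR_MAP))

print([transcribe_row(r) for r in dense])  # ['abc', 'da']
```

Because the default value can never collide with a real label, the decoding loop needs no separate length information per row.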

qqueing commented 7 years ago

Thank you for your kind answer.

One more question: is the check "if 0 <= index < len(_CHAR_MAP)" needed in the code? Can index be out of range?

AMairesse commented 7 years ago

Yes, it is, because the default value for the sparse-to-dense conversion is len(_CHAR_MAP). So if an audio file corresponds to a sentence of 156 characters, then in the corresponding row of the dense tensor the columns from 156 to "max_target_seq_length - 1" will contain the value len(_CHAR_MAP), which is out of range. This is a trick, because at this point we don't have the real length of each sentence. In the sparse tensor you can think of the non-explicit values as if they were 'null'; we set them to len(_CHAR_MAP) in the dense tensor in order to be able to find the end of each sentence.

A more explicit way would have been to exploit the indices field to find the max column of each row and use it in combination with the values field to reconstruct each sentence. But I think this way is faster, although I didn't measure it.
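A sketch of that more explicit alternative, using made-up indices and values (the variable names are illustrative, not from the repository): compute each row's length from the max column index, then place each value at its (row, column) slot.

```python
# Hypothetical sparse output for a batch of two files.
indices = [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1)]  # (row, column) pairs
values = [2, 0, 3, 1, 0]                            # label ids

# Length of each sentence = max column index seen in that row + 1.
lengths = {}
for row, col in indices:
    lengths[row] = max(lengths.get(row, 0), col + 1)

# Place each value at its (row, col) slot to rebuild the sentences.
sentences = {row: [None] * n for row, n in lengths.items()}
for (row, col), v in zip(indices, values):
    sentences[row][col] = v

print(lengths)                                     # {0: 3, 1: 2}
print([sentences[r] for r in sorted(sentences)])   # [[2, 0, 1], [3, 0]]
```

This avoids the sentinel default value entirely, at the cost of an extra pass over the indices, which is presumably why the dense-conversion route was judged faster above.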