In the previous code we had:
outputs = session.run(output_feed, input_feed)
transcribed_text = self.transcribe_from_prediction(outputs[0])
It was working, but only if the session processed one file at a time. The result we get in outputs is not a dense tensor, it's a sparse one. In this previous version you can see that the transcribe_from_prediction method was also different: it was not using the input directly as a dense tensor, it was using prediction.values. In a sparse tensor, every value different from 0 is stored in values. As we had only one result, we relied on the fact that the values were ordered in the sparse tensor, so we could simply read them from the values field in the given order.
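As an illustration, the single-file version could boil down to something like this sketch (the real body of the original transcribe_from_prediction may differ; the character table is an assumption):

# Sketch of the single-file case: with one audio file per run, the sparse
# output's `values` are already in sentence order and can be read straight
# through. The character table below is an assumption.
_CHAR_MAP = "abcdefghijklmnopqrstuvwxyz '"

def transcribe_from_prediction(prediction):
    return "".join(_CHAR_MAP[i] for i in prediction.values)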
The new version allows processing a batch of multiple files, so we can no longer proceed this way. Now the values field of the sparse tensor contains the values from every row. We can't easily tell where the transcription of a new file starts, and we are definitely not sure that the data is ordered one file at a time: we could have two letters from the first file, then the first letter from the second file, and so on. It would require parsing the indices field at the same time to assign each value to the correct file and the correct position in the sentence.
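To make the problem concrete, here is a hypothetical example of what the sparse output could look like for a batch of two files (the numbers are made up):

# Hypothetical sparse CTC output for a batch of 2 audio files.
# Each row of `indices` is [file, position in the sentence], and
# `values` holds the corresponding character indices into _CHAR_MAP.
indices = [[0, 0], [0, 1], [1, 0], [0, 2], [1, 1]]
values  = [19, 7, 13, 4, 14]
dense_shape = [2, 3]
# Read in order, `values` gives two characters of file 0, then one of
# file 1, then one of file 0 again. Without walking `indices` at the
# same time there is no way to tell where each sentence starts.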
Instead I chose to convert the sparse tensor to a dense one with:
tf.sparse_tensor_to_dense(outputs[0], default_value=len(_CHAR_MAP), validate_indices=True)
I set the default value in the dense tensor to len(_CHAR_MAP), which allows us to "find the end" of each sentence in the transcribe_from_prediction method.
In fact, each row in the dense tensor can be as long as "max_target_seq_length", but it usually isn't that long, and in a dense tensor we now need to distinguish real values from the default ones.
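To picture the result, here is a rough sketch with made-up numbers (assuming len(_CHAR_MAP) is 28 here):

# Hypothetical dense output for a batch of 2 files after
# tf.sparse_tensor_to_dense with default_value=len(_CHAR_MAP) (28 here).
# File 0 has 5 real character indices; file 1 has only 2, so the rest of
# its row is filled with the default value 28, which marks the end.
predictions = [[ 7,  4, 11, 11, 14],
               [19, 14, 28, 28, 28]]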
Thank you for your kind answer.
One more question: is the check if 0 <= index < len(_CHAR_MAP) needed in the code? Can the index be out of range?
Yes, because the default value passed to the sparse_tensor_to_dense method is len(_CHAR_MAP): if an audio file corresponds to a sentence with 156 characters, then in the corresponding row of the dense tensor the columns from 157 to "max_target_seq_length-1" will contain the value len(_CHAR_MAP), which is out of range. This is a trick, because at this point we don't have the real length of each sentence. In the sparse tensor you can think of the non-explicit values as if they were 'null'; we set them to len(_CHAR_MAP) in the dense tensor in order to be able to find the end of each sentence.
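For illustration, the check fits into the new transcribe_from_prediction roughly like this (a sketch; the real method may differ, and the character table is an assumption):

# Sketch: decode a dense batch of predictions, using the range check to
# ignore the padding value len(_CHAR_MAP).
_CHAR_MAP = "abcdefghijklmnopqrstuvwxyz '"   # assumed character table

def transcribe_from_prediction(predictions):
    sentences = []
    for row in predictions:                   # one row per audio file
        chars = []
        for index in row:
            if 0 <= index < len(_CHAR_MAP):   # skip the out-of-range padding
                chars.append(_CHAR_MAP[index])
        sentences.append("".join(chars))
    return sentences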
A more explicit way would have been to exploit the indices field to find the maximum position of each row and use it in combination with the values field to reconstruct each sentence. But I think this way is faster, although I didn't measure it.
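For comparison, the more explicit alternative would look roughly like this sketch (not measured either; _CHAR_MAP is the assumed character table from the earlier sketch):

# Sketch of the indices-based alternative: rebuild each sentence from the
# sparse output directly, without converting to a dense tensor.
def transcribe_from_sparse(sparse_output):
    num_files = int(sparse_output.dense_shape[0])
    rows = [{} for _ in range(num_files)]
    for (file_idx, pos), value in zip(sparse_output.indices, sparse_output.values):
        rows[file_idx][pos] = _CHAR_MAP[value]
    # Rebuild each sentence in position order.
    return ["".join(row[p] for p in sorted(row)) for row in rows]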
In the process_input function, I inserted this code:
predictions = session.run(tf.sparse_tensor_to_dense(outputs[0], default_value=len(_CHAR_MAP), validate_indices=True))
transcribed_text = self.transcribe_from_prediction(predictions)
I think that model.prediction in the previous version of the code does not merge repeated CTC labels. I don't know the reason for the problem. How does this line run? Could you explain this line and the previous code?