Closed ZhangChun96 closed 3 years ago
Hi, Viterbi decoding returns the CTC path with the highest probability. Each number is the index of the frame at which the corresponding token is emitted on this path. The path is used for trigger-attention in streaming mode.
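For anyone who wants to see the idea in code, here is a minimal sketch of Viterbi alignment over a CTC lattice. It is not the repository's implementation: the function name `ctc_viterbi_align`, the input layout (a list of per-frame log-probability lists), and the choice to report the first frame of each token are all assumptions made for illustration. It also assumes there are enough frames for the token sequence to fit.

```python
import math


def ctc_viterbi_align(log_probs, tokens, blank=0):
    """Hypothetical helper: frame index where the best CTC path first emits each token.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    tokens:    target token ids, without blanks.
    """
    # Extended label sequence: blank, tok1, blank, tok2, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)

    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            back[t][s] = best
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]

    # A complete path ends in the last token or in the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2

    # Backtrace to recover the state visited at every frame.
    states = [0] * T
    for t in range(T - 1, -1, -1):
        states[t] = s
        s = back[t][s]

    # Token states sit at odd positions of `ext`; record where each one begins.
    frames = []
    prev = None
    for t, st in enumerate(states):
        if st % 2 == 1 and st != prev:
            frames.append(t)
        prev = st
    return frames
```

Called with the CTC log-posteriors of one utterance and its token ids, this returns one frame index per token, which is the kind of per-token number discussed above. The indices refer to the frames fed to the CTC layer.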
Thanks a lot! There is one more thing I would like to confirm. For each utterance, I compared the total number of audio feature frames with the alignment result of the last token, and found that they differ by a constant factor. For example, an utterance with 12 tokens and 500 frames may have an alignment result of 100 for the last token. I think this is caused by the embedding layer: the number of frames used for CTC alignment should be the number of frames after embedding. Is my understanding correct? Thanks!
Yes, the generated number is the position after CNN downsampling (embedding). As we use two CNN layers with stride 2, the length of the downsampled feature sequence is about 1/4 of the length of the input utterance.
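To make the factor of 4 concrete, here is a rough sketch of mapping an alignment index back to input frames and to time. The helper names, the kernel size of 3 without padding, and the 10 ms frame shift are assumptions for illustration, not values confirmed in this thread; the exact downsampled length depends on the actual kernel size and padding, so the mapping is approximate.

```python
def subsampled_length(num_frames, kernel=3, stride=2, layers=2):
    """Length after `layers` unpadded conv layers (kernel/stride are assumed values)."""
    length = num_frames
    for _ in range(layers):
        length = (length - kernel) // stride + 1
    return length


def to_input_frame(align_index, factor=4):
    """Approximate input-frame index for a position in the downsampled sequence."""
    return align_index * factor


def to_seconds(align_index, factor=4, frame_shift_ms=10.0):
    """Approximate time of a downsampled position, assuming a 10 ms frame shift."""
    return to_input_frame(align_index, factor) * frame_shift_ms / 1000.0


# With the numbers from the question: 500 input frames, last token aligned at 100.
print(subsampled_length(500))  # 124, close to 500 / 4
print(to_input_frame(100))     # 400
print(to_seconds(100))         # 4.0
```

So an alignment value of 100 for the last token corresponds to roughly input frame 400 of the 500-frame utterance.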
I applied your Viterbi decoding step to the aishell1 dataset and it seems to run successfully, but I don't quite understand the meaning of the numbers in the generated alignment for each sentence. Does each number represent the starting frame of the corresponding token in the sentence? Thanks a lot!