cywang97 / StreamingTransformer

Apache License 2.0

issue about Viterbi decoding step #4

Closed: ZhangChun96 closed this issue 3 years ago

ZhangChun96 commented 4 years ago

I applied your Viterbi decoding step to the AISHELL-1 dataset, and it seems to have run successfully. However, I don't quite understand the numbers in the generated alignment for each sentence. Does each number represent the starting frame of the corresponding token in that sentence? Thanks a lot!

cywang97 commented 4 years ago

Hi, Viterbi decoding returns the CTC path with the highest probability. Each number is the frame at which the corresponding token is generated on this path. The path is used for trigger-attention in streaming mode.
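For readers who want to see the idea concretely, here is a minimal pure-Python sketch of CTC Viterbi forced alignment. It is not the repository's implementation: the function name `ctc_viterbi_align`, the blank id `0`, and the choice to report the first frame at which each token appears on the best path are all assumptions made for illustration.

```python
import math


def ctc_viterbi_align(log_probs, tokens, blank=0):
    """Return, for each token, the first frame where the best CTC path emits it.

    log_probs: list of T frames, each a list of V log-probabilities.
    tokens: target token ids without blanks.
    (Hypothetical helper for illustration, not the repository's code.)
    """
    T = len(log_probs)
    # Extended label sequence: blank, tok1, blank, tok2, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    NEG = -math.inf

    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one state, or skip a blank
            # between two different tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands, key=lambda c: c[0])
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A valid path ends in the final blank or in the last token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("no valid CTC path: too few frames for the tokens")

    # Backtrace the state visited at every frame.
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Token i lives at extended state 2*i + 1; report its first frame.
    return [path.index(2 * i + 1) for i in range(len(tokens))]
```

Called with a `T x V` matrix of frame-level log-probabilities and a token id list, the function returns one frame index per token, which is the kind of per-token number discussed above.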

ZhangChun96 commented 4 years ago

Thanks a lot! There is another point I want to confirm. For each audio file, I compared the total number of feature frames with the alignment position of the last token and found that they differ by a constant factor. For example, for an utterance with 12 tokens and 500 frames, the alignment position of the last token may be 100. I think this is caused by the embedding layer: the number of frames used for CTC alignment should be the number of frames after embedding. Is my understanding correct? Thanks!

cywang97 commented 4 years ago

Yes, the generated number is the position after CNN downsampling (embedding). Since we use two CNN layers with stride 2, the length of the downsampled feature is about 1/4 of the length of the input utterance.
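The arithmetic can be sketched as follows. The thread only states that there are two CNN layers with stride 2; the kernel size of 3 and zero padding used here are assumptions, and the helper names are hypothetical, so the exact lengths may differ from what the repository produces.

```python
def conv_out_len(n, kernel=3, stride=2, padding=0):
    # Standard output-length formula for a 1-D convolution along time.
    # kernel=3 and padding=0 are assumed, not confirmed by the thread.
    return (n + 2 * padding - kernel) // stride + 1


def subsampled_len(n_frames, num_layers=2, kernel=3, stride=2, padding=0):
    # Apply the length formula once per convolution layer.
    n = n_frames
    for _ in range(num_layers):
        n = conv_out_len(n, kernel, stride, padding)
    return n


def to_input_frame(pos, num_layers=2, stride=2):
    # Approximate input frame for a downsampled position:
    # each stride-2 layer doubles the index.
    return pos * stride ** num_layers


print(subsampled_len(500))   # downsampled length for a 500-frame utterance
print(to_input_frame(100))   # approximate input frame for position 100
```

Under these assumptions a 500-frame utterance yields 124 downsampled positions, close to 500 / 4, and an alignment position of 100 corresponds to roughly input frame 400.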