Open cageyoko opened 5 months ago
There are three cases where I found the output obviously different from the reference labels.
When I said "hello", the model output was "hello, how are you".
When I repeated the same word, the model deleted the repeated part, and filled pauses such as "um" were also deleted. I think it
Some outputs are completely unrelated to the audio.
I think this can be explained by the model outputting corrected sentences based on semantic analysis rather than the original ASR results. Is that right?
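To make cases like these easier to report, here is a minimal sketch that compares a reference label with a model output at the word level and lists which words were added and which were dropped. It uses only Python's standard `difflib`; the function name `diff_words` and the second example sentence are my own assumptions for illustration, not something taken from the model or its tooling.

```python
import re
from difflib import SequenceMatcher


def diff_words(reference: str, hypothesis: str):
    """Return (inserted, deleted) word lists for hypothesis vs. reference.

    Hypothetical helper: lowercases and strips punctuation so that
    "hello," and "hello" count as the same word.
    """
    ref = re.findall(r"[a-z']+", reference.lower())
    hyp = re.findall(r"[a-z']+", hypothesis.lower())
    inserted, deleted = [], []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(ref[i1:i2])   # words in the label missing from the output
        if tag in ("insert", "replace"):
            inserted.extend(hyp[j1:j2])  # words in the output absent from the label
    return inserted, deleted


# Case 1 from this report: the model adds words that were never spoken.
print(diff_words("hello", "hello, how are you"))

# Case 2, with a made-up sentence: a repeated word and a filled pause are dropped.
print(diff_words("um I I want to go", "I want to go"))
```

A long inserted list with nothing deleted points to the first case (extra words), while deleted repeats and fillers with nothing inserted point to the second.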