facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Odd results in ASR. Does it have a chat language model and text smoothing? #437

Open cageyoko opened 5 months ago

cageyoko commented 5 months ago

I found three cases where the model output obviously differs from the reference labels.

  1. When I say "hello", the model outputs "hello, how are you". [image] [image] [image]

  2. When I say the same word repeatedly, the model deletes the repeated part, and filled pauses such as "um" are also deleted. I think it [image] [image]

  3. Some outputs are completely unrelated to the audio. [image]

Could this be explained by the model outputting corrected sentences based on semantic analysis instead of the raw ASR results?
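One way to make these differences measurable is a word-level diff between the reference transcript and the model output. The following is a minimal sketch using only Python's standard `difflib`; it does not call anything from seamless_communication. The function name `word_diff` and the second example transcript are hypothetical, chosen only to illustrate cases 1 and 2.

```python
import string
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def word_diff(reference: str, hypothesis: str) -> tuple[list[str], list[str]]:
    """Return (inserted, deleted) words of the hypothesis relative to the reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    inserted, deleted = [], []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(ref[i1:i2])   # words spoken but missing from the output
        if tag in ("insert", "replace"):
            inserted.extend(hyp[j1:j2])  # words in the output that were never spoken
    return inserted, deleted


# Case 1 (from the report): the model adds words that were not spoken.
print(word_diff("hello", "hello, how are you"))
# -> (['how', 'are', 'you'], [])

# Case 2 (hypothetical transcript): a repetition and a filled pause are dropped.
print(word_diff("um I I want to go", "I want to go"))
# -> ([], ['um', 'i'])
```

A large `inserted` list points to added content as in case 1, while a `deleted` list made up of repeated words and fillers matches the pattern in case 2.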