facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Results of ASR are incomplete #150

Open ysapolovych opened 1 year ago

ysapolovych commented 1 year ago

My issue seems very similar to https://github.com/facebookresearch/seamless_communication/issues/83, but I am using the Translator Python API with the ASR task. My input is 30 seconds long, and only about half of it gets transcribed:

from seamless_communication.models.inference import Translator
import torch

device = torch.device('cuda:0')

translator = Translator('seamlessM4T_medium',
                        vocoder_name_or_card='vocoder_36langs',
                        device=device,
                        dtype=torch.float32)

full_text, wav, out_sr = translator.predict(input='1694165663513.wav',
                                            task_str='ASR',
                                            tgt_lang='eng',
                                            src_lang='eng',
                                            sample_rate=16000,
                                            ngram_filtering=True)

I wonder whether the text_max_len_a, text_max_len_b, unit_max_len_a, and unit_max_len_b parameters of the predict method contribute to this (alas, they are undocumented). Changing them, however, made no difference.
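One way to narrow this down, offered as a rough sketch and not a confirmed fix, is to cut the waveform into shorter consecutive chunks, transcribe each chunk separately, and compare the joined text with the single-pass result. The helper names below (split_into_chunks, transcribe_in_chunks) and the transcribe callable are hypothetical and not part of seamless_communication; transcribe stands in for whatever wraps translator.predict for a single chunk.

```python
from typing import Callable, List, Sequence


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_seconds: float) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive chunks of at most chunk_seconds."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    # Number of samples per chunk; the last chunk may be shorter.
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def transcribe_in_chunks(samples: Sequence[float], sample_rate: int,
                         chunk_seconds: float,
                         transcribe: Callable[[Sequence[float]], str]) -> str:
    """Run a per-chunk transcriber (hypothetical callable) and join the results."""
    texts = [transcribe(chunk)
             for chunk in split_into_chunks(samples, sample_rate, chunk_seconds)]
    # Drop empty results and normalise the whitespace between chunk texts.
    return " ".join(t.strip() for t in texts if t.strip())
```

Hard cuts at fixed offsets can split a word in two, so this is only meant as a diagnostic: if the chunked output covers the whole recording while the single call does not, the truncation likely depends on input length.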

casic commented 1 year ago

Yes. If I try 60 seconds of audio, I get about 20 seconds transcribed. If I send 20 seconds, I get 10 seconds transcribed. If I send 10 seconds of audio, I get 5 seconds transcribed.

BakingBrains commented 1 year ago

Anything on this?

Thank you

lixikun commented 1 year ago

I have run into the same problem. Does anyone know the cause?