facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

failed in asr task #46

Open kaiser-ok opened 1 year ago

kaiser-ok commented 1 year ago

I tried to run the ASR task from the CLI, but it failed. Am I missing anything?

$ m4t_predict --model seamlessM4T_medium 16k.wav asr eng
2023-08-23 16:17:41,203 INFO -- m4t_scripts.predict.predict: Running inference on the GPU.
Using the cached checkpoint of the model 'seamlessM4T_medium'. Set force=True to download again.
Using the cached tokenizer of the model 'seamlessM4T_medium'. Set force=True to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set force=True to download again.
Traceback (most recent call last):
....
  File "/home/kaisermac/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kaisermac/miniconda3/lib/python3.11/site-packages/fairseq2/nn/transformer/relative_attention.py", line 293, in forward
    raise ValueError(
ValueError: The input sequence length must be less than or equal to the maximum sequence length (4096), but is 16272 instead.

cbalioglu commented 1 year ago

@kaiser-ok From the error message, it looks like the waveform you are feeding to the model exceeds the maximum sequence length once it is converted to log-mel filterbanks. Could you please try running it with a shorter audio file and see whether that fixes the problem?
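As a rough illustration of the "shorter audio" workaround, the sketch below estimates how many pieces the input would need to be cut into and splits a waveform accordingly. It assumes that the sequence length reported in the error grows linearly with audio duration, which the thread does not confirm. The helper names `min_chunks` and `split_samples` and the 10-second silent waveform are hypothetical and are not part of seamless_communication or fairseq2.

```python
import math
from typing import List, Sequence


def min_chunks(observed_len: int, max_len: int) -> int:
    """Smallest number of equal pieces such that each piece fits in max_len.

    Assumption: sequence length scales linearly with audio duration.
    """
    return math.ceil(observed_len / max_len)


def split_samples(samples: Sequence[float], num_chunks: int) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive, near-equal pieces."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunk_size = math.ceil(len(samples) / num_chunks)
    if chunk_size == 0:
        return []
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


# Values taken from the error message: observed length 16272, limit 4096.
n = min_chunks(observed_len=16272, max_len=4096)

# Hypothetical stand-in for decoded audio: 10 seconds of silence at 16 kHz.
waveform = [0.0] * (16_000 * 10)
chunks = split_samples(waveform, n)
print(n, [len(c) for c in chunks])
```

Each piece would then have to be saved as its own audio file and passed to the CLI separately. Cutting at fixed offsets can split a word in half, so cutting at pauses in the speech is likely to give better transcripts.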

tmclouisluk commented 1 year ago

@cbalioglu I got the same error as well. It looks like it is caused by the long audio file. Would it be possible to support long audio in the future?

hegc commented 1 year ago

So, how long can the audio be? I tested with a 1-minute 16 kHz WAV file and tgt_lang "cmn", and the result was very poor.