facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Wrong result for traditional Chinese #362

Closed fumin closed 4 months ago

fumin commented 4 months ago

I have the following 5-second audio clip. It is attached as a video because GitHub does not support uploading audio; you can extract the audio with ffmpeg -i short.mp4 -vn short.wav:


It is in traditional Chinese.

When I use the following code in Google Colaboratory:

model_name = "seamlessM4T_v2_large"
vocoder_name = "vocoder_v2"

translator = seamless_communication.inference.Translator(

in_file = "short.wav"
tgt_lang = "cmn_Hant"
tgt_lang = "cmn"

text_output, _ = translator.predict(

I get 一九三八年组织队在南川山头, which is wrong.

However, the Hugging Face demo gives 院长文勇委员反渗透法完成删读了, which is correct.

How is it that using PyTorch gives wrong results? Is it because Chinese support is not very good yet?

avidale commented 4 months ago

Hi! The code for the huggingface demo looks mostly like yours, but it uses audio preprocessing: https://huggingface.co/spaces/facebook/seamless-m4t-v2-large/blob/main/app.py#L74

An important part of that preprocessing is resampling. Seamless expects a sample rate of 16000 Hz, but your ffmpeg command produces audio with 48000 samples per second.

If you resample the input to 16 kHz, the output will be the same as in the HF demo.
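As a minimal sketch of that advice, the snippet below checks the sample rate stored in a WAV header and builds an ffmpeg command that resamples while extracting the audio. The helper names `wav_sample_rate` and `ffmpeg_resample_command` are hypothetical and not part of seamless_communication; `-vn` and `-ar` are standard ffmpeg options. It assumes the extracted file is a PCM WAV that Python's `wave` module can read.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate Seamless expects, per the comment above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) recorded in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def ffmpeg_resample_command(
    src: str, dst: str, sample_rate: int = TARGET_SAMPLE_RATE
) -> list[str]:
    """Build an ffmpeg argument list that drops the video and resamples the audio."""
    return [
        "ffmpeg",
        "-i", src,
        "-vn",                    # discard the video stream
        "-ar", str(sample_rate),  # set the output audio sample rate
        dst,
    ]
```

One possible use is to pass the list to `subprocess.run(ffmpeg_resample_command("short.mp4", "short.wav"), check=True)` and then confirm that `wav_sample_rate("short.wav")` equals `TARGET_SAMPLE_RATE` before handing the file to the translator.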

fumin commented 4 months ago


Thanks for your suggestion. However, using a 16 kHz input results in the following error: ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.

Worryingly, using the sample provided by seamless itself

wget https://dl.fbaipublicfiles.com/seamlessM4T/LJ037-0171_sr16k.wav -O LJ_eng.wav

results in the same 48000 sample error.

I am a bit confused. I wonder if you can reproduce this error with the same LJ_eng.wav official sample?

The following is the full error trace:


ValueError                                Traceback (most recent call last)

<ipython-input-12-bf9336534118> in <cell line: 7>()
      6 preprocessed = preprocess_audio(in_file)
----> 7 text_output, _ = translator.predict(
      8     input=in_file,
      9     task_str="asr",

1 frames

/usr/local/lib/python3.10/dist-packages/seamless_communication/inference/translator.py in predict(self, input, task_str, tgt_lang, src_lang, text_generation_opts, unit_generation_opts, spkr, sample_rate, unit_generation_ngram_filtering, duration_factor, prosody_encoder_input, src_text)
    291                     "format": -1,
    292                 }
--> 293             src = self.collate(self.convert_to_fbank(decoded_audio))["fbank"]
    294         else:
    295             if src_lang is None:

ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.

avidale commented 4 months ago

Your error is a result of a weird behavior of WaveformToFbankConverter in Fairseq2 (I have just reported it in https://github.com/facebookresearch/fairseq2/issues/341).

To work around it, feed the translator only input waveforms that share the same sample rate. Or, if that is a problem, you can re-initialize its fbank converter before running the translation:

from fairseq2.data.audio import WaveformToFbankConverter

translator.convert_to_fbank = WaveformToFbankConverter(

text_output, _ = translator.predict( ... # now do whatever you wanted with the translation
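The thread does not spell out what the converter does internally. One reading that fits both the error message and the workaround is that the converter remembers the sample rate of the first waveform it processes and rejects any later input with a different rate. The toy class below illustrates that assumed behavior only; `StickyRateConverter` is a hypothetical stand-in and is not fairseq2's WaveformToFbankConverter.

```python
class StickyRateConverter:
    """Toy stand-in (NOT the fairseq2 class) that locks onto the first sample rate."""

    def __init__(self) -> None:
        self._sample_rate = None  # unset until the first call

    def __call__(self, audio: dict) -> dict:
        rate = audio["sample_rate"]
        if self._sample_rate is None:
            self._sample_rate = rate  # the first input fixes the expected rate
        elif rate != self._sample_rate:
            raise ValueError(
                f"The input waveform must have a sample rate of {self._sample_rate}, "
                f"but has a sample rate of {rate} instead."
            )
        return {"num_samples": len(audio["waveform"]), "sample_rate": rate}


converter = StickyRateConverter()
converter({"waveform": [0.0] * 48, "sample_rate": 48_000})  # locks onto 48000

try:
    converter({"waveform": [0.0] * 16, "sample_rate": 16_000})
    rejected = False
except ValueError:
    rejected = True  # the 16000 Hz input is refused after a 48000 Hz input

# Re-initializing discards the remembered rate, which mirrors the workaround above.
converter = StickyRateConverter()
accepted = converter({"waveform": [0.0] * 16, "sample_rate": 16_000})
```

Under this assumption, either option in the workaround avoids the error: a single consistent sample rate never trips the check, and a fresh converter has no remembered rate to conflict with.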

fumin commented 4 months ago

@avidale Excellent, applying your example runs the translation successfully! Thanks!

This way, I can continue my project, which processes data from the Legislative Yuan (roughly the equivalent of the US Congress). I do notice that Seamless output is pretty high quality, but there's still a lot of room for improvement in this particular case of "government speech" + "Chinese".

For example, although the output 院长文勇委员反渗透法完成删读了 is 90% correct, it still gets the last word wrong: the model outputs "删读", but the speaker actually said "三读". The wrong word means "cancel", whereas the correct word means that a law has been officially released, voted on, and passed. The two meanings are very different, and the difference matters a great deal in this government context.
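To put a rough number on "mostly correct", a character error rate (CER) can be computed from the edit distance between the reference and the model output. The sketch below is a plain Levenshtein implementation and not a Seamless evaluation tool. It assumes the reference transcript differs from the output only in the one word discussed above, which the thread does not confirm.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


hypothesis = "院长文勇委员反渗透法完成删读了"  # model output quoted in the thread
# ASSUMPTION: everything except the last word is transcribed correctly.
reference = hypothesis.replace("删读", "三读")
cer = character_error_rate(reference, hypothesis)  # one substitution in 15 characters
```

Under that assumption the CER is 1/15, or about 7%. The metric weights every character equally, so it says nothing about how much a single wrong character changes the meaning, which is the concern raised here.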

Looks like we might need to do some fine-tuning for this domain. I noticed you are an expert, and I will be reading your great posts such as https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865

In the meantime, this seamless tutorial https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb unfortunately doesn't touch on the training part, so I wonder if you could point to some relevant materials?

Again, thanks for your help in resolving this issue. It would be great if you could advise on the training side, too.

avidale commented 4 months ago

> applying your example runs the translation successfully!

Great! I will consider the issue closed then :-)

> I do notice that seamless output is pretty high quality, but there's still a lot of room for improvement in this particular case of "government speech" + "Chinese".

If I understood correctly, your task is purely speech recognition (the input and output are always in the same language). Is this correct? If so, the task differs from speech translation, for which the Seamless models are heavily optimized. So indeed, fine-tuning them solely on the ASR task would most likely bring quality improvements.

Currently, we don't have a well-developed tutorial on Seamless fine-tuning, but we have an example script that does perform it: https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/finetune. So please try it out on your in-domain data, if there is some! And if there are any questions or problems, don't hesitate to open new issues.