Hi! The code for the Hugging Face demo looks mostly like yours, but it adds an audio preprocessing step: https://huggingface.co/spaces/facebook/seamless-m4t-v2-large/blob/main/app.py#L74
An important part of that preprocessing is resampling. Seamless expects a 16000 Hz sample rate, but running your ffmpeg command produces audio with 48000 samples per second.
If you resample the input to 16 kHz, the output will be the same as in the HF demo.
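For example, here is a minimal resampling sketch with torchaudio (the file names are placeholders; the same effect can be achieved by adding -ar 16000 to the ffmpeg command):

import torchaudio

# Load the 48 kHz audio that ffmpeg produced.
waveform, sr = torchaudio.load("input_48k.wav")  # sr == 48000

# Resample to the 16 kHz that Seamless expects, then save.
waveform_16k = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("input_16k.wav", waveform_16k, sample_rate=16000)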
@avidale
Thanks for your suggestion. However, using a 16 kHz input results in the following error: ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.
Worryingly, using the sample provided by Seamless itself,
wget https://dl.fbaipublicfiles.com/seamlessM4T/LJ037-0171_sr16k.wav -O LJ_eng.wav
results in the same 48000 sample-rate error.
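The file itself really is 16 kHz; its metadata can be checked with torchaudio, for example:

import torchaudio

# Inspect the downloaded sample's metadata; sample_rate should print 16000.
print(torchaudio.info("LJ_eng.wav").sample_rate)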
I am a bit confused. I wonder if you can reproduce this error with the same official LJ_eng.wav sample?
The following is the full error trace:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-bf9336534118> in <cell line: 7>()
      5
      6 preprocessed = preprocess_audio(in_file)
----> 7 text_output, _ = translator.predict(
      8     input=in_file,
      9     task_str="asr",

1 frames
/usr/local/lib/python3.10/dist-packages/seamless_communication/inference/translator.py in predict(self, input, task_str, tgt_lang, src_lang, text_generation_opts, unit_generation_opts, spkr, sample_rate, unit_generation_ngram_filtering, duration_factor, prosody_encoder_input, src_text)
    291                 "format": -1,
    292             }
--> 293             src = self.collate(self.convert_to_fbank(decoded_audio))["fbank"]
    294         else:
    295             if src_lang is None:

ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.
Your error is the result of a weird behavior of WaveformToFbankConverter in fairseq2 (I have just reported it in https://github.com/facebookresearch/fairseq2/issues/341). To work around it, you should feed the translator only input waveforms of the same sample rate. Or, if that is a problem, you can re-initialize its fbank converter before running the translation:
from fairseq2.data.audio import WaveformToFbankConverter

# Replace the translator's fbank converter with a fresh instance, so it no
# longer insists on the sample rate of the first waveform it processed.
translator.convert_to_fbank = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=translator.device,
    dtype=translator.dtype,
)

text_output, _ = translator.predict( ... )  # now do whatever you wanted with the translation
@avidale Excellent, applying your example runs the translation successfully! Thanks!
This way, I can continue my project, which processes data from the Legislative Yuan (Taiwan's rough equivalent of the US Congress). I do notice that Seamless output is pretty high quality, but there's still a lot of room for improvement in this particular case of "government speech" + "Chinese".
For example, although the output 院长文勇委员反渗透法完成删读了
is 90% correct, it still gets the last word wrong: it should be "三读". The wrong word means "cancel", but the correct word means a law has been officially read, voted on, and passed. These meanings are very different, and the difference is quite significant in this government context.
Looks like we might need to do some fine-tuning for this domain. I noticed you are an expert, and I will be reading your great posts, such as https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865
In the meantime, the Seamless tutorial https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb unfortunately doesn't touch on the training part, so I wonder if you could point me to some relevant materials?
Again, thanks for your help in resolving this issue, and it would be great if you could advise on the training side, too.
applying your example runs the translation successfully!
Great! I will consider the issue closed then :-)
I do notice that Seamless output is pretty high quality, but there's still a lot of room for improvement in this particular case of "government speech" + "Chinese".
If I understood correctly, your task is purely speech recognition (the input and output are always in the same language). Is that right? If so, the task is different from speech translation, towards which the Seamless models are heavily optimized. So indeed, fine-tuning them solely on the ASR task would most likely bring quality improvements.
Currently, we don't have a well-developed tutorial on Seamless fine-tuning, but we do have an example script that performs it: https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/finetune. So please try it out on your in-domain data, if you have some! And if there are any questions or problems, don't hesitate to open new issues.
I have the following 5-second audio (it's a video because silly GitHub does not support uploading audio; you can extract the audio with
ffmpeg -i short.mp4 -vn short.wav
): https://github.com/facebookresearch/seamless_communication/assets/765222/da7bd8e9-3a9a-409d-9cfa-a2918a247e93
It is in traditional Chinese.
When I use the following code in Google Colaboratory, I get
一九三八年组织队在南川山头
, which is wrong. However, the Hugging Face demo gives
院长文勇委员反渗透法完成删读了
, which is correct. How is it that using PyTorch gives wrong results? Is it because Chinese support is not very good yet?
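For reference, a minimal sketch of the kind of Colab invocation in question, following the repository README (the model and vocoder names and tgt_lang="cmn" for Mandarin are assumptions here, not the exact snippet from the notebook):

import torch
from seamless_communication.inference import Translator

# Build a translator with the standard v2 checkpoints (assumed, as in the README).
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    torch.device("cuda:0"),
    torch.float16,
)

# Run ASR on the extracted audio; "cmn" is the SeamlessM4T code for Mandarin.
text_output, _ = translator.predict(
    input="short.wav",
    task_str="asr",
    tgt_lang="cmn",
)
print(text_output[0])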