facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

fairseq2 expects a float for sample rate #94

Closed qmeeus closed 10 months ago

qmeeus commented 10 months ago

In src/fairseq2/data/audio.py, the sample_rate field of AudioDecoderOutput, WaveformToFbankInput and WaveformToFbankOutput is typed as float rather than int.

This leads to an error when executing this code sample:

import torch
import soundfile as sf
from seamless_communication.models.inference import Translator

waveform, sample_rate = sf.read("any_audio_file.wav")
waveform = torch.from_numpy(waveform)
translator = Translator("seamlessM4T_medium", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"), dtype=torch.float32)
translated_text, *_ = translator.predict(waveform, "s2st", "fra")
# ValueError: The input sample rate must be of type `float`, but is of type `int` instead.

Although I think this is a mistake in the fairseq2 code base, I am reporting it here as well, since fairseq2 is listed as a dependency. The workaround, should they decide not to fix it, is to cast sample_rate to a float in this line

I have reported the issue in the fairseq2 repo as well.
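
For readers unfamiliar with this failure mode: the error above is what a strict type check on the sample rate would produce. The sketch below is a hypothetical pure-Python stand-in that mimics the reported message. `check_sample_rate_strict` is an illustrative name, not a fairseq2 function, and this is not fairseq2's actual implementation.

```python
def check_sample_rate_strict(sample_rate):
    """Hypothetical stand-in for a strict type check (not fairseq2 code)."""
    # An int such as 16000 is rejected even though it is a valid rate.
    if not isinstance(sample_rate, float):
        raise ValueError(
            "The input sample rate must be of type `float`, "
            f"but is of type `{type(sample_rate).__name__}` instead."
        )
    return sample_rate


# Audio readers commonly return the sample rate as a Python int.
sample_rate = 16000

# An explicit cast satisfies the strict check.
checked = check_sample_rate_strict(float(sample_rate))
```

Under this assumption, passing `sample_rate` without the cast raises the `ValueError` quoted in the code sample above.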

YashasviMantha commented 10 months ago

Interesting. Have you perhaps tried directly giving the path to the wav file to the predict method? Also, you can resample like this as well:

import torchaudio

resample_rate = 16000
waveform, sample_rate = torchaudio.load('./source.wav')
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
waveform = resampler(waveform)  # apply the transform to get 16 kHz audio

qmeeus commented 10 months ago

That is not really the point of the issue. What if I want to process the waveform before passing it to the model? The workaround that makes the example run is to pass a float value to predict, something along these lines:

translated_text, *_ = translator.predict(waveform, "s2st", "fra", sample_rate=float(sample_rate))

but this does not address the underlying problem.

Thank you for the suggestion

cbalioglu commented 10 months ago

(Copy pasting from fairseq2 issue)

@qmeeus thanks for the feedback! Please see #12. WaveformToFbankConverter now accepts both float and integer sample rates. Note, though, that float sample rates are legitimate and there are use cases for them. In fact, Kaldi (which torchaudio uses internally) accepts only floats as the sample rate. Hope #12 resolves your issue. I plan to release it as part of v0.1.1 tomorrow morning.
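
To illustrate the idea of accepting both integer and float sample rates, here is a minimal Python sketch. It is an assumption about what such a check could look like, not the change made in #12. `normalize_sample_rate` is a hypothetical helper name, and the positivity check is an extra assumption not mentioned in the thread.

```python
import numbers


def normalize_sample_rate(sample_rate):
    """Hypothetical helper: accept int or float rates, return a float."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sample_rate, bool) or not isinstance(sample_rate, numbers.Real):
        raise ValueError(
            "The input sample rate must be of type `int` or `float`, "
            f"but is of type `{type(sample_rate).__name__}` instead."
        )
    rate = float(sample_rate)
    # Assumption for this sketch: a sample rate must be positive.
    if rate <= 0:
        raise ValueError("The input sample rate must be positive.")
    return rate


int_rate = normalize_sample_rate(16000)      # integer input is converted
float_rate = normalize_sample_rate(22050.5)  # non-integer rates are preserved
```

Converting to float internally keeps non-integer rates intact while sparing callers the explicit cast.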