facebookresearch / fairseq2

FAIR Sequence Modeling Toolkit 2
https://facebookresearch.github.io/fairseq2/
MIT License

sample_rate defined as float in data/audio.py #9

Closed: qmeeus closed this issue 10 months ago

qmeeus commented 10 months ago

In src/fairseq2/data/audio.py, sample_rate is defined as a float rather than an integer in AudioDecoderOutput, WaveformToFbankInput and WaveformToFbankOutput.

I think this might be an error, since git history shows that it used to be an integer. In every other library I know (espnet, fairseq, soundfile, librosa, torchaudio, etc.), the sample rate is assumed to be an integer, as it should be, since it is the number of frames per second, which cannot be a non-integer.

Here is an example of a problem that could occur (and that I have personally experienced):

import torch
import soundfile as sf
from seamless_communication.models.inference import Translator

waveform, sample_rate = sf.read("any_audio_file.wav")
waveform = torch.from_numpy(waveform)
translator = Translator("seamlessM4T_medium", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"), dtype=torch.float32)
translated_text, *_ = translator.predict(waveform, "s2st", "fra")
# ValueError: The input sample rate must be of type `float`, but is of type `int` instead.

cbalioglu commented 10 months ago

@qmeeus thanks for the feedback! Please see #12. WaveformToFbankConverter now accepts both float and integer sample rates. Note, though, that float sample rates are legitimate and do have real use cases. In fact, Kaldi (which torchaudio uses internally) accepts only floats as the sample rate. I hope #12 resolves your issue; I plan to release it as part of v0.1.1 tomorrow morning.
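
For illustration only, accepting both types can be sketched roughly as below. This is a minimal sketch and not the change made in #12: normalize_sample_rate is a hypothetical helper that does not exist in fairseq2, and the wording of its error message is an assumption modeled on the one quoted above. It takes an int (as soundfile returns) or a float and hands back a float for downstream code that expects one.

```python
import math


def normalize_sample_rate(sample_rate):
    """Return sample_rate as a float, accepting int or float input.

    Hypothetical helper for illustration; not part of fairseq2.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sample_rate, bool) or not isinstance(sample_rate, (int, float)):
        raise ValueError(
            "The input sample rate must be of type `int` or `float`, "
            f"but is of type `{type(sample_rate).__name__}` instead."
        )

    # Reject NaN, infinity, zero and negative values.
    if not math.isfinite(sample_rate) or sample_rate <= 0:
        raise ValueError("The input sample rate must be a positive, finite number.")

    return float(sample_rate)


if __name__ == "__main__":
    # An integer rate, such as the one soundfile.read() returns, is converted.
    print(normalize_sample_rate(16000))
    # A float rate passes through unchanged.
    print(normalize_sample_rate(22050.0))
```

Converting at the boundary keeps the internal representation as a float, which matches the Kaldi-style interface mentioned above, while callers can pass whichever type their audio library produces.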

qmeeus commented 10 months ago

@cbalioglu Thank you for addressing this, and for your explanation!