What is your question?
I am experiencing an issue with the pretrained neural network facebook/mms-tts-deu. When generating speech, it sometimes alternates between male and female voices, making the output unclear. How can I resolve this issue and generate speech with a single, consistent voice?
Initially, I used the following code and generated audio on CPU:
from transformers import VitsModel, AutoTokenizer
import torch
def generate_audio(transcription, language):
    model_paths = {
        "en": "/home/igor/NEURALNETWORK/facebook_mms_tts_eng",
        "de": "/home/igor/NEURALNETWORK/facebook_mms_tts_deu"
    }
    model_path = model_paths.get(language)
    # Loading model and tokenizer
    model = VitsModel.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Example text
    text = transcription
    # Tokenizing input and generating waveform
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs).waveform
    return output[0].numpy(), model.config.sampling_rate
The audio quality is good, but male and female voices sometimes alternate within the generated speech. I will attach an example audio file.
Then, I conducted experiments on Google Colab and tried changing speaker IDs:
import torch
from transformers import VitsModel, AutoTokenizer
import scipy.io.wavfile
import os
from huggingface_hub import login
# Authenticate to Hugging Face
login("##############################")
# Load model and tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-deu")
# Check if the model supports multiple speakers
config = model.config
supports_speaker_id = hasattr(config, 'num_speakers') and config.num_speakers > 1
# Example text input
text = "Hallo, wie geht es Ihnen heute?"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt")
# Set speaker_id if supported
if supports_speaker_id:
    speaker_id = torch.tensor([1]) # Example: 0 - male voice, 1 - female voice
    inputs["speaker_id"] = speaker_id
# Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform
# Ensure the output is in the right shape and format for audio playback
waveform = output.squeeze().numpy()
# Define the output path
output_dir = "/content/gdrive/MyDrive/TextToSpeech/TestSpeech"
output_path = os.path.join(output_dir, "output.wav")
# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)
# Save waveform as .wav file
scipy.io.wavfile.write(output_path, rate=model.config.sampling_rate, data=waveform)
print(f"Audio saved at: {output_path}")
# Playback the audio in the notebook
from IPython.display import Audio
Audio(waveform, rate=model.config.sampling_rate)
I tried speaker IDs 0 and 1, but the output was always a male voice and I could not get a female voice. How can I control the voice in this model?
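For reference, here is a minimal sketch of one thing that may be worth trying, not a confirmed fix. It rests on two assumptions: that VITS inference is stochastic, so fixing the random seed with `torch.manual_seed` before each call makes the output repeatable, and that a checkpoint whose config reports `num_speakers` of 1 has no speaker embedding to select, which would explain why changing the speaker ID had no effect. The helper name `synthesize` and its `seed` parameter are hypothetical; `model` and `tokenizer` are expected to be the `VitsModel` and `AutoTokenizer` objects loaded as above.

```python
import torch


def synthesize(model, tokenizer, text, seed=0, speaker_id=None):
    """Generate a waveform with a fixed random seed (hypothetical helper).

    Returns the waveform as a NumPy array together with the sampling rate.
    """
    inputs = dict(tokenizer(text, return_tensors="pt"))

    # Only forward speaker_id when the config reports more than one speaker;
    # for a single-speaker checkpoint the argument is dropped.
    num_speakers = getattr(model.config, "num_speakers", 1)
    if speaker_id is not None and num_speakers > 1:
        inputs["speaker_id"] = speaker_id

    # Re-seed before every call so repeated calls with the same text and
    # seed draw the same random numbers during generation.
    torch.manual_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform
    return waveform[0].numpy(), model.config.sampling_rate
```

With the model and tokenizer from the Colab script, `audio, rate = synthesize(model, tokenizer, "Hallo, wie geht es Ihnen heute?", seed=555)` would return the same waveform on every call with the same seed. Whether a given seed yields the desired voice still has to be checked by listening.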
What's your environment?
fairseq Version: 1.0
PyTorch Version: Tried different versions
OS: Linux
How you installed fairseq: pip
Build command you used (if compiling from source): N/A
Python version: Tried different versions (3.9, 3.10)
CUDA/cuDNN version: N/A (CPU usage)
GPU models and configuration: N/A (CPU usage)
Any other relevant information: None