
facebook/mms-tts-deu speaks two voices at once, male and female #5498

Open · moseich opened this issue 1 month ago

moseich commented 1 month ago

What is your question?

I am experiencing an issue with the pretrained model facebook/mms-tts-deu. When generating speech, it sometimes alternates between a male and a female voice, which makes the output unclear. How can I resolve this and generate speech with a single, consistent voice?

Initially, I used the following code and generated audio on CPU:

from transformers import VitsModel, AutoTokenizer
import torch

def generate_audio(transcription, language):
    model_paths = {
        "en": "/home/igor/NEURALNETWORK/facebook_mms_tts_eng",
        "de": "/home/igor/NEURALNETWORK/facebook_mms_tts_deu",
    }
    model_path = model_paths.get(language)
    if model_path is None:
        raise ValueError(f"Unsupported language: {language}")

    # Load the model and tokenizer from the local checkpoint directory
    model = VitsModel.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Tokenize the input text and generate the waveform
    inputs = tokenizer(transcription, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs).waveform

    return output[0].numpy(), model.config.sampling_rate
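For reference, a minimal usage sketch of this function (the local model directories above are assumed to exist; scipy is only used here to write the file):

import scipy.io.wavfile

# Hypothetical call; "de" selects the local facebook_mms_tts_deu copy above
audio, sampling_rate = generate_audio("Hallo, wie geht es Ihnen heute?", "de")
scipy.io.wavfile.write("beispiel.wav", rate=sampling_rate, data=audio)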

The model generally produced good audio, but sometimes male and female voices alternated within a single output. I am attaching an example audio file here.
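As I understand it, VITS generation is stochastic: the model samples noise on every forward pass, so the timbre can drift between runs. Below is a minimal sketch of pinning that randomness down, assuming the set_seed helper and the noise_scale attribute exposed by the Transformers VITS implementation (attribute names to be verified against the installed version):

from transformers import VitsModel, AutoTokenizer, set_seed
import torch

model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-deu")

set_seed(555)            # fix the random draws so repeated runs match
model.noise_scale = 0.0  # assumption: removes the sampling variance in the flow

inputs = tokenizer("Hallo, wie geht es Ihnen heute?", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform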

Then, I conducted experiments on Google Colab and tried changing speaker IDs:

import torch
from transformers import VitsModel, AutoTokenizer
import scipy.io.wavfile
import os
from huggingface_hub import login

# Authenticate to Hugging Face
login("##############################")

# Load model and tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-deu")

# Check if the model supports multiple speakers
config = model.config
supports_speaker_id = hasattr(config, 'num_speakers') and config.num_speakers > 1

# Example text input
text = "Hallo, wie geht es Ihnen heute?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Set speaker_id if supported
if supports_speaker_id:
    speaker_id = torch.tensor([1])  # Example: 0 - male voice, 1 - female voice
    inputs["speaker_id"] = speaker_id

# Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform

# Ensure the output is in the right shape and format for audio playback
waveform = output.squeeze().numpy()

# Define the output path
output_dir = "/content/gdrive/MyDrive/TextToSpeech/TestSpeech"
output_path = os.path.join(output_dir, "output.wav")

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Save waveform as .wav file
scipy.io.wavfile.write(output_path, rate=model.config.sampling_rate, data=waveform)

print(f"Audio saved at: {output_path}")

# Playback the audio in the notebook
from IPython.display import Audio
Audio(waveform, rate=model.config.sampling_rate)

I tried setting the speaker ID to both 0 and 1, but the output was always a male voice; I could not produce a female voice. How can I control the voice in this model?
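From what I can tell, the MMS TTS checkpoints may simply be single-speaker models: if config.num_speakers is 1, the speaker_id branch above never runs, and the male/female drift would come from VITS sampling noise rather than from a speaker embedding. A quick check, using only the config attribute already read above:

from transformers import VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-deu")
# If this prints 1, there is no speaker embedding to select a voice with
print(model.config.num_speakers)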

What's your environment?

fairseq Version: 1.0
PyTorch Version: Tried different versions
OS: Linux
How you installed fairseq: pip
Build command you used (if compiling from source): N/A
Python version: Tried different versions (3.9, 3.10)
CUDA/cuDNN version: N/A (CPU usage)
GPU models and configuration: N/A (CPU usage)
Any other relevant information: None