coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] num_speakers set in python but is 0 in generated config.json #2058

Closed Ca-ressemble-a-du-fake closed 1 year ago

Ca-ressemble-a-du-fake commented 2 years ago

Describe the bug

Hi,

I am following the multi-speaker training documentation for the VITS model, so I added these lines to my recipe:

# init speaker manager for multi-speaker training
# it maps speaker-id to speaker-name in the model and data-loader
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.num_speakers = speaker_manager.num_speakers

I also checked the value of config.num_speakers, which was correct (4 speakers), but the generated config.json shows "num_speakers": 0 where it should be 4. Moreover, when synthesizing speech at the end, no speakers are listed.

To Reproduce

Take a VITS recipe and configure 4 datasets in LJSpeech layout, with a custom formatter that reads the speaker name from a metadata column:

import os


def caraduf(root_path, meta_file, **kwargs):  # pylint: disable=unused-argument
    """Normalizes an LJSpeech-style metadata file to TTS format, with the speaker name read from the 2nd column
    https://keithito.com/LJ-Speech-Dataset/"""
    txt_file = os.path.join(root_path, meta_file)
    items = []
    with open(txt_file, "r", encoding="utf-8") as ttf:
        for line in ttf:
            # Expected row layout: file_id|speaker_name|text
            cols = line.split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[2]
            speaker_name = cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name})
    return items

Add a speaker manager, print the reported number of speakers found (it should be 4), then launch the training.
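The speaker labels produced by the formatter can also be sanity-checked without TTS itself. The sketch below is not part of the original report: `count_speakers` is a hypothetical helper, the sample entries are made up, and the only assumption is the sample layout used by the formatter (a list of dicts with a `speaker_name` key).

```python
from collections import Counter


def count_speakers(samples):
    """Return a Counter mapping each speaker_name to its number of samples."""
    return Counter(sample["speaker_name"] for sample in samples)


# Made-up samples in the same layout the formatter returns (illustration only).
samples = [
    {"text": "hello", "audio_file": "wavs/a.wav", "speaker_name": "spk1"},
    {"text": "world", "audio_file": "wavs/b.wav", "speaker_name": "spk2"},
    {"text": "again", "audio_file": "wavs/c.wav", "speaker_name": "spk1"},
]

counts = count_speakers(samples)
print(f"{len(counts)} distinct speakers: {dict(counts)}")
```

If the number of distinct speakers printed here differs from the value reported by the speaker manager, the problem is in the formatter or the metadata files rather than in how the config is saved.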

Navigate to the output path and open the generated config.json. Scroll down to the num_speakers field: it is 0 instead of 4.
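Rather than scrolling, the check can be scripted. This is a minimal sketch using only the standard library, under the assumption that the field may appear at more than one nesting level of config.json. `find_key` is a hypothetical helper, and the inline JSON (including the nested `model_args` entry and its values) is a stand-in for illustration, not the content of the file from this report.

```python
import json


def find_key(node, key, path=""):
    """Yield (path, value) for every occurrence of `key` in nested dicts and lists."""
    if isinstance(node, dict):
        for name, value in node.items():
            child = f"{path}.{name}" if path else name
            if name == key:
                yield child, value
            yield from find_key(value, key, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_key(value, key, f"{path}[{index}]")


# Stand-in for the generated file; to inspect a real run, load it with
# json.load(open(<path to config.json>, encoding="utf-8")) instead.
example = json.loads('{"num_speakers": 0, "model_args": {"num_speakers": 4}}')

hits = list(find_key(example, "num_speakers"))
for location, value in hits:
    print(f"{location} = {value}")
```

Listing every occurrence makes it clear whether the value set in the recipe was dropped entirely or simply ended up under a different key than the one being inspected.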

Expected behavior

num_speakers in the generated config.json file should match the value set in the Python file (i.e. the recipe).

Otherwise, if config.num_speakers (in Python) is only used for some prior computation and discarded afterwards, this should be stated somewhere, for example: "X speakers detected, but setting num_speakers to 0 in the generated config.json because [e.g. you need to provide a d-vector file]".

Logs

No response

Environment

TTS version 0.8.0

Additional context

No response

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

erogol commented 1 year ago

I can't replicate the issue using the VCTK glow-tts recipe with the latest version. It shows 109 speakers in the resulting config.json, as it should.