coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Voices for VCTK-VITS don't match the speaker metadata in VCTK corpus's speaker-info.txt #2258

Closed xenotropic closed 1 year ago

xenotropic commented 1 year ago

Describe the bug

The VCTK-VITS model contains 109 voices. If, after installing TTS, I run tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs, then tts prints a list of these voices, with index numbers and identifiers of the form "p225" through "p376".

These appear to come from the VCTK corpus. The only place I could find them listed (at least without downloading a 10G corpus) was this speaker-info.txt file, which you can see with more context here. It belongs to the VCTK corpus (there are superseding publications, but none has a separate speaker-info.txt file), and its identifiers run from 225 to 376.

But . . . the metadata in that speaker-info.txt doesn't seem to match the TTS output. I wrote a quick script to generate a sample for each speaker. The voices are all different, but they match the speaker-info.txt metadata only occasionally, rarely enough that it looks like chance.

To Reproduce

Run

tts --model_name "tts_models/en/vctk/vits" --out_path "vctk-vits-p233.wav" --speaker_idx "p233" --text "Hi I'm speaker number 233. I am female. My age is 23. My accent is English from Staffordshire."

The metadata given in the text parameter is based on the speaker-info.txt in the VCTK paper linked above.
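As a rough illustration of how such a test sentence could be built from one row of speaker-info.txt, here is a minimal Python sketch. It assumes the file uses whitespace-separated columns in the order ID, AGE, GENDER, ACCENTS, REGION; that layout is an assumption, and the helper names parse_speaker_line and describe are hypothetical, not part of TTS.

```python
def parse_speaker_line(line):
    """Parse one row of speaker-info.txt into a dict.

    Assumed column order: ID AGE GENDER ACCENTS REGION.
    REGION may be missing or contain several words.
    Returns None for the header row or for malformed rows.
    """
    parts = line.split()
    if len(parts) < 4 or not parts[0].isdigit() or not parts[1].isdigit():
        return None
    return {
        "id": "p" + parts[0],  # the model lists speakers as "p225", "p233", ...
        "age": int(parts[1]),
        "gender": {"F": "female", "M": "male"}.get(parts[2], parts[2]),
        "accent": parts[3],
        "region": " ".join(parts[4:]),
    }


def describe(info):
    """Build the sentence the voice should speak about itself."""
    text = (
        f"Hi I'm speaker number {info['id'][1:]}. "
        f"I am {info['gender']}. "
        f"My age is {info['age']}. "
        f"My accent is {info['accent']}"
    )
    if info["region"]:
        text += f" from {info['region']}"
    return text + "."
```

Passing the row for speaker 233 through describe gives the sentence used in the --text parameter above, so any mismatch between the claimed and the audible voice is easy to hear.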

Expected behavior

I would expect a female voice speaking the text with an English accent. Instead it is a male voice speaking with what seems like an Indian accent. Many others don't match.

Logs

No response

Environment

TTS 0.10.1 from pip 
torch Version: 1.13.1
Ubuntu 22.04.01

Additional context

It's quite possible I've just got the wrong metadata file -- maybe they reorganized the corpus later on, in a way that was used when the VITS model was trained. But I can't find an easy way to get that information without downloading the 10G corpus, which is quite slow. What I really want to do here is just document the voices in TTS's usage of VCTK-VITS, so that one can know the meta-information about any given voice and easily find a desired accent type for use with tts. So what I'm looking for is a speaker-info.txt that lines up with the voices that TTS uses with VCTK-VITS. Thanks!

erogol commented 1 year ago

With that model, the speaker names got shuffled due to a bug at the time. There is no easy fix, since we would need to know how they were mixed up. So until we have a new model, it is what it is. I'm closing this since there is nothing we can do now, but keeping the issue around is still helpful for people who hit the same problem. Thanks for reporting it.

tissatussa commented 1 year ago

[Can I comment on a closed issue?] Hi, this issue helped me to use more voices! I read that speaker-info.txt, and with your command line I created .wav files of some of those voices to hear how they sound.

Now I'm wondering: in the meantime, did you figure out the real naming and info? Does an accurate list exist? I'm not familiar with these techniques yet; can you point me to resources on the voices?

xenotropic commented 1 year ago

I think the answer is "no", because the information was destroyed. The closest you can get is to write a script that generates a speech sample for every speaker and then classify the voices again by ear (by gender, region, etc.). See discussion at https://github.com/coqui-ai/TTS/discussions/1891#discussioncomment-5648259
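A minimal sketch of such a script, in Python, might look like the following. It only uses the tts command-line flags already shown in this issue (--model_name, --out_path, --speaker_idx, --text); the helper names, the output directory "samples", and the test sentence are hypothetical. The speaker IDs are passed in by the caller, for example copied from the --list_speaker_idxs output. As written it prints the commands instead of running them.

```python
import shlex
from pathlib import Path

MODEL = "tts_models/en/vctk/vits"


def build_command(speaker_id, text, out_dir="samples"):
    """Return the argv list for one tts invocation (one speaker, one wav file)."""
    out_path = Path(out_dir) / f"vctk-vits-{speaker_id}.wav"
    return [
        "tts",
        "--model_name", MODEL,
        "--out_path", str(out_path),
        "--speaker_idx", speaker_id,
        "--text", text,
    ]


def build_commands(speaker_ids, text, out_dir="samples"):
    """Build one command per speaker ID, all speaking the same text."""
    return [build_command(s, text, out_dir) for s in speaker_ids]


if __name__ == "__main__":
    # Dry run: print the commands for three IDs mentioned in this issue.
    # Replace the list with the full output of --list_speaker_idxs.
    for cmd in build_commands(["p225", "p233", "p376"], "This is a test sentence."):
        print(shlex.join(cmd))
```

To actually generate the files, create the output directory first and pass each command to subprocess.run(cmd, check=True). Using the same sentence for every speaker makes the samples easier to compare by ear.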