Documentation for voice training?

RobAley commented 2 years ago

Hi,

Is there any documentation anywhere on how to train/create a new voice for this from e.g. audio collected by mimic-recording-studio?

Many thanks

Rob

inventionsbyhamid commented 2 years ago

This would be helpful for the community to create models for languages which are not currently supported. Some direction on how to structure the data, how many hours of audio are needed, which scripts to run for training, and how to save the model for future use would be helpful.

synesthesiam commented 2 years ago

No documentation yet. I'm currently rewriting the training code to be usable by people other than myself. It was written over the course of a year or more, with different experiments and dead ends left in. It definitely needs some cleanup!

The structure of the training data is very simple, currently just a pipe-delimited CSV file with two columns: (1) the path or name of the audio file, and (2) the text transcription. For example:

path/to/1.wav|This is a test.
path/to/2.wav|This is another test.

If you have multiple speakers, it becomes:

path/to/1.wav|speaker1|This is a test by speaker 1.
path/to/2.wav|speaker2|This is another test by speaker 2.

Eventually, I'd like to use the data format from Mimic Recording Studio. Audio files can be anything that librosa will load.
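To make the two layouts above concrete, here is a minimal sketch of a reader for them. It is not the actual training code: the names Utterance and read_metadata are made up for illustration, and it assumes the delimiter is always "|" and that the text itself never contains one.

```python
import csv
from typing import Iterable, List, NamedTuple, Optional


class Utterance(NamedTuple):
    """One row of the metadata file (hypothetical helper type)."""
    audio_path: str
    speaker: Optional[str]  # None for single-speaker datasets
    text: str


def read_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse rows shaped like path|text or path|speaker|text."""
    utterances = []
    # QUOTE_NONE keeps quotation marks in the transcription untouched.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == 2:
            path, text = row
            speaker = None
        elif len(row) == 3:
            path, speaker, text = row
            speaker = speaker.strip()
        else:
            raise ValueError(f"Expected 2 or 3 columns, got {len(row)}: {row}")
        utterances.append(Utterance(path.strip(), speaker, text.strip()))
    return utterances
```

An open file works as the lines argument, e.g. read_metadata(open("metadata.csv", encoding="utf-8")), where metadata.csv is a placeholder file name.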

As for the amount of data, that depends on whether you're starting from scratch or reusing an existing model. From scratch, I've found that 3-5 hours will get you a good voice, but 10+ will usually make a great voice. What really matters is the recording quality and phonetic diversity of what you read.

If you reuse an existing model, I've had as little as 30 minutes of data work using the Harvard Sentences. I'd recommend at least an hour, though.
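If you want to check a dataset against those numbers, a rough sketch like the following adds up the recorded time. It assumes the clips are uncompressed PCM WAV files, since it uses Python's standard wave module; other formats would need a different loader. The names wav_seconds and total_hours are made up for illustration.

```python
import wave
from typing import Iterable


def wav_seconds(path) -> float:
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_hours(paths: Iterable) -> float:
    """Total duration of all the given WAV files, in hours."""
    return sum(wav_seconds(path) for path in paths) / 3600.0
```

For example, total_hours(pathlib.Path("wavs").glob("*.wav")) would report the hours in a hypothetical wavs directory.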

inventionsbyhamid commented 2 years ago

Thanks for the quick reply. I am setting out to create a good-quality TTS voice for Hindi, so I am gathering info on what is required for a good dataset.

1) Does the length of the audio clips and the number of words in them matter? I saw that the LJ Speech Dataset has 1-10 second audio clips. The audio files I currently have are 1-3 min in duration each (with a lot of repetition in words).
2) Should I create a fresh dataset (I have a studio available for recording) or split the existing audio into sentences? (I would maybe have to do this manually, but it is doable.)
3) While creating a fresh dataset, is there any open dataset of Hindi text transcripts that I should use? (Similar to the Harvard Sentences, but in Hindi?)

A few of the questions above may be out of scope for this repo, but if you could help, it would be great.

synesthesiam commented 2 years ago

Hi @inventionsbyhamid,

The length of audio does matter. Each clip should only be a sentence or two, and ideally you would include clips with one or two words as well.
It's possible to get help splitting the audio with tools like aeneas and finetuneas.
I don't know of any text dataset like that for Hindi. If you plan to use eSpeak for phonemization, I may be able to help create one with your assistance. I typically take sentences from the Oscar corpus and use a simple algorithm to create a phonetically balanced subset. I need help from native speakers, though, to figure out if the sentences make any sense :)
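As a rough illustration of what picking a phonetically balanced subset can look like, here is a greedy coverage sketch. It is not necessarily the algorithm mentioned above, just one simple approach: keep picking the sentence that adds the most sound units not yet covered. The function names are made up, and char_bigrams is only a crude stand-in that uses letter pairs; a real version would pass in a units_of function that returns phonemes (for example from eSpeak output).

```python
from typing import Callable, Iterable, List, Set


def char_bigrams(sentence: str) -> Set[str]:
    """Crude stand-in for phonemization: adjacent letter pairs per word."""
    units: Set[str] = set()
    for word in sentence.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        units.update(a + b for a, b in zip(letters, letters[1:]))
    return units


def select_balanced(
    sentences: Iterable[str],
    max_sentences: int,
    units_of: Callable[[str], Set[str]] = char_bigrams,
) -> List[str]:
    """Greedily pick sentences that add the most uncovered units."""
    remaining = [(sentence, units_of(sentence)) for sentence in sentences]
    covered: Set[str] = set()
    chosen: List[str] = []
    while remaining and len(chosen) < max_sentences:
        # Sentence contributing the most units we have not seen yet.
        best = max(remaining, key=lambda item: len(item[1] - covered))
        if not best[1] - covered:
            break  # nothing left adds new coverage
        chosen.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return chosen
```

The selection only measures coverage, so it cannot tell whether a sentence is meaningful; that part still needs a native speaker to review the result.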

jyapayne commented 2 years ago

@synesthesiam do you have an update on getting the training code ready to use? I am interested in using it as well.

lumpidu commented 1 year ago

It would also be interesting to know which model you are using, or to have a reference to the paper.

fivestones commented 1 year ago

@synesthesiam I'd also love to make a voice model (in English). From what you've said on this thread, I think I could get started, but I'm wondering what I would need to do with the CSV file once I have made it. Or maybe I'm wondering more whether you are getting close to finishing the cleanup of the training code, since I bet that would be easier to use than forging ahead alone. Or, better still, whether Mimic Recording Studio is close to being ready to use for Mimic 3. Thanks for your work on this!

MycroftAI / mimic3-voices

Documentation for voice training? #2