Open RobAley opened 2 years ago
This would be helpful for the community to create models for languages which are not currently supported. Some direction would be helpful on how to structure data/hours of audio needed/scripts to run for training/save model for future use.
No documentation yet. I'm currently rewriting the training code to be usable by people other than myself. It was written over the course of a year or more, with different experiments and dead ends left in. It definitely needs some cleanup!
The structure of the training data is very simple: currently just a pipe-delimited CSV file with two columns: (1) the path or name of the audio file, and (2) the text transcription. For example:
path/to/1.wav|This is a test.
path/to/2.wav|This is another test.
If you have multiple speakers, it becomes:
path/to/1.wav|speaker1|This is a test by speaker 1.
path/to/2.wav|speaker2|This is another test by speaker 2.
Eventually, I'd like to use the data format from Mimic Recording Studio. Audio files can be anything that librosa will load.
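For anyone preparing data before the training code is published, here is a minimal sketch of reading the metadata layout described above. It is not the project's own loader. The names `Utterance` and `read_metadata` are hypothetical, and the sketch assumes the file is UTF-8 text with `|` as the delimiter and no quoting, as in the examples.

```python
import csv
import io
from typing import Iterable, List, NamedTuple, Optional


class Utterance(NamedTuple):
    """One row of the metadata file (hypothetical helper type)."""
    audio_path: str
    speaker: Optional[str]  # None for single-speaker datasets
    text: str


def read_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse rows of the form path|text or path|speaker|text."""
    utterances = []
    # QUOTE_NONE keeps quotation marks in the transcription as literal text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) == 2:
            path, text = row
            speaker = None
        elif len(row) == 3:
            path, speaker, text = row
        else:
            raise ValueError(
                f"line {row_number}: expected 2 or 3 fields, got {len(row)}"
            )
        utterances.append(Utterance(path.strip(), speaker, text.strip()))
    return utterances


# Example with in-memory data; a real file would be opened with
# open("metadata.csv", encoding="utf-8", newline="") instead.
sample = io.StringIO("path/to/1.wav|speaker1|This is a test by speaker 1.\n")
for utterance in read_metadata(sample):
    print(utterance.audio_path, utterance.speaker, utterance.text)
```

Rejecting rows with an unexpected number of fields catches the most likely mistake early, namely a stray `|` inside a transcription.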
As for the amount of data, that depends on whether you're starting from scratch or reusing an existing model. From scratch, I've found that 3-5 hours will get you a good voice, but 10+ hours will usually make a great voice. What really matters is the recording quality and the phonetic diversity of what you read.
If you reuse an existing model, I've had as little as 30 minutes of data work using the Harvard Sentences. I'd recommend at least an hour, though.
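To check a dataset against the rough hour counts above, something like the following sketch could total the audio duration. It assumes the clips are uncompressed PCM WAV files, since it uses only Python's standard `wave` module; other formats would need a different reader. The function names are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path) -> float:
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_hours(audio_dir) -> float:
    """Sum the durations of all .wav files under audio_dir, in hours."""
    wav_paths = sorted(Path(audio_dir).rglob("*.wav"))
    return sum(wav_duration_seconds(p) for p in wav_paths) / 3600.0
```

For example, `total_hours("path/to")` would report how far a recording session is from the 3-5 hour range mentioned above. This measures file length only, so long silences count toward the total.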
Thanks for the quick reply. I am starting out to create a good-quality TTS for Hindi, so I'm gathering info on what is required for a good dataset.
1) Do the length of the audio and the number of words in it matter? I saw that the LJ Speech Dataset has 1-10 second audio clips. The audio files I currently have are 1-3 min in duration each (with a lot of repetition in words).
2) Should I create a fresh dataset (I have a studio available for recording) or split the existing audio into sentences? (I would probably have to do this manually, but it is doable.)
3) When creating a fresh dataset, is there any open dataset of Hindi text transcripts that I should use (similar to the Harvard Sentences, but in Hindi)?
A few of the questions above may be out of scope for this repo, but if you could help, that would be great.
Hi @inventionsbyhamid,
@synesthesiam do you have an update on getting the training code ready to use? I am interested in using it as well.
It would also be interesting to know which model you are using, or to have a reference to the paper.
@synesthesiam I'd also love to make a voice model (in English). From what you've said on this thread, I think I could get started, but I'm just wondering what I would need to do with the CSV file once I have it made. Or maybe I'm more wondering if you are getting close to finishing the cleanup of the training code, since I bet that would be easier to use than forging ahead alone. Or maybe better still, if Mimic Recording Studio is close to being ready to use for Mimic 3. Thanks for your work on this!
Hi,
Is there any documentation anywhere on how to train/create a new voice for this from e.g. audio collected by mimic-recording-studio?
Many thanks
Rob