erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Option to set training data/testing data percentages. #73

Closed: Urammar closed this issue 9 months ago

Urammar commented 9 months ago

Currently the software automatically divides the audio samples provided into a dataset, and then allocates some of that dataset for training and some of it for testing.

The problem here is that, while sub-optimal, the reality is that the number of voice samples of the target often isn't very high (e.g. a character from media without many minutes of spoken dialogue).

In such cases, it would be useful to be able to throw more samples at training, rather than testing. This will likely cause problems in training, of course, but might be the better option in rare cases.

erew123 commented 9 months ago

All the actual samples generated in Step 1 (Whisper splitting the original sample) are passed into training and used in Step 2 (the actual training). It's just that, for the voice generation at the end (Step 3), you need something longer than 6 seconds to properly generate TTS (it wants a 6+ second sample). So what's actually occurring at Step 3 is that all voice samples shorter than 7 seconds (just to be sure) are not displayed, or copied over alongside the model (Step 4/what to do next), as those shorter clips would be useless to put in your "voices" folder. I hope that makes sense; even I had to read it twice and I wrote it.
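Purely as an illustration of that Step 3 filtering (the function name, folder layout, and 7-second constant below are hypothetical, not the actual finetune code), the idea amounts to something like:

```python
# Hypothetical sketch: keep only reference clips long enough (>= 7 s) to be
# useful as a "voices" sample. Not the project's actual implementation.
import wave
from pathlib import Path

MIN_SECONDS = 7.0  # clips shorter than this are skipped

def usable_voice_clips(wav_dir: str) -> list[Path]:
    keep = []
    for wav_path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        if duration >= MIN_SECONDS:
            keep.append(wav_path)
    return keep
```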

On a more general point of having more/longer voice samples at the end, a few people have told me (so anecdotal) that the Whisper 2 model splits sentences better, both in how it cuts the sample WAVs and in their overall length. I've not had enough time on my hands yet to fully test this, however I have made a note in the documentation for finetuning: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-finetuning-a-model

As a side note, many people seem to think that the Whisper v2 model (used in Step 1) gives better results at generating training datasets, so you may prefer to try that, as opposed to the Whisper 3 model.

You may wish to try the Whisper 2 model (another 3GB download) and see how that fares for you.

erew123 commented 9 months ago

As we've passed another message or two, I'm assuming I'll be OK to close this. But feel free to reply if you want to know something else on this ticket's topic. Thanks

Urammar commented 9 months ago

But feel free to reply if you want to know something else on this ticket's topic. Thanks

You've misunderstood the ticket, sorry. I'm not talking about Whisper incorrectly cutting up voice samples; I'm talking about those samples simply not existing. A minor character in a single episode of a TV show, for instance, who never performed again.

The computer in Star Trek is another example. There exist only a few minutes of the computer's voice lines throughout the whole show, and only a fraction of those are clean, without sirens and whatnot going on in the background.

Now, the actual problem.


Whisper breaks down the voice samples, yes, but it populates those samples into two separate datasets.

The training dataset, and the evaluation dataset.

These two sets are intentionally kept separate, as otherwise the model can just train to the test, so to speak, ultimately only being able to reproduce those exact samples as it overfits.

The training and eval datasets are saved in finetune->tmp-trn as CSV files, and are not cross-pollinated.

This behavior is absolutely correct for large voice sample sizes, but for something like a video game character that only has a few minutes of spoken lines at most, it can cause problems, as you have insufficient training material. The ability to lower the number of clips dedicated to evaluation and raise the number used for actual training would be a welcome feature.

In addition, it would allow you to quickly and automatically add all possible voice clips to training as a final run, so overfitting is minimized but training data is maximized.
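For discussion's sake, here is a minimal sketch of what such a configurable split could look like. The pipe-delimited metadata format, the output file names, and the eval_percentage parameter are assumptions for illustration only, not the project's actual implementation:

```python
# Hypothetical configurable train/eval split, assuming a Coqui-style
# pipe-delimited metadata list (audio_file|text|speaker_name).
import csv
import random
from pathlib import Path

def split_metadata(metadata_path: str, out_dir: str, eval_percentage: float = 15.0) -> None:
    with open(metadata_path, encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="|"))
    random.shuffle(rows)

    # Reserve at least one clip for evaluation, the rest go to training.
    n_eval = max(1, int(len(rows) * eval_percentage / 100))
    eval_rows, train_rows = rows[:n_eval], rows[n_eval:]

    out = Path(out_dir)
    for name, subset in (("metadata_eval.csv", eval_rows), ("metadata_train.csv", train_rows)):
        with open(out / name, "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="|").writerows(subset)
```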

erew123 commented 9 months ago

Ah, sorry, yes I did misunderstand your question. So you're on about the ratio of the % split between evaluation and training data. It's currently set at 15% for evaluation, and the remaining 85% is therefore set as training data. I can push a setting into the interface to allow you to adjust that, if I'm now getting your question right?
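To put rough numbers on that (the clip count here is an invented example): with 40 Whisper-generated clips, a 15% evaluation share reserves 6 clips for evaluation and leaves 34 for training, whereas dropping evaluation to 5% would leave 38 clips for training. For very small datasets that difference can matter, which is the trade-off such a setting would expose.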

erew123 commented 9 months ago

You want me to introduce...

[image: screenshot of the proposed training/evaluation split setting]

correct?

Urammar commented 9 months ago

YES! Exactly that!

erew123 commented 9 months ago

Do a git pull... it should be there!

Will also confirm at the prompt when you train:

[image: screenshot of the confirmation shown at the training prompt]

Urammar commented 9 months ago

Absolute legend.